Intrusion detection is a hot topic in the world of cyber security. Especially with artificial intelligence and machine learning as the new buzzwords, it can be difficult for outsiders to know what to expect from these techniques. In this article we explain the fundamentals of anomaly detection. In particular, we discuss the strengths and limitations of such techniques, how far we can go towards full automation, and what you should keep in mind when using anomaly detection.
The main goal of many intrusion (or anomaly) detection systems is to discover activity in data (a.k.a. events) that stands "out of the ordinary" or is strange and unexpected. Of course, we can get very philosophical about the true definition of an anomaly, but let us focus on some examples from practice. Event data is commonly found in areas such as healthcare, finance, security, telecommunications, mobility, and many more. Examples where anomaly detection turned out to be valuable for the latter two will be discussed later. First it is important to understand that there exist three types of anomalies that we can discover in data, namely: point, contextual, and collective anomalies.
Point anomalies are outliers that are strange with respect to your entire data collection. Imagine we have the login history of an employee, Bob, who always logs in to the company network from the office in the Netherlands. If, five minutes after such a login, Bob suddenly logs in from Uganda, this is strange with respect to his entire history.
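Bob's impossible login can be caught with a very simple rule. The sketch below is a minimal illustration of point-anomaly detection; the field names and events are hypothetical, not from a real system.

```python
def is_point_anomaly(history, login):
    """Flag a login whose country never appears anywhere in the user's history."""
    seen_countries = {event["country"] for event in history}
    return login["country"] not in seen_countries

# Bob's entire history: logins from the Netherlands only.
bob_history = [{"user": "bob", "country": "NL"} for _ in range(200)]

print(is_point_anomaly(bob_history, {"user": "bob", "country": "UG"}))  # → True
print(is_point_anomaly(bob_history, {"user": "bob", "country": "NL"}))  # → False
```

The key property of a point anomaly is visible here: the check compares the event against the *whole* collection, without any grouping or context.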
Contextual anomalies are outliers that are strange with respect to a subset of the data. Suppose Bob accesses contracts A and B in the following order: ABABABAB… The order of the events seems fairly regular. However, if we group the data, for instance by the customer for which Bob is accessing these contracts, we see that he usually accesses contract A when working with customer A and contract B when working with customer B. Accessing contract B for customer A is therefore uncommon, and is considered anomalous with respect to that customer.
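The same idea in code: only after grouping by customer does the rare access stand out. This is a minimal sketch; the events and the rarity threshold are illustrative assumptions.

```python
from collections import Counter, defaultdict

def contextual_anomalies(events, min_share=0.2):
    """Flag (customer, contract) pairs whose share *within that customer's
    events* falls below min_share (an illustrative threshold)."""
    per_customer = defaultdict(Counter)
    for customer, contract in events:
        per_customer[customer][contract] += 1
    flagged = []
    for customer, counts in per_customer.items():
        total = sum(counts.values())
        for contract, n in counts.items():
            if n / total < min_share:
                flagged.append((customer, contract))
    return flagged

# Bob mostly accesses contract A for customer A; one access of B is the odd one out.
events = [("custA", "A")] * 9 + [("custA", "B")] + [("custB", "B")] * 10
print(contextual_anomalies(events))  # → [('custA', 'B')]
```

Note that contract B is perfectly normal in the data set as a whole; it only becomes anomalous inside the context of customer A.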
“We need to assist automated techniques with domain knowledge in order to avoid generating too many alerts.”
Although point anomalies are relatively easy to discover, the main challenge for fully automated solutions is finding the right split for the detection of contextual anomalies. If we have a data set with 100 columns, there are 2^100 possible data splits in which we can look for anomalies! How should an algorithm know which features (or combinations of features) are more interesting than others? This is where human insight becomes vital.
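The size of that search space is easy to verify: every subset of columns defines one possible context to split on, and a set of n columns has 2^n subsets.

```python
from itertools import combinations

# 100 columns → 2^100 possible column subsets (contexts) to inspect.
print(2 ** 100)  # → 1267650600228229401496703205376

# Even for just 10 columns, enumerating every subset already gives 1024 views:
small = [f"col{i}" for i in range(10)]
views = [c for r in range(len(small) + 1) for c in combinations(small, r)]
print(len(views))  # → 1024
```

Exhaustively checking every context is clearly infeasible, which is exactly why domain knowledge about *which* splits matter is so valuable.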
Finally, collective anomalies are outliers that are not strange by themselves but become strange when they happen together. In a combustion engine, for example, events such as "open gas valve" and "light fire" are not uncommon, but the order in which they happen matters a lot.
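A simple way to sketch this is to check consecutive event pairs against the orderings we expect. The allowed pairs here encode domain knowledge supplied by a human, which are assumptions for this example rather than learned rules.

```python
def collective_anomalies(sequence, allowed_pairs):
    """Flag consecutive event pairs that violate the expected order."""
    return [
        (a, b) for a, b in zip(sequence, sequence[1:])
        if (a, b) not in allowed_pairs
    ]

# Domain knowledge: the valve must open *before* the fire is lit.
allowed = {("open gas valve", "light fire")}

print(collective_anomalies(["open gas valve", "light fire"], allowed))  # → []
print(collective_anomalies(["light fire", "open gas valve"], allowed))
# → [('light fire', 'open gas valve')]
```

Each individual event is common; only the combination (here, the ordering) makes the second sequence anomalous.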
The contextual anomaly problem shows that discovering anomalies is, in general, not difficult at all: we can always find a viewpoint from which the data looks abnormal. We need to assist automated techniques with domain knowledge in order to avoid generating too many alerts. Besides context, in practice there are many other challenges to tackle before we can reliably discover the anomalies of interest.
Data bias: For anomaly detection to work, we need a notion of what regular behaviour of a system looks like. This requires a "training" phase in which the system absorbs all the activity that, for instance, Bob performs, in order to build an understanding of his daily way of working. This rests on the assumption that the data observed during the training period is representative of Bob's profile. If the training period is too short, we have too few data points to draw a reasonable conclusion (i.e. overgeneralization). Too long a period, however, increases the risk that abnormal behaviour is also captured in the "profile" of normal behaviour.
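Such a training phase can be as simple as learning how often each action occurs, then scoring new events by how rare they were during training. This is a deliberately minimal sketch; the action names are hypothetical, and it assumes the training window really is representative.

```python
from collections import Counter

def train_profile(training_events):
    """Learn the relative frequency of each action during the training window."""
    counts = Counter(training_events)
    total = len(training_events)
    return {action: n / total for action, n in counts.items()}

def rarity(profile, action):
    """0.0 = constant during training, 1.0 = never seen during training."""
    return 1.0 - profile.get(action, 0.0)

# Bob's (hypothetical) training window of routine actions.
profile = train_profile(["login", "read_mail", "login", "edit_doc"] * 50)

print(rarity(profile, "login"))          # → 0.5 (a routine action)
print(rarity(profile, "drop_database"))  # → 1.0 (never observed in training)
```

Both failure modes from the paragraph above are visible here: too short a window and the frequencies are noise; too long a window and any anomalous activity Bob performed gets baked into the profile as "normal".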
“We as humans are still invaluable when it comes to spotting anomalies and building models of expectation.”
False positives: As with all technology, anomaly detection is not perfect. Although it is perfectly valid for Bob to log in from the United States while on a business trip, the system may not recognize this as normal. Such an event may be unfairly marked as an outlier; this is referred to as a false positive. Analogously, there is the class of false negatives: the times the anomaly detector does not trigger on something it should have.
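Given a set of alerts and ground-truth labels (which in practice require an analyst to establish), counting these error classes is straightforward. The labels below are invented for illustration.

```python
def confusion_counts(alerts, truths):
    """alerts/truths: parallel lists of booleans (was it flagged? was it truly anomalous?)."""
    tp = sum(a and t for a, t in zip(alerts, truths))
    fp = sum(a and not t for a, t in zip(alerts, truths))           # false alarms
    fn = sum(t and not a for a, t in zip(alerts, truths))           # missed anomalies
    tn = sum((not a) and (not t) for a, t in zip(alerts, truths))
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

# Bob's US login was flagged (alert) but legitimate (not truly anomalous):
# that is the false positive in position 0.
alerts = [True, False, True, False]
truths = [False, False, True, True]
print(confusion_counts(alerts, truths))  # → {'tp': 1, 'fp': 1, 'fn': 1, 'tn': 1}
```

Tuning a detector is a trade-off between these two error classes: raising the alerting threshold reduces false positives but risks more false negatives.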
Concept drift: Suppose that our employee Bob is promoted to a position in which he needs to travel frequently to the United States. The profile we once built for him has become outdated, and if we keep using it, his legitimate new behaviour will keep triggering false positives. Throwing away his old profile and retraining is always an option, but be aware that this can be costly if data is scarce or training phases are long.
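A common alternative to discarding the profile is to let old observations decay gradually, so the profile follows Bob's new routine without throwing away his history. The sketch below uses an exponentially weighted update; the event names and the learning rate are illustrative assumptions.

```python
def update_profile(profile, event, alpha=0.05):
    """Shrink all old weights by (1 - alpha) and give the new event weight alpha.
    alpha is an illustrative learning rate: higher adapts faster, forgets faster."""
    decayed = {k: v * (1 - alpha) for k, v in profile.items()}
    decayed[event] = decayed.get(event, 0.0) + alpha
    return decayed

# Bob's old profile: he only ever logged in from the Netherlands.
profile = {"login_NL": 1.0}

# After his promotion, Bob logs in from the United States repeatedly.
for _ in range(100):
    profile = update_profile(profile, "login_US")

print(profile["login_US"] > profile["login_NL"])  # → True: the profile has drifted along
```

The decay rate is itself a trade-off: adapt too quickly and an attacker can "train" the system to accept malicious behaviour; adapt too slowly and legitimate changes like Bob's keep raising alerts.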
We as humans are still invaluable when it comes to spotting anomalies and building models of expectation. Detection techniques can be significantly improved by enabling users to incorporate their insights. In the end, finding anomalies is not difficult; finding the ones that matter is the challenge.