Data mining and similar endeavors must analyze massive data sets generated by electronic information handling systems. One of the objectives of such endeavors may be to sift a high volume of existing data records or a stream of incoming records to flag those records that differ in some significant manner from the rest; that is, to identify any records that are anomalous when compared to other records in the dataset. Such anomalous records may also be called outliers. Data records may have a number of other names in various contexts, such as entries, files, messages, or packets.
Identifying anomalous records may be useful in a number of situations. An outlier in a communications network may indicate an attempted intrusion of the network. Credit-card purchases of expensive items in a short time period may indicate theft of the card. Unusual financial transactions may indicate money laundering. Sudden excessive temperatures in a building may suggest failure of the building's heating system. Consistently increasing size measurements of a manufactured product may point to cutting-tool wear. Anomalies are not necessarily harmful. A sudden increase in newspaper sales or Web-site accesses may indicate a breaking story.
Detecting anomalies differs from detecting clusters; these are not in general merely complementary tasks. The goal of cluster detection is to find sets of records that are similar to each other and not as similar to the rest of the records. Clusters of records are crisp when the similarity of close neighbors is much higher than their similarity to other records. Clusters are ill-defined when many pairwise similarities are high, and there is little distinction between nearest neighbors and other records. On the other hand, the goal of anomaly detection is to identify outlier records that are far away from other records in a dataset, whether or not those records display clusters. Well-defined anomalies show a clear distinction between how distant they lie from other records and how distant the other records are from each other. Anomalies are less well-defined when most of the pairwise distances lie in the same range, and the highest distance is not much larger than that range.
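The distinction drawn above, between how far an anomaly lies from other records and how far those records lie from each other, can be illustrated with a minimal sketch. The function name `kth_neighbor_distance`, the choice of Euclidean distance, and the sample points are assumptions made for illustration only; they are not taken from the text.

```python
import math

def kth_neighbor_distance(points, k=1):
    """For each point, return the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Illustrative, made-up records: four close neighbors and one distant record
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = kth_neighbor_distance(points, k=1)

# The record with the largest neighbor distance is the strongest outlier candidate
outlier_index = max(range(len(points)), key=scores.__getitem__)
```

Here the four nearby records each have a nearest-neighbor distance of 1.0, while the fifth lies more than twelve units from anything else, so the anomaly is well-defined. If all scores fell in roughly the same range, no record would stand out clearly.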
The simplest kind of anomaly is a deviation from a constant value of a single established norm, as in the case of cutting-tool wear. Detecting such anomalies does not generally require complex algorithms or sophisticated measures. Problems increase when the norm is multi-modal, or when some of the modes are not previously known. In some scenarios, the modes may be time dependent; increasing traffic is not unexpected during a rush hour, yet it may be anomalous at other times.
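As a rough sketch of these two cases, the first function below checks measurements against a single constant norm, and the second checks a value against a norm that depends on the hour of day. The function names, the tolerance, and all sample values are hypothetical and chosen only to illustrate the idea.

```python
def flag_deviations(measurements, nominal, tolerance):
    """Return indices of measurements farther than `tolerance` from `nominal`."""
    return [i for i, m in enumerate(measurements) if abs(m - nominal) > tolerance]

def is_unexpected(value, hour, expected_ranges):
    """Check a value against the (low, high) range expected for that hour."""
    low, high = expected_ranges[hour]
    return not (low <= value <= high)

# Made-up part sizes drifting upward, as tool wear might cause
sizes = [10.00, 10.01, 9.99, 10.02, 10.06, 10.09]
flagged = flag_deviations(sizes, nominal=10.0, tolerance=0.05)

# Made-up traffic ranges: the same count is normal at 8:00 but not at 14:00
expected_ranges = {8: (800, 1500), 14: (200, 600)}
rush_hour_ok = not is_unexpected(1200, 8, expected_ranges)
midday_flagged = is_unexpected(1200, 14, expected_ranges)
```

The second function presumes that the time-dependent modes are already known. When some modes are not known in advance, no such lookup table exists, which is where the difficulty described above begins.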
Detection of anomalies also becomes harder when the data records have multiple features. Some anomalies may not exhibit out-of-the-ordinary behavior in any individual feature. For example, a height of 5 feet 7 inches and a weight of 80 pounds are not unusual separately, but they are anomalous when occurring together in the same person. Also, different features may not be normalizable to the same scale; is a 5-year age difference comparable to a difference of $20,000 in annual income or not? Further, features might not even have numerical values; automobiles may come in categories such as red, blue, black, and green.
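One common way to expose a joint anomaly of the height-and-weight kind is the Mahalanobis distance, which accounts for the correlation between features. The sketch below is only an illustration of that general technique, not a method described in the text. The sample population is invented, and the helper name `mahalanobis_2d` is hypothetical.

```python
import math
from statistics import mean, stdev

def mahalanobis_2d(point, xs, ys):
    """Mahalanobis distance of `point` from the two-feature sample (xs, ys)."""
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    # Sample variances and covariance
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form d^T S^-1 d using the closed-form inverse of a 2x2 matrix
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

# Invented sample: heights in inches, weights in pounds, children through adults
heights = [48, 52, 55, 60, 64, 67, 70, 72]
weights = [55, 70, 80, 100, 125, 150, 170, 185]

record = (67, 80)  # 5 feet 7 inches, 80 pounds
z_height = (record[0] - mean(heights)) / stdev(heights)
z_weight = (record[1] - mean(weights)) / stdev(weights)
joint_distance = mahalanobis_2d(record, heights, weights)
```

In this invented sample, each feature of the record lies within one standard deviation of its own mean, yet the joint distance is several times larger, because the combination contradicts the strong height-weight correlation. The approach still presumes numerical features, so it does not address categorical values such as colors.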
Models have been employed to detect anomalies or outliers in datasets. This approach, however, requires an explicit supervised training phase, and may require training sets free of outliers. Neural networks of several known types are available for this purpose. Regression models, possibly including basis functions, have been employed. Probabilistic models, perhaps including conditional probabilities, generally require a training set free of outliers. Bayesian networks may aggregate information from different variables to model causal dependencies among different properties of an event or record, may also incorporate external knowledge, or may create anomaly patterns along with normal patterns of properties. Pseudo-Bayes estimators may reduce false-alarm rates. Support-vector machines are learning machines capable of binary classification by hyperplanes, and may function in an unsupervised setting.
Clustering-based detection techniques find anomalies as a byproduct of the clustering algorithm. Although they need not be supervised and may operate in an incremental mode, such techniques are not optimized for finding outliers, and they assume that the normal data points are far more numerous than the anomalous ones. In addition, they are computationally intensive, requiring pairwise distances between all data points.
Distance-based schemes employ some type of defined distance to measure similarity among data records. Schemes that measure pairwise distances are computationally intensive. Some perform poorly if the data has regions of differing density. When the data has a large number of features, the records are necessarily sparse in the high-dimensional space, so that distance becomes less meaningful.
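The loss of meaning in high-dimensional distances can be observed with a small experiment. The sketch below assumes uniformly random points in a unit hypercube and Euclidean distance, neither of which comes from the text. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance; the function name `relative_contrast` is hypothetical.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (max - min) / min over distances from one query point to random points."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

low_dim_contrast = relative_contrast(2)
high_dim_contrast = relative_contrast(1000)
```

Under these assumptions the contrast in two dimensions is typically large, while in a thousand dimensions the nearest and farthest points lie at nearly the same distance. A threshold on distance then separates outliers from ordinary records much less clearly.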