1. Field of the Invention
This invention relates to systems and methods for detecting anomalies in the operation of a computer system, and more particularly to a method of unsupervised anomaly detection.
2. Background
Intrusion detection systems (IDSs) are an integral part of any complete security package of a modern, well-managed network system. The most widely deployed and commercially available methods for intrusion detection employ signature-based detection. These methods extract features from various audit streams, and detect intrusions by comparing the feature values to a set of attack signatures provided by human experts. Such methods can only detect previously known intrusions, since only these intrusions have a corresponding signature. The signature database has to be manually revised for each new type of attack that is discovered, and until that revision is made, systems remain vulnerable to these attacks.
Due to the limitations of signature-based detection, development has proceeded on two major approaches, or paradigms, for training data mining-based intrusion detection systems: misuse detection and anomaly detection. In misuse detection approaches, each instance in a set of data is labeled as normal or intrusion and a machine-learning algorithm is trained over the labeled data. For example, the MADAM/ID system, as described in W. Lee, S. J. Stolfo, and K. Mok, “Data Mining in Work Flow Environments: Experiences in Intrusion Detection,” Proceedings of the 1999 Conference on Knowledge Discovery and Data Mining (KDD-99), 1999, extracts features from network connections and builds detection models over connection records that represent a summary of the traffic from a given network connection. The detection models are generalized rules that classify the data with the values of the extracted features. These approaches have the advantage of being able to automatically retrain intrusion detection models on different input data that include new types of attacks.
Traditional anomaly detection approaches build models of normal data and detect deviations from the normal model in observed data. Anomaly detection applied to intrusion detection and computer security has been an active area of research since it was originally proposed by Denning (see, e.g., D. E. Denning, “An Intrusion Detection Model,” IEEE Transactions on Software Engineering, SE-13:222-232, 1987). Anomaly detection algorithms have the advantage that they can detect new types of intrusions, because these new intrusions, by assumption, will deviate from normal usage (see, e.g., D. E. Denning, “An Intrusion Detection Model,” cited above, and H. S. Javitz and A. Valdes, “The NIDES Statistical Component: Description and Justification,” Technical Report, Computer Science Laboratory, SRI International, 1993). In this problem, given a set of normal data to train from, and given a new piece of data, the goal of the algorithm is to determine whether that piece of data is “normal” or is an “anomaly.” The notion of “normal” depends on the specific application, but without loss of generality, normal means stemming from the same distribution. An assumption is made that the normal and anomalous data are created using two different probability distributions and are quantitatively different because of the differences between their distributions. This problem is referred to as supervised anomaly detection.
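By way of illustration only, the supervised anomaly detection problem described above may be sketched as follows. The sketch assumes numeric feature vectors, a per-feature Gaussian model of normal data, and a z-score threshold of 3.0; the function names, the feature values, and the threshold are all hypothetical and are not taken from any of the cited systems.

```python
from statistics import mean, pstdev

def fit_normal_model(normal_rows):
    """Estimate a (mean, standard deviation) pair per feature from normal-only data."""
    columns = zip(*normal_rows)
    # Fall back to 1.0 when a feature is constant, to avoid dividing by zero.
    return [(mean(col), pstdev(col) or 1.0) for col in columns]

def anomaly_score(model, row):
    """Largest absolute z-score of any feature: how far the row deviates from normal."""
    return max(abs(x - mu) / sigma for x, (mu, sigma) in zip(row, model))

def is_anomaly(model, row, threshold=3.0):
    """Flag a row whose deviation from the normal model exceeds the (assumed) threshold."""
    return anomaly_score(model, row) > threshold

# Hypothetical training set: (connections per minute, bytes per connection), all normal.
normal_rows = [(10, 200), (12, 210), (11, 190), (9, 205), (13, 195)]
model = fit_normal_model(normal_rows)
```

Under these assumptions, a record close to the training data, such as (11, 202), is treated as normal, while a record far from it, such as (60, 5000), is flagged. The sketch also makes the dependence on clean training data visible: any intrusion included in normal_rows would shift the estimated mean and widen the standard deviation.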
Some supervised anomaly detection systems may be considered to perform “generative modeling.” These approaches build some kind of a model over the normal data and then check to see how well new data fits into that model. A survey of these techniques is given in, e.g., Christina Warrender, Stephanie Forrest, and Barak Pearlmutter, “Detecting Intrusions Using System Calls: Alternative Data Models,” 1999 IEEE Symposium on Security and Privacy, pages 133-145. IEEE Computer Society, 1999. One approach uses a prediction model obtained by training decision trees over normal data (see, e.g., W. Lee and S. J. Stolfo, “Data Mining Approaches For Intrusion Detection,” Proceedings of the 1998 USENIX Security Symposium, 1998), while another one uses neural networks to obtain the model (see, e.g., A. Ghosh and A. Schwartzbard, “A Study in Using Neural Networks For Anomaly and Misuse Detection,” Proceedings of the 8th USENIX Security Symposium, 1999). Ensemble-based approaches are presented in, e.g., W. Fan and S. Stolfo, “Ensemble-Based Adaptive Intrusion Detection,” Proceedings of 2002 SIAM International Conference on Data Mining, Arlington, Va., 2002. Recent works such as, e.g., Nong Ye, “A Markov Chain Model of Temporal Behavior for Anomaly Detection,” Proceedings of the 2000 IEEE Systems, Man, and Cybernetics Information Assurance and Security Workshop, 2000, and Eleazar Eskin, Wenke Lee, and Salvatore J. Stolfo, “Modeling System Calls For Intrusion Detection With Dynamic Window Sizes,” Proceedings of DARPA Information Survivability Conference and Exposition II (DISCEX II), Anaheim, Calif., 2001, estimate parameters of a probabilistic model over the normal data and compute how well new data fits into the model.
A limitation of supervised anomaly detection algorithms is that they require a set of purely normal data from which they train their model. If the data contains some intrusions buried within the training data, the algorithm may not detect future instances of these attacks because it will assume that they are normal. However, in practice, labeled or purely normal data may not be readily available. Consequently, the use of the traditional data mining-based approaches may be impractical. Generally, this approach may require large volumes of audit data, and thus it may be prohibitively expensive to classify data manually. It is possible to obtain labeled data by simulating intrusions, but the detection system trained under such simulations may be limited to the set of known attacks that were simulated and new types of attacks occurring in the future would not be reflected in the training data. Even with manual classification, this approach is still limited to identifying only the known (at classification time) types of attacks, thus restricting detection to identifying only those types. In addition, if raw data were collected from a network environment, it is difficult to guarantee that there are no attacks during the time in which the data is collected.
Due to the limitations of traditional anomaly detection, a third paradigm of intrusion detection algorithms has been developed to address these problems: unsupervised anomaly detection (also known as “anomaly detection over noisy data”), as described in greater detail in E. Eskin, “Anomaly Detection Over Noisy Data Using Learned Probability Distributions,” Proceedings of the International Conference on Machine Learning, 2000. These algorithms take as input a set of unlabeled data and attempt to find intrusions buried within the data. In the unsupervised anomaly detection problem, the algorithm uses a set of data where it is unknown which are the normal elements and which are the anomalous elements. The goal is to recover the anomalous elements. The model that is computed and that identifies anomalies may be used to detect anomalies in new data, e.g., for online detection of anomalies in network traffic. Alternatively, after these anomalies or intrusions are detected and/or removed, a misuse detection algorithm or a traditional anomaly detection algorithm may be trained over the cleaned data.
In practice, unsupervised anomaly detection has many advantages over supervised anomaly detection. One advantage is that it does not require a purely normal training set. Unsupervised anomaly detection algorithms can be performed over unlabeled data, which is typically easier to obtain since it is simply raw audit data collected from a system. In addition, unsupervised anomaly detection algorithms can be used to analyze historical data for forensic purposes. Furthermore, an auditable system can generate data for use in a variety of detection tasks, including network packet data, operating system data, file system data, registry data, program instruction data, middleware application trace data, network management data such as management information base data, email traffic data, and so forth.
A previous approach to unsupervised anomaly detection involves building probabilistic models from the training data and then using them to determine whether a given network data instance is an anomaly or not, as discussed in greater detail in E. Eskin, “Anomaly Detection Over Noisy Data Using Learned Probability Distributions” (cited above). In this algorithm, a mixture model for explaining the presence of anomalies is presented, and machine-learning techniques are used to estimate the probability distributions of the mixture to detect the anomalies.
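By way of illustration only, one way a mixture model of this general kind could be set up is sketched below. The sketch is not the algorithm of the cited reference. It assumes categorical data, a majority distribution estimated by add-one smoothed frequencies, a uniform anomalous distribution, a mixing weight lam of 0.05, and a log-likelihood gain threshold c of 1.0; all of these choices, and all names used, are hypothetical.

```python
import math
from collections import Counter

def log_likelihood(majority, anomalies, alphabet, lam):
    """Log-likelihood of the data under the mixture (1 - lam) * M + lam * A."""
    counts = Counter(majority)
    total = len(majority) + len(alphabet)  # denominator for add-one smoothing
    ll = len(majority) * math.log(1 - lam)
    ll += sum(n * math.log((n + 1) / total) for n in counts.values())
    # Anomalous distribution A is assumed uniform over the observed alphabet.
    ll += len(anomalies) * (math.log(lam) + math.log(1 / len(alphabet)))
    return ll

def detect_anomalies(values, lam=0.05, c=1.0):
    """Move an element to the anomaly set when doing so raises the log-likelihood by more than c."""
    alphabet = sorted(set(values))
    majority, anomalies = list(values), []
    for x in values:
        base = log_likelihood(majority, anomalies, alphabet, lam)
        trial_majority = list(majority)
        trial_majority.remove(x)  # take one copy of x out of the majority set
        moved = log_likelihood(trial_majority, anomalies + [x], alphabet, lam)
        if moved - base > c:
            majority, anomalies = trial_majority, anomalies + [x]
    return anomalies

# Hypothetical unlabeled data: two common request types and one rare event.
values = ["GET"] * 500 + ["POST"] * 300 + ["XMAS_SCAN"]
found = detect_anomalies(values)
```

Under these assumptions, only the rare element is moved to the anomaly set, because removing it from the majority set improves the fit of the majority distribution by more than the cost of explaining it with the anomalous distribution.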
Another approach to anomaly detection uses distance-based outliers, and is discussed in greater detail in Edwin M. Knorr and Raymond T. Ng, “Algorithms For Mining Distance-Based Outliers in Large Datasets,” Proc. 24th Int. Conf. Very Large Data Bases, VLDB, pages 392-403, 24-27, 1998; Edwin M. Knorr and Raymond T. Ng, “Finding Intensional Knowledge of Distance-Based Outliers,” The VLDB Journal, pages 211-222, 1999; and Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jorg Sander, “LOF: Identifying Density-Based Local Outliers,” ACM SIGMOD Int. Conf. on Management of Data, pages 93-104, 2000. These approaches examine inter-point distances between instances in the data to determine which points are outliers. However, these approaches were not used in the field of intrusion detection, and therefore the analysis described in these references was not applied to detect intrusions.
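By way of illustration only, a distance-based outlier test of the kind examined in these references may be sketched as follows. The sketch assumes Euclidean distance and a naive pairwise comparison, and it flags a point when at least a fraction p of the other points lie farther than a distance d from it. The function name, the sample points, and the parameter values are hypothetical, and the optimized algorithms of the cited references are not reproduced.

```python
import math

def distance_based_outliers(points, p, d):
    """Return indices of points for which at least fraction p of the other points lie farther than d."""
    outliers = []
    for i, a in enumerate(points):
        # Count the other points whose Euclidean distance from point i exceeds d.
        far = sum(1 for j, b in enumerate(points) if j != i and math.dist(a, b) > d)
        if far >= p * (len(points) - 1):
            outliers.append(i)
    return outliers

# Hypothetical data: four nearby points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
isolated = distance_based_outliers(points, p=0.9, d=3.0)
```

With these values, only the isolated point at index 4 is returned, since every other point is farther than 3.0 from it, while each of the clustered points has three close neighbors.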
A limitation of these approaches is derived from the nature of the outlier data. Often in network data, the same intrusion occurs multiple times. Consequently, there may be many similar instances in the data. Accordingly, a system which looks at the distances between data points may fail to detect several repeated intrusions as anomalies due to the relatively short distances between the data representing the multiple intrusions.
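By way of illustration only, this limitation may be reproduced with a small sketch. It assumes a simple distance-based test in which a point is flagged when at least a fraction p of the other points lie farther than a distance d from it; the data, the names, and the values of p and d are hypothetical.

```python
import math

def flagged(points, p, d):
    """Indices of points for which at least fraction p of the other points lie farther than d."""
    result = []
    for i, a in enumerate(points):
        far = sum(1 for j, b in enumerate(points) if j != i and math.dist(a, b) > d)
        if far >= p * (len(points) - 1):
            result.append(i)
    return result

# Twenty hypothetical normal records on a small grid near the origin.
normal = [((i % 5) * 0.1, (i // 5) * 0.1) for i in range(20)]
intrusion = (10.0, 10.0)

# The same intrusion occurring once, and occurring five times.
single = flagged(normal + [intrusion], p=0.9, d=3.0)
repeated = flagged(normal + [intrusion] * 5, p=0.9, d=3.0)
```

In this sketch the single intrusion is flagged (index 20), whereas none of the five repeated intrusion records is flagged: each repeated record has four neighbors at zero distance, so the fraction of distant points falls below p. Lowering p would recover the repeated records in this particular example, but only by an adjustment that presumes knowledge of how often an intrusion recurs.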
Accordingly, there exists a need in the art for a technique to detect anomalies in the operation of a computer system which can be performed over unlabeled data, and which can accurately detect many types of intrusions.