1. Field of the Invention
This invention relates to systems and methods for detecting anomalies in a computer system, and more particularly to an architecture and data format for using a central data warehouse and heterogeneous data sources.
2. Background
As sensitive information is increasingly being stored and manipulated on networked systems, the security of these networks and systems has become an extremely important issue. Intrusion detection systems (IDSs) are an integral part of any complete security package of a modern, well managed network system. An IDS detects intrusions by monitoring a network or system and analyzing an audit stream collected from the network or system to look for clues of malicious behavior.
Many widely used and commercially available IDSs are signature-based systems. As is known in the art, a signature-based system matches features observed from the audit stream to a set of signatures hand-crafted by experts and stored in a signature database. Signature-based methods have some inherent limitations. For example, a signature-based method is designed to detect only attacks for which it contains a signature in the database. Therefore, signature-based methods cannot detect unknown attacks, since there is no signature in the database for them. Such unknown attacks can be dangerous because the system is completely vulnerable to them. In addition, manually encoding a signature for each and every known attack is expensive in time and human expertise.
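By way of illustration only, the limitation described above can be sketched as follows. The signature names, patterns, and audit records are hypothetical and are not taken from any particular IDS; the sketch assumes that a signature is simply a substring that must appear in an audit record.

```python
# Hypothetical signature database: signature name -> substring that must
# appear in an audit record. Real systems use far richer signatures.
SIGNATURES = {
    "phf-cgi-probe": "/cgi-bin/phf?",
    "passwd-fetch": "/etc/passwd",
}


def match_signatures(record):
    """Return the names of all signatures whose pattern occurs in the record."""
    return [name for name, pattern in SIGNATURES.items() if pattern in record]


# A record that matches stored signatures is reported.
known = "GET /cgi-bin/phf?Qalias=x%0a/bin/cat%20/etc/passwd"
# A record from an attack with no stored signature produces no match.
novel = "GET /cgi-bin/newbug?cmd=id"

print(match_signatures(known))
print(match_signatures(novel))
```

In this sketch the second record passes unnoticed until an expert adds a corresponding entry to the database, which is the manual cost noted above.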
Data mining-based methods are another paradigm for building intrusion detection systems. The main advantage of these methods is that they leverage the generalization ability of data mining methods in order to detect new and unknown attacks. Data mining IDSs collect data from sensors which monitor some aspect of a system. Sensors may monitor network activity, system calls used by user processes, or file system access. They extract predictive features from the raw data stream being monitored to produce formatted data that can be used for detection. Machine learning and data mining algorithms are used on a large set of such data (e.g., “training data”) to build detection models. New data (e.g., “sensor data”) gathered by sensors is evaluated by a detector using the detection model. This model determines whether or not the sensor data is intrusive. These models have been shown to be very effective. (See, W. Lee, S. J. Stolfo, and K. Mok, “Data Mining in Work Flow Environments: Experiences in Intrusion Detection,” Proceedings of the 1999 Conference on Knowledge Discovery and Data Mining (KDD-99), 1999; and Christina Warrender, Stephanie Forrest, and Barak Pearlmutter, “Detecting Intrusions Using System Calls: Alternative Data Models,” Proceedings of the 1999 IEEE Symposium on Security and Privacy, pages 133–145. IEEE Computer Society, 1999).
These algorithms are generally classified as either misuse detection or anomaly detection. Misuse detection algorithms model known attack behavior. They compare sensor data to attack patterns learned from the training data. If the sensor data matches the pattern of some known attack data, the observed data is considered intrusive. Misuse models are typically obtained by training on a large set of data in which the attacks have been manually labeled. (See, W. Lee, S. J. Stolfo, and K. Mok, “Data Mining in Work Flow Environments: Experiences in Intrusion Detection,” Proceedings of the 1999 Conference on Knowledge Discovery and Data Mining (KDD-99), 1999.) This data is very expensive to produce because each piece of data must be labeled as either normal or some particular attack.
Anomaly detection algorithms learn a model of normal activity by training on a set of normal data. Anomaly detection models compare sensor data to normal patterns learned from the training data. Anomaly detection algorithms then classify activity that diverges from this normal pattern as an attack, based on the assumption that attacks have much different patterns than normal activity does. In this way new unknown attacks can be detected. (See, e.g., D. E. Denning, “An Intrusion Detection Model,” IEEE Transactions on Software Engineering, SE-13:222–232, 1987; T. Lane and C. E. Brodley, “Sequence Matching and Learning in Anomaly Detection for Computer Security,” Proceedings of the AAAI-97 Workshop on AI Approaches to Fraud Detection and Risk Management, pages 43–49. Menlo Park, Calif.: AAAI Press, 1997; Christina Warrender, Stephanie Forrest, and Barak Pearlmutter, “Detecting Intrusions Using System Calls: Alternative Data Models,” Proceedings of the 1999 IEEE Symposium on Security and Privacy, pages 133–145. IEEE Computer Society, 1999; and T. Lane and C. E. Brodley, “Temporal Sequence Learning and Data Reduction for Anomaly Detection,” Proceedings of the Fifth ACM Conference on Computer and Communications Security, pages 150–158, 1998.) Anomaly detection models are popular because they are seen as a possible approach to detecting unknown or new attacks. Most of these algorithms require that the data used for training be purely normal and not contain any attacks. Such data can be very expensive to obtain because the process of manually cleaning the data is quite time consuming. Also, some algorithms require a very large amount of normal data, which increases the cost.
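By way of illustration only, the following minimal sketch shows one simple way a model of normal activity might be learned and applied, and why attack-free training data matters. It assumes a single numeric feature (for example, bytes transferred per connection) and a fixed threshold of three standard deviations; the feature, the data values, and the threshold are all hypothetical, and the cited algorithms are considerably more sophisticated.

```python
from statistics import mean, stdev


def train_normal_model(training_values):
    """Summarize training data by its mean and sample standard deviation."""
    return mean(training_values), stdev(training_values)


def is_anomalous(value, model, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mu, sigma = model
    return abs(value - mu) > threshold * sigma


# Hypothetical attack-free training data.
normal = [100, 104, 98, 101, 97, 103, 99, 102]
clean_model = train_normal_model(normal)

# The same data contaminated with three unlabeled attack instances.
contaminated_model = train_normal_model(normal + [250, 260, 255])

print(is_anomalous(101, clean_model))         # typical activity
print(is_anomalous(250, clean_model))         # divergent activity
print(is_anomalous(250, contaminated_model))  # same activity, contaminated model
```

With the clean model the divergent value is flagged. With the contaminated model the learned spread widens enough that the same value is treated as normal, which is why the training data is ordinarily required to be free of attacks.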
As discussed above, data mining-based IDSs have their own disadvantages. Data to train the models is costly to generate. The data must be collected from a raw audit stream and translated into a form suitable for training. In addition, for misuse detection, each instance of data must be labeled either normal or attack. In the case of anomaly detection, each instance of data must be verified to be normal network activity.
Since data mining-based IDSs in general do not perform well when trained in one environment and deployed in another, this process of preparing the data must be repeated at every deployment of a data mining-based IDS. Furthermore, for each type of audit data that is to be examined (network packets, host event logs, process traces, etc.) the process of preparing the data needs to be repeated as well. Because of the large volumes of data that need to be prepared, the deployment of a data mining-based IDS involves a tremendous amount of manual effort.
Many parts of these manual processes can be automated, including the collection and aggregation of the data and its translation into a form appropriate for training the data mining-based detection models. In addition, many of these processes are the same across types of audit data. Some of the processes, such as labeling the data, still require some manual intervention, but even these can be semi-automated.
The work most similar to adaptive model generation is a technique developed at SRI in the Emerald system. (See, e.g., H. S. Javitz and A. Valdes, “The NIDES Statistical Component: Description and Justification,” Technical Report, SRI International, 1993.) Emerald uses historical records to build normal detection models and compares distributions of new instances to historical distributions. Discrepancies between the distributions signify an intrusion. One problem with this approach is that intrusions present in the historical distributions may cause the system to not detect similar intrusions in unseen data.
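By way of illustration only, comparing the distribution of new instances to a historical distribution might be sketched as follows. The chi-square-style score, the event categories, and the counts are assumptions chosen for this example and are not a reproduction of the statistical component described in the cited report.

```python
from collections import Counter


def distribution(events):
    """Convert a list of categorical events into relative frequencies."""
    counts = Counter(events)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


def discrepancy(historical, recent, floor=1e-6):
    """Chi-square-style score: sum of (observed - expected)^2 / expected.

    `expected` comes from the historical profile; `floor` avoids division
    by zero for categories never seen historically.
    """
    score = 0.0
    for category in set(historical) | set(recent):
        expected = max(historical.get(category, 0.0), floor)
        observed = recent.get(category, 0.0)
        score += (observed - expected) ** 2 / expected
    return score


# Hypothetical historical record and two recent windows of activity.
history = distribution(["login"] * 80 + ["read"] * 15 + ["su"] * 5)
similar = distribution(["login"] * 8 + ["read"] * 2)
unusual = distribution(["login"] * 3 + ["su"] * 7)

print(discrepancy(history, similar))
print(discrepancy(history, unusual))
```

The second window scores far higher than the first. The sketch also makes the stated weakness visible: if the historical record already contained many intrusive "su" events, the expected frequency for that category would be high and a similar intrusion would score low.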
Related to automatic model generation is adaptive intrusion detection. Teng et al. perform adaptive real time anomaly detection by using inductively generated sequential patterns. (See, H. S. Teng, K. Chen and S. C. Lu, “Adaptive Real-Time Anomaly Detection Using Inductively Generated Sequential Patterns,” Proceedings of the IEEE Symposium on Research in Security and Privacy, pages 278–284, Oakland, Calif., May 1990.) Also relevant is Sobirey's work on adaptive intrusion detection using an expert system to collect data from audit sources. (See, M. Sobirey, B. Richter and M. Honig, “The Intrusion Detection System Aid, Architecture and Experiences In Automated Audit Analysis,” Proc. of the IFIP TC6/TC11 International Conference on Communications and Multimedia Security, pages 278–290, Essen, Germany, 1996.)
Many different approaches to building anomaly detection models have been proposed. A survey and comparison of anomaly detection techniques is given in Christina Warrender, Stephanie Forrest and Barak Pearlmutter, “Detecting Intrusions Using System Calls: Alternative Data Models,” Proceedings of the 1999 IEEE Symposium on Security and Privacy, pp. 133–145, IEEE Computer Society, 1999. Stephanie Forrest presents an approach for modeling normal sequences using look ahead pairs (See, Stephanie Forrest, S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff, “A Sense of Self For UNIX Processes,” Proceedings of the 1996 IEEE Symposium on Security and Privacy, pp. 120–128, IEEE Computer Society, 1996) and contiguous sequences (See, S. A. Hofmeyr, Stephanie Forrest, and A. Somayaji, “Intrusion Detection Using Sequences of System Calls,” Journal of Computer Security, 6:151–180, 1998). Helman and Bhangoo present a statistical method to determine sequences which occur more frequently in intrusion data as opposed to normal data. (See, P. Helman and J. Bhangoo, “A Statistically Based System for Prioritizing Information Exploration Under Uncertainty,” IEEE Transactions on Systems, Man and Cybernetics. Part A: Systems and Humans, 27:449–466, 1997.) Lee et al. use a prediction model trained by a decision tree applied over the normal data. (See, W. Lee and S. J. Stolfo, “Data Mining Approaches For Intrusion Detection,” Proceedings of the Seventh USENIX Security Symposium, 1998; and W. Lee, S. J. Stolfo, and P. K. Chan, “Learning Patterns From UNIX Processes Execution Traces For Intrusion Detection,” Proceedings of the AAAI-97 Workshop on AI Approaches to Fraud Detection and Risk Management, pages 50–56. Menlo Park, Calif.: AAAI Press, 1997.) Ghosh and Schwartzbard use neural networks to model normal data. (See, Anup Ghosh and Aaron Schwartzbard, “A Study in Using Neural Networks for Anomaly and Misuse Detection,” Proceedings of the Eighth USENIX Security Symposium, 1999.)
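By way of illustration only, modeling normal behavior with contiguous sequences of system calls might be sketched as follows. The window length of three, the traces, and the function names are hypothetical; the sketch follows the general idea of a sliding window over a trace and is not a reproduction of the cited methods.

```python
def windows(trace, n):
    """Return every contiguous length-n subsequence of a system call trace."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def build_normal_db(normal_traces, n=3):
    """Collect the set of all length-n sequences seen in normal traces."""
    db = set()
    for trace in normal_traces:
        db.update(windows(trace, n))
    return db


def mismatch_rate(trace, db, n=3):
    """Fraction of the trace's length-n sequences that are absent from the db."""
    seen = windows(trace, n)
    if not seen:
        return 0.0
    return sum(w not in db for w in seen) / len(seen)


# Hypothetical traces of system call names.
normal_traces = [["open", "read", "mmap", "mmap", "open", "read", "close"]]
db = build_normal_db(normal_traces)

familiar = ["open", "read", "mmap", "mmap", "open", "read", "close"]
suspicious = ["open", "read", "execve", "socket", "close"]

print(mismatch_rate(familiar, db))
print(mismatch_rate(suspicious, db))
```

A trace whose mismatch rate exceeds some chosen threshold would be reported as anomalous; selecting that threshold and the window length is left open in this sketch.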
Lane and Brodley examine unlabeled data for anomaly detection by looking at user profiles and comparing the activity during an intrusion to the activity under normal use. (See, e.g., T. Lane and C. E. Brodley, “Sequence Matching and Learning in Anomaly Detection for Computer Security,” Proceedings of the AAAI-97 Workshop on AI Approaches to Fraud Detection and Risk Management, pages 43–49. Menlo Park, Calif.: AAAI Press, 1997; T. Lane and C. E. Brodley, “Temporal Sequence Learning and Data Reduction for Anomaly Detection,” Proceedings of the Fifth ACM Conference on Computer and Communications Security, pages 150–158, 1998; and T. Lane and C. E. Brodley, “Temporal Sequence Learning and Data Reduction for Anomaly Detection,” ACM Transactions on Information and System Security, 2:295–331, 1999.)
In the area of intrusion data representation, related work includes the IETF Intrusion Detection Exchange Format project (“Internet Engineering Task Force: Intrusion Detection Exchange Format,” http://www.ietf.org/html.charters/idwg-charter.html, 2000) and the CIDF effort (S. Staniford-Chen, B. Tung and D. Schnackenberg, “The Common Intrusion Detection Framework (CIDF),” Proceedings of the Information Survivability Workshop, October 1998).
The challenge in automating these processes is the need to support different types of data and different types of detection models. In a typical network environment there are many different audit streams that are useful for detecting intrusions.
What is needed is an architecture to automate the processes of data collection, model generation and data analysis, and to solve many of the practical problems associated with the deployment of data mining-based IDSs.