1. Field of the Invention
This invention relates to methods of intrusion detection in a computer system, and more particularly, to techniques for generating intrusion detection models by artificially creating instances of data on a computer system.
2. Background Information
Many information survival systems, such as intrusion detection systems (IDSs) and credit card fraud detection systems, must be capable of detecting new and unknown patterns, or anomalies. At the same time, they must be able to efficiently adapt existing models when knowledge about new patterns becomes available.
Exemplary, novel techniques for intrusion detection are described in co-pending U.S. application Ser. No. 10/208,402 filed Jul. 30, 2002, entitled “System and Methods For Intrusion Detection With Dynamic Window Sizes,” U.S. application Ser. No. 10/208,432 filed Jul. 30, 2002, entitled “System and Methods For Detection of New Malicious Executables,” and U.S. application Ser. No. 10/222,632 filed Aug. 16, 2002, entitled “System and Methods For Detecting Malicious Email Transmission,” each of which is incorporated by reference in its entirety herein.
Data analysis tasks can be broadly categorized into anomaly detection and classification problems. “Anomaly detection” tracks events that are inconsistent with or deviate from events that are known or expected. For example, in intrusion detection, anomaly detection systems are designed to flag observed activities that deviate significantly from established normal usage profiles. On the other hand, “classification systems” use patterns of well-known classes to match and identify known labels for unlabeled datasets. In intrusion detection, classification of known attacks is also called “misuse detection.” By definition, anomalies are not known in advance. Otherwise, they might be treated as a classification problem. Classification solves the problem of effectively learning from experience; however, anomaly detection discovers new knowledge and experience that may be used by classification after these anomalies are verified and established as new classes.
Anomaly detection systems are not as well studied, explored, or applied as classification systems. For IDSs, the DARPA evaluation results (further details are disclosed in MIT Lincoln Lab 1999, “1998 DARPA Intrusion Detection Evaluation,” online publication, http://www.ll.mit.edu/IST/ideval/index.html), one of the most authoritative competitions, showed that even the best IDSs fail to detect a large number of new and unknown intrusions. As new intrusion prevention and detection systems are deployed, it is expected that new attacks may be developed and launched. Misuse detection, or classification models, has limitations because it can only detect known attacks and their slight variations accurately.
In the current generation of classification models, training data containing instances of known classes is often available for training (or human analysis) and the goal is simply to detect instances of these known classes. Anomaly detection, however, relies on data belonging to one single class (such as purely “normal” connection records) or limited instances of some known classes with the goal of detecting all unknown classes. It may be difficult to use traditional inductive learning algorithms for such a task, as most are only good at distinguishing the boundaries among all given classes of data.
A major difficulty in using machine learning methods for anomaly detection lies in making the learner discover boundaries between known and unknown classes. Since the process of machine learning typically begins without any examples of anomalies in the training data (by definition of anomaly), a machine learning algorithm will only uncover boundaries that separate different known classes in the training data. This behavior is intended to prevent overfitting a model to the training data. Learners only generate hypotheses for the provided class labels in the training data. These hypotheses define decision boundaries that separate the given class labels for the sake of generality. They will not specify a boundary beyond what is necessary to separate known classes.
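The limitation described above can be illustrated with a minimal sketch. It assumes a simple nearest-centroid learner and made-up two-dimensional feature vectors; the function names `centroids` and `predict` and the labels "normal" and "probe" are hypothetical and are not taken from any system referenced herein. The learner only separates the classes it was given, so an instance far from all training data still receives one of the known labels.

```python
import math

def centroids(samples):
    """Compute one mean feature vector per known class label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, features):
    """Return the known label whose centroid is closest to the instance."""
    return min(model, key=lambda label: math.dist(model[label], features))

# Hypothetical training data containing only known classes.
training = [
    ((1.0, 1.0), "normal"),
    ((2.0, 1.0), "normal"),
    ((8.0, 9.0), "probe"),
    ((9.0, 8.0), "probe"),
]
model = centroids(training)

# An instance unlike anything in the training data is still forced
# into one of the known classes; no "unknown" region is ever learned.
far_away = (1000.0, -500.0)
label_for_far_away = predict(model, far_away)
```

In this sketch the boundary between "normal" and "probe" extends indefinitely, so no input can ever be reported as unknown.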
Some learners can generate a default classification for instances that are not covered by the learned hypothesis. The label of this default classification is often defined to be the most frequently occurring class of all uncovered instances in the training data. It is possible to modify this default prediction to be “anomaly,” signifying that any uncovered instance should be considered anomalous. It is also possible to tune the parameters of some learners to coerce them into learning more specific hypotheses. These methods typically do not yield a reasonable performance level.
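The default-classification behavior described above can be sketched as follows. This is an illustrative example only: the rule set, the record fields `service` and `bytes`, and the function names are assumptions introduced for the illustration and do not correspond to any particular learner.

```python
from collections import Counter

# Hypothetical learned hypothesis: (predicate, label) pairs.
RULES = [
    (lambda r: r["service"] == "http" and r["bytes"] < 10_000, "normal"),
    (lambda r: r["service"] == "smtp" and r["bytes"] < 50_000, "normal"),
]

def covered(record):
    """True if at least one learned rule matches the record."""
    return any(pred(record) for pred, _ in RULES)

def majority_default(training, fallback="normal"):
    """Most frequent class among training instances no rule covers."""
    uncovered = [label for rec, label in training if not covered(rec)]
    if not uncovered:
        return fallback
    return Counter(uncovered).most_common(1)[0][0]

def classify(record, default):
    """Apply the first matching rule, else return the default label."""
    for pred, label in RULES:
        if pred(record):
            return label
    return default

# Hypothetical labeled training data.
training = [
    ({"service": "http", "bytes": 500}, "normal"),
    ({"service": "ftp", "bytes": 200}, "normal"),
    ({"service": "ftp", "bytes": 900}, "normal"),
    ({"service": "telnet", "bytes": 50}, "attack"),
]

unseen = {"service": "irc", "bytes": 10}

# Conventional default: the majority class of uncovered training instances.
conventional = classify(unseen, majority_default(training))

# Modified default: every uncovered instance is reported as anomalous.
modified = classify(unseen, "anomaly")
```

With the modified default, every instance that falls outside the learned rules is flagged. This includes legitimate traffic the rules simply failed to cover, which is one reason such a modification alone may not perform well.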
Past research in anomaly detection in the intrusion detection domain has tended to focus on modeling user or program activities on a single host. For example, NIDES (Anderson, D. et al., “Next-Generation Intrusion Detection Expert System (NIDES): A Summary,” Technical Report SRI-CSL-95-07, Computer Science Laboratory, SRI International, Menlo Park, 1995) uses a number of statistical measures to construct user profiles. Forrest et al. (Forrest, S. et al., “A Sense of Self for UNIX Processes,” Proceedings of IEEE Symposium on Security and Privacy 1996) used short sequences of system calls made by programs to build normal profiles of process execution. However, many network-based attacks, e.g., the recent Distributed Denial of Service attacks on various Web sites, do not involve users or system programs on the victim hosts, and thus render anomaly detection models based on user and program activities less effective.
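The idea of building a normal profile from short sequences of system calls can be sketched as follows. This is a simplified illustration that uses contiguous fixed-length windows and is not necessarily the exact procedure of the cited work. The window length, the function names, and the example traces are assumptions.

```python
def windows(trace, k):
    """Return every contiguous length-k subsequence of a system call trace."""
    return [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]

def build_profile(normal_traces, k=3):
    """Collect the set of short sequences observed during normal execution."""
    profile = set()
    for trace in normal_traces:
        profile.update(windows(trace, k))
    return profile

def mismatch_rate(trace, profile, k=3):
    """Fraction of windows in the trace that never occurred in the profile."""
    observed = windows(trace, k)
    if not observed:
        return 0.0
    misses = sum(1 for w in observed if w not in profile)
    return misses / len(observed)

# Hypothetical traces of system call names for one program.
normal_traces = [["open", "read", "mmap", "read", "close"]]
profile = build_profile(normal_traces, k=3)

same_behavior = mismatch_rate(["open", "read", "mmap", "read", "close"], profile)
odd_behavior = mismatch_rate(["open", "read", "exec", "read", "close"], profile)
```

A profile of this kind is tied to processes running on a single host, which is why it offers little help against attacks that do not involve user or program activity on the victim host.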
Therefore, there is a need to develop anomaly detection models for classification of network activities and to classify previously unknown anomalies.