1. Field of the Invention
This invention relates to systems and methods for detecting anomalies in a computer system, and more particularly to the use of probabilistic and statistical models to model the behavior of processes which access the file system of the computer, such as the Windows™ registry.
2. Background
Windows™ is currently one of the most widely used operating systems, and consequently computer systems running the Windows™ operating system are frequently subject to attacks. Malicious software is often used to perpetrate these attacks. Two conventional responses to malicious software are virus scanners, which attempt to detect the malicious software, and security patches, which are created to repair the security “hole” in the operating system that the malicious software has been found to exploit. Both of these methods for protecting hosts against malicious software suffer from drawbacks. While they may be effective against known attacks, they are unable to detect and prevent new and previously unseen types of malicious software.
Many virus scanners are signature-based, which generally means that they use byte sequences or embedded strings in software to identify certain programs as malicious. If a virus scanner's signature database does not contain a signature for a malicious program, the virus scanner is unable to detect or protect against that malicious program. In general, virus scanners require frequent updating of their signature databases; otherwise, the scanners cannot detect new attacks. Similarly, security patches protect systems only when they have been written, distributed and applied to host systems in response to known attacks. Until then, systems remain vulnerable and attacks are potentially able to spread widely.
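The limitation of signature matching described above can be illustrated with a minimal sketch. The signature names, byte patterns, and the `scan` function are hypothetical and chosen only for illustration; they do not correspond to any actual scanner or malicious program.

```python
# Hypothetical signature database: name -> byte pattern (illustrative values only).
SIGNATURES = {
    "example-worm-a": b"\xde\xad\xbe\xef",
    "example-trojan-b": b"EVIL_MARKER",
}


def scan(data: bytes, signatures=SIGNATURES):
    """Return the sorted names of all signatures whose pattern occurs in data."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


# A program containing a known byte sequence is flagged.
known = b"header" + b"\xde\xad\xbe\xef" + b"payload"

# A program with a previously unseen byte sequence produces no match,
# so the scanner reports nothing even if the program is malicious.
novel = b"header" + b"\xca\xfe\xf0\x0d" + b"payload"

print(scan(known))
print(scan(novel))
```

In this sketch, the second program goes undetected until a matching entry is added to the database, which is the update dependency discussed above.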
Frequent updates of virus scanner signature databases and security patches are necessary to protect computer systems using these approaches to defend against attacks. If these updates do not occur on a timely basis, these systems remain vulnerable to very damaging attacks caused by malicious software. Even in environments where updates are frequent and timely, the systems are inherently vulnerable from the time new malicious software is created until the software is discovered, new signatures and patches are created, and ultimately distributed to the vulnerable systems. Since malicious software may be propagated through email, the malicious software may reach the vulnerable systems long before the updates are in place.
Another approach is the use of intrusion detection systems (IDS). Host-based IDS systems monitor a host system and attempt to detect an intrusion. In an ideal case, an IDS can detect the effects or behavior of malicious software rather than distinct signatures of that software. In practice, many of the commercial IDS systems that are in widespread use rely on signature-based algorithms and therefore have the drawbacks discussed above. Typically, these algorithms match host activity to a database of signatures which correspond to known attacks. This approach, like virus detection algorithms, requires previous knowledge of an attack and is rarely effective on new attacks. However, recently there has been growing interest in the use of data mining techniques, such as anomaly detection, in IDS systems. Anomaly detection algorithms may build models of normal behavior in order to detect behavior that deviates from normal behavior and which may correspond to an attack. One important advantage of anomaly detection is that it may detect new attacks, and consequently may be an effective defense against new malicious software. Anomaly detection algorithms have been applied to network intrusion detection (see, e.g., D. E. Denning, “An Intrusion Detection Model,” IEEE Transactions on Software Engineering, SE-13:222-232, 1987; H. S. Javitz and A. Valdes, “The NIDES Statistical Component: Description and Justification,” Technical report, SRI International, 1993; and W. Lee, S. J. Stolfo, and K. Mok, “Data Mining in Work Flow Environments: Experiences in Intrusion Detection,” Proceedings of the 1999 Conference on Knowledge Discovery and Data Mining (KDD-99), 1999) and also to the analysis of system calls for host-based intrusion detection (see, e.g., Stephanie Forrest, S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff, “A Sense of Self for UNIX Processes,” IEEE Computer Society, pp. 120-128, 1996; Christina Warrender, Stephanie Forrest, and Barak Pearlmutter, “Detecting Intrusions Using System Calls: Alternative Data Models,” IEEE Computer Society, pp. 133-145, 1999; S. A. Hofmeyr, Stephanie Forrest, and A. Somayaji, “Intrusion Detection Using Sequences of System Calls,” Journal of Computer Security, 6:151-180, 1998; W. Lee, S. J. Stolfo, and P. K. Chan, “Learning Patterns from UNIX Processes Execution Traces for Intrusion Detection,” AAAI Press, pp. 50-56, 1997; and Eleazar Eskin, “Anomaly Detection Over Noisy Data Using Learned Probability Distributions,” Proceedings of the Seventeenth International Conference on Machine Learning (ICML-2000), 2000).
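The general idea of anomaly detection, i.e., building a model of normal behavior and flagging deviations from it, can be sketched as follows. This is a generic illustration that assumes behavior is recorded as a sequence of event names and models normal behavior as the set of short event windows seen during training. It is not a reproduction of any of the cited methods; the event names, the window length, and the function names are hypothetical.

```python
def windows(trace, n):
    """Return every run of n consecutive events in the trace as a tuple."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def train(normal_traces, n=3):
    """Model normal behavior as the set of all windows seen in normal traces."""
    model = set()
    for trace in normal_traces:
        model.update(windows(trace, n))
    return model


def anomaly_score(trace, model, n=3):
    """Return the fraction of windows in the trace that were never seen in training."""
    seen = windows(trace, n)
    if not seen:
        return 0.0
    unseen = sum(1 for w in seen if w not in model)
    return unseen / len(seen)


# Hypothetical traces of normal activity used to build the model.
normal = [
    ["open", "read", "read", "close"],
    ["open", "read", "write", "close"],
]
model = train(normal)

# A trace matching normal behavior scores 0.0; an unfamiliar trace scores higher.
print(anomaly_score(["open", "read", "read", "close"], model))
print(anomaly_score(["open", "exec", "socket", "write"], model))
```

Because the model is built only from normal activity, such a detector requires no prior knowledge of a specific attack. A trace is reported as anomalous when its score exceeds a chosen threshold, and the choice of threshold trades detection rate against false positives.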
There are drawbacks to the prior art approaches. For example, the system call approach to host-based intrusion detection has several disadvantages which inhibit its use in actual deployments. A first is that the computational overhead of monitoring all system calls is potentially very high, which may degrade the performance of a system. A second is that system calls themselves are typically irregular by nature. Consequently, it is difficult to differentiate between normal and malicious behavior, and this difficulty may result in a high false positive rate.
Accordingly, there is a need in the art for an intrusion detection system which overcomes these limitations of the prior art.