1. Field of the Invention
This invention relates to systems and methods for detecting malicious executable programs, and more particularly to the use of data mining techniques to detect such malicious executables in email attachments.
2. Background
A malicious executable is a program that performs a malicious function, such as compromising a system's security, damaging a system or obtaining sensitive information without the user's permission. One serious security risk is the propagation of these malicious executables through e-mail attachments. Malicious executables are used as attacks for many types of intrusions. For example, there have been some high profile incidents with malicious email attachments such as the ILOVEYOU virus and its clones. These malicious attachments are capable of causing significant damage in a short time.
Current virus scanner technology has two parts: a signature-based detector and a heuristic classifier that detects new viruses. The classic signature-based detection algorithm relies on signatures (unique telltale strings) of known malicious executables to generate detection models. Signature-based methods create a unique tag for each malicious program so  that future examples of it can be correctly classified with a small error rate. These methods do not generalize well to detect new malicious binaries because they are created to give a false positive rate as close to zero as possible. Whenever a detection method generalizes to new instances, the tradeoff is for a higher false positive rate.
Unfortunately, traditional signature-based methods may not detect a new malicious executable. In an attempt to solve this problem, the anti-virus industry generates heuristic classifiers by hand. This process can be even more costly than generating signatures, so finding an automatic method to generate classifiers has been the subject of research in the anti-virus community. To solve this problem, different IBM researchers applied Artificial Neural Networks (ANNs) to the problem of detecting boot sector malicious binaries. (The method of detection is disclosed in G. Tesauro et al., “Neural Networks for Computer Virus Recognition, IEE Expert, 11(4):5-6, August 1996, which is incorporated by reference in its entirety herein.) An ANN is a classifier that models neural networks explored in human cognition. Because of the limitations of the implementation of their classifier, they were unable to analyze anything other than small boot sector viruses which comprise about 5% of all malicious binaries.
Using an ANN classifier with all bytes from the boot sector malicious executables as input, IBM researchers were able to identify 80-85% of unknown boot sector malicious executables successfully with a low false positive rate (<1%). They were unable to find a way to apply ANNs to the other 95% of computer malicious binaries.
In similar work, Arnold and Tesauro applied the same techniques to Win32 binaries, but because of limitations of the ANN classifier they were unable to have the comparable accuracy over new Win32 binaries. (This technique is described in Arnold et al.,: Automatically  Generated Win 32 Heuristic Virus Detection,” Proceedings of the 2000 International Virus Bulletin Conference, 2000, which is incorporated by reference in its entirety herein.)
The methods described above have the shortcoming that they are not applicable to the entire set of malicious executables, but rather only boot-sector viruses, or only Win32 binaries.
The technique is similar to data mining techniques that have already been applied to Intrusion Detection Systems by Lee et al. Their methods were applied to system calls and network data to learn how to detect new intrusions. They reported good detection rates as a result of applying data mining to the problem of IDS. A similar framework is applied to the problem of detecting new malicious executables. (The techniques are described in W. Lee et al., “Learning Patterns From UNIX Processes Execution Traces for Intrusion Detection, AAAI Workshop in AI Approaches to Fraud Detection and Risk Management, 1997, pages 50-56, and W. Lee et al., “A Data Mining Framework for Building Intrusion Detection Models,” IEEE Symposium on Security and Privacy, 1999, both of which are incorporated by reference in their entirety herein.)
Procmail is a mail processing utility which runs under UNIX, and which filters email; and sorts incoming email according to sender, subject line, length of message, keywords in the message, etc. Procmail's pre-existent filter provides the capability of detecting active-content HTML tags to protect users who read their mail from a web browser or HTML-enabled mail client. Also, if the attachment is labeled as malicious, the system “mangles” the attachment name to prevent the mail client from automatically executing the attachment. It also has built in security filters such as long filenames in attachments, and long MIME headers, which may crash or allow exploits of some clients. 
However, this filter lacks the ability to automatically update its list of known malicious executables, which may leave the system vulnerable to attacks by new and unknown viruses. Furthermore, its evaluation of an attachment is based solely on the name of the executable and not the contents of the attachment itself.
Accordingly, there exists a need in the art for a technique which is not limited to particular types of files, such as boot-sector viruses, or only Win32 binaries, and which provides the ability to detect new, previously unseen files.