Detection of cyber intrusion attempts is a key component of cyber security. Current commercial anti-virus and intrusion detection software (IDS) rely largely on signature-based methods to identify malicious code before the malicious code can cause harm to computer systems. However, signature-based mechanisms are ineffective against zero-day exploits, since the signature of zero-day malware is, by definition, unknown: the malware has not previously been identified as such.
Commercial IDSs, such as those provided by McAfee, Symantec, or Sophos, rely on a signature-based approach to identifying malicious code. The signature, essentially a fingerprint for malware, must already be known and deployed on the current system, usually through an anti-virus update or patch, for the IDS software to be able to detect the threat. This paradigm has several significant drawbacks:
a. The increasing rate at which new strains of malware are introduced means that ever-increasing resources must be dedicated to generating, storing, and accessing malware signatures.
b. Even small alterations to existing malware render it invisible to signature detection.
c. The very nature of the signature generation process dictates that zero-day malware will be invisible until a sample can be identified, isolated, and analyzed. Only then can a signature be generated and pushed out to the intrusion detection systems.
Consequently, the problem is that zero-day malware that has not been seen before must be identified as rapidly as possible while maintaining high accuracy by reducing both false negatives (amount of malware erroneously classified as not malware) and false positives (amount of non-malware erroneously classified as malware). Mechanisms must be developed that can identify zero-day malware quickly and with high accuracy (including few false alarms).
Generally there are two broad types of automated malware detection systems: 1) Instance Matching (signature-based methods) and 2) Class Matching.
1) As discussed above, instance-matching (also called “template-matching”) detectors operate by memorizing and exactly matching byte patterns (a signature) within a specific instance of malware. The resulting template is effective for identifying other exact instances of the same malware. Though conceptually simple to implement, this methodology has several major disadvantages:
a. Many thousands of templates are needed to cover the entire malware domain.
b. Templates are not effective against new (“zero-day”) threats because it takes time (on the order of many hours or days) to analyze the newly discovered threats and distribute effective templates to recognize them.
c. Instance-matching templates are “brittle” in the sense that malware authors can easily mitigate them by minor editing of the software codes. In fact, normal evolution of software often renders templates ineffective against new variants of the same malware codes.
2) Class-matching malware detectors are a fairly new development, designed to mitigate the shortcomings of instance-matching detectors. The main idea in class-matching malware detectors is to use machine-learning techniques to construct models that recognize entire classes of malware that share a common set of “features” such as specific sets of byte codes (“n-grams”) or the relative frequency of occurrence of key byte patterns. These models consist of classification rule sets or decision trees which infer the malicious nature of a particular instance of software, based on the presence or absence of key byte code patterns. The models are derived from analysis of the features of known malicious and benign sets (the “training set”).
These models are more difficult to create but have several advantages over instance-matching detectors:
a. They can classify instances that were not in the training set, based on shared characteristic patterns, and, therefore, can be effective against zero-day threats.
b. The byte patterns tend to be very short and position independent and, therefore, are not as brittle as instance-matching templates.
c. Fewer models are required because each model can cover a broad set of instances.
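The contrast between the two detector types can be illustrated with a minimal sketch. It assumes a hypothetical 8-byte template and a synthetic 64-byte "file"; the names `matches_signature` and `byte_ngrams` are illustrative and do not come from any of the cited systems. A single inserted byte defeats the exact template, while most of the short, position-independent 4-byte n-grams survive.

```python
def matches_signature(data: bytes, signature: bytes) -> bool:
    """Instance matching: the exact byte pattern must occur somewhere in the file."""
    return signature in data


def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Class-matching features: the set of all n-byte sequences, ignoring position."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


# Synthetic stand-in for a malware sample (assumption: 64 distinct bytes).
original = bytes(range(64))
# Hypothetical template covering bytes 28..35 of the sample.
signature = original[28:36]
# A "new variant": one byte inserted in the middle of the sample.
variant = original[:32] + b"\xff" + original[32:]

original_ngrams = byte_ngrams(original)
variant_ngrams = byte_ngrams(variant)
shared = original_ngrams & variant_ngrams

print("template matches original:", matches_signature(original, signature))
print("template matches variant: ", matches_signature(variant, signature))
print("4-grams shared:", len(shared), "of", len(original_ngrams))
```

In this toy case the template no longer matches the variant, yet 58 of the original 61 4-grams are still present, which is the sense in which short byte patterns are less brittle than templates.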
The class-matching approach uses information theory and machine-learning techniques to identify general “features” of known malware through a “classifier” and to use the presence of these features to identify an unknown file as malware or not. This paradigm eliminates the need to know exactly what you are looking for in order to be able to find it. Specifically, the “classifier” is a decision tree based on “features” (n-grams, or sequences of n consecutive bytes; a good value for n is 4) present in either a binary file or in a system call or execution trace generated by execution of the file. The classifier is created by applying machine-learning algorithms (training) on a set of known malware and known benign-ware. Work on machine-learning based intrusion detection systems has generally only been pursued at the academic level. These academic approaches have generally trained on only a small set of malware (fewer than 1,000 files), yielding poor accuracy when applied to a broad range of files.
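As a rough illustration of how information theory can pick out useful features, the sketch below ranks a candidate n-gram by information gain, i.e. how much knowing whether the n-gram is present reduces uncertainty about the malicious/benign label. Information gain is used here as one common criterion and is an assumption; the text above does not say which measure is used. The four toy "files" and their labels are invented for the example.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Set of all n-byte sequences occurring in the data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def entropy(labels) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_sets, labels, ngram) -> float:
    """Reduction in label entropy from splitting on presence of `ngram`."""
    present = [y for f, y in zip(feature_sets, labels) if ngram in f]
    absent = [y for f, y in zip(feature_sets, labels) if ngram not in f]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


# Toy training set (invented): label 1 = malicious, 0 = benign.
files = [b"AAAAEVIL1234", b"zzEVILqq", b"hello world", b"AAAA good"]
labels = [1, 1, 0, 0]
feature_sets = [byte_ngrams(f) for f in files]

gain_evil = information_gain(feature_sets, labels, b"EVIL")
gain_aaaa = information_gain(feature_sets, labels, b"AAAA")
print("gain for b'EVIL':", gain_evil)
print("gain for b'AAAA':", gain_aaaa)
```

Here `b"EVIL"` separates the two classes perfectly (a gain of one bit), whereas `b"AAAA"` appears in one malicious and one benign file and carries no information. A decision tree built on such features would test the high-gain n-grams first.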
Despite the advantages class-matching detectors have over instance-matching detectors, class-matching detectors also have problems. For example, class-matching detectors tend to have higher false-alarm rates because they rely on byte code patterns contained in training sets containing specific examples of malicious and benign software. Benign software with similar byte sequences to malicious software may be mistakenly classified as malicious. Since the classifiers generally return a probability that the file is malicious, the false-alarm rate can be reduced, at the expense of the detection rate, by increasing the threshold above which a file is flagged as malicious. Instance-matching techniques, by their very nature, are generally immune to false alarms. Class-matching detectors have also been extremely slow and time-consuming to operate, and consequently have been ineffective in a commercial or practical setting.
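The threshold trade-off described above can be made concrete with a small sketch. The scores and labels below are invented for illustration, and `rates_at_threshold` is a hypothetical helper, not part of any cited system. Raising the threshold removes the false alarm but also lowers the detection rate.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (detection_rate, false_alarm_rate) when files scoring at or
    above `threshold` are flagged as malicious. Label 1 = malicious, 0 = benign."""
    flagged = [s >= threshold for s in scores]
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    true_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    false_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    detection_rate = true_pos / positives if positives else 0.0
    false_alarm_rate = false_pos / negatives if negatives else 0.0
    return detection_rate, false_alarm_rate


# Invented classifier outputs: probability that each file is malicious.
scores = [0.95, 0.80, 0.60, 0.40, 0.70, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

low = rates_at_threshold(scores, labels, 0.5)
high = rates_at_threshold(scores, labels, 0.75)
print("threshold 0.50 -> detection %.2f, false alarms %.2f" % low)
print("threshold 0.75 -> detection %.2f, false alarms %.2f" % high)
```

With these numbers, a threshold of 0.5 detects 75% of the malware at a 25% false-alarm rate, while a threshold of 0.75 produces no false alarms but detects only 50% of the malware.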
Examples of known class-matching methods are described in Kolter, J. Z. and Maloof, M. A. “Learning to detect and classify malicious executables in the wild.” Journal of Machine Learning Research 7 (2006) (“Kolter-Maloof”), U.S. Pat. No. 8,037,535 to Maloof, U.S. Pat. No. 7,519,998 to Cai, U.S. Pat. No. 7,487,544 to Schultz et al., and U.S. P.G.Pub. No. 20090300765 to Moskovitch et al. These publications do not provide solutions to the above-described problems of high false-alarm rates or ineffectiveness and have only been demonstrated in academic settings.