1. Field of the Invention
The present invention relates to comparison, classification and search of executable versions of computer programs. More particularly, the invention relates to comparison, classification and search of executable versions of computer programs, which may include viruses, worms, or other malicious programs.
2. Description of Related Art
The task of measuring the similarity of computer programs is encountered in many program comparison and analysis contexts. Such contexts include, but are not limited to: evaluating the similarity of programs for the purpose of copyright or patent infringement analysis; finding similar programs or program fragments to aid in software development and software assessment; classifying programs as malicious (e.g., as viruses or worms) by comparing them to a database of known malicious or benign programs; creating classification schemes for programs, such as a classification scheme for virus families; and tracking the evolution of programs, such as the evolution of malicious worm code as new variants are released.
There are known techniques for assessing similarity of documents, genomic information, files, and computer programs. Many of the techniques in computer program comparison and genomic comparison work by measuring the similarity of the sequences of characters or tokens. For instance, the standard UNIX utility “diff” can be used to calculate similarity of two files by measuring the differences; “diff” measures difference by finding the longest common subsequence of tokens (text lines, in this case) and then denoting differences in terms of adds, deletions, and changes.
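The longest-common-subsequence idea behind line-oriented comparison can be illustrated with a minimal sketch. It is not the implementation used by “diff”. The function names and the normalization (twice the common length divided by the total number of lines) are assumptions chosen for illustration only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of a
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the common subsequence ending just before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry forward the better of skipping x or skipping y.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def line_similarity(text_a, text_b):
    """Similarity in [0, 1] based on the LCS of text lines (assumed normalization)."""
    a = text_a.splitlines()
    b = text_b.splitlines()
    if not a and not b:
        return 1.0  # two empty files are treated as identical
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))
```

Under these assumptions, two three-line files that differ only in their middle line share a common subsequence of two lines and receive a similarity of 4/6.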
Many of the techniques used for documents work by comparing the frequencies of the terms or features found within the documents. A common feature used, for example, is an “n-gram”, which is merely a sequence of n characters or tokens. Given two documents, the frequencies of features such as n-grams can be compared and used as a basis of similarity. For instance, the feature frequencies can be interpreted as vectors in a multi-dimensional space, so that similarity can be measured by the distance or angle between the vectors.
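By way of illustration only, n-gram frequencies can be treated as vectors and compared as sketched below. Cosine similarity is one common choice of measure and is an assumption here, as are the function names; the text above does not prescribe a particular measure.

```python
import math
from collections import Counter


def ngram_counts(data, n):
    """Count every contiguous n-gram in a string or bytes object."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine_similarity(c1, c2):
    """Cosine of the angle between two sparse frequency vectors."""
    # Only n-grams present in both vectors contribute to the dot product.
    dot = sum(c1[g] * c2[g] for g in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an input too short to contain any n-gram matches nothing
    return dot / (norm1 * norm2)
```

In this sketch, identical inputs score 1.0 and inputs with no n-gram in common score 0.0.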
Techniques for detecting plagiarized and “cloned” parts of programs have also used the above techniques (primarily the sequence-based approaches), as well as methods that compare extracted metrics or extracted structures such as the control-flow graph structures.
Much of the past work specifically on program comparison has operated at the level of source code; fewer techniques work at the level of assembly, bytecode, or machine code. Source-level techniques may need to be adapted for use in this context. One example of a lower-level approach is a bytecode similarity technique that tokenizes the bytecode and then performs what is termed a p-match algorithm, which serves to match preferably long sequences of tokens. There is a need for improved techniques for comparing programs, particularly for methods that are not overly sensitive to minor ordering changes, and particularly at the assembly, bytecode, or machine code level. Moreover, there is a need to make these techniques work in solving problems of search, classification, and phylogeny or classification system construction.
Consider first the phylogeny construction or classification system problem. Software evolves. Frequently the evolution and reuse of code is not recorded. The ancestry and origin of individual portions often cannot be researched. This is true in cases where source code is stolen or plagiarized without permission and without record. It also legitimately happens in large companies and organizations where older legacy software systems are involved. Virus writers rarely record which virus code they have reused, borrowed and modified.
It is often necessary to reconstruct the hereditary relationships between various programs. In biology the relationship graph for species is called the phylogeny. The analogy in software is a software phylogeny. Thus a practical problem encountered frequently is how to find copied but changed code in potentially large bodies of software, and to do so efficiently, and then how to construct models of the phylogenies of the various pieces of software. For instance, in software copyright litigation cases it is critical to be able to trace which code was taken from where. In virus defense it is necessary to recognize which viral code has been seen before. There have been attempts to build phylogenies for malware. In some attempts an n-gram based phylogeny for a collection of computer viruses was developed using a directed acyclic graph whose nodes are the viruses and whose edges map ancestors to descendants and satisfy the property that each code fragment is “invented” only once. These methods assume that if one virus is based on another, long substrings of the ancestor, say 20 bytes or more, will appear in the descendant. In some methods a call flow graph-based similarity approach is developed for clustering malware.
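The long-substring assumption noted above can be illustrated with a minimal sketch that tests whether two byte strings share a contiguous run of at least a given length. The function names are hypothetical, the 20-byte default merely echoes the figure mentioned above, and the quadratic dynamic program is chosen for clarity; it is not the method of the cited phylogeny work.

```python
def longest_common_substring_length(a: bytes, b: bytes) -> int:
    """Length of the longest contiguous byte run that occurs in both inputs."""
    best = 0
    prev = [0] * (len(b) + 1)  # prev[j]: length of the match ending at a[i-1], b[j-1]
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # A match extends the run that ended one byte earlier in both inputs.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best:
                    best = curr[j]
        prev = curr
    return best


def shares_long_fragment(ancestor: bytes, candidate: bytes, min_len: int = 20) -> bool:
    """True if the candidate contains a run of min_len or more bytes from the ancestor."""
    return longest_common_substring_length(ancestor, candidate) >= min_len
```

A check of this kind depends on contiguous runs surviving intact, so inserting or reordering a few bytes inside a shared run can defeat it.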
However, all of the above approaches have shortcomings. One particular sub-problem in this space is constructing such phylogenies for program binaries, bytecodes, or assembly files, i.e., the compiled or executable forms of programs. Virus writers try to hide the phylogeny relationships by several techniques, namely, variable renaming, code encapsulation, code reordering, garbage insertion, and instruction substitution. Consequently, simple n-gram analysis may not detect such changes, e.g., code reordering. Also, calls may be obfuscated through instruction substitution, causing the call flow graph-based similarity approach to fail in many cases. The remaining difficulty is being able to recognize code that has been changed after it was borrowed, and to use these matches to build the true relationship graph between programs.
Similar needs exist for search and classification. Many program search and classification techniques are known. For instance, there currently exists a product from Black Duck Software that searches for matching programs on the basis of extracted information. According to the company's literature, the product works on program binaries primarily by matching hash values for the whole binary, rather than matching portions in ways that can account for the possible changes mentioned above for phylogeny generation. One example of a program classification technique is to try a variety of classifiers using linking information, ASCII strings, or binary n-grams as features upon which the classification is to be made. This technique fails to account for simple variations, such as the different use of registers in two program binaries.
Whether by design or by accident, the prior malware comparison methods have taken approaches that reduce reliance on sequencing information. Methods to compare or align sequences and strings are important tools for molecular phylogenetics. Techniques such as suffix trees, edit distance models, and multiple alignment algorithms are staples for comparing genetic information. These sorts of techniques have been applied to benign computer programs as well, including program texts at the source level, machine level, and in-between. Commercial anti-virus (AV) scanners are also known to use some types of sequence matching in order to classify programs into fine-grained categories (Win32.Evol.A, Win32.Netsky.B, etc.). The matching methods used by those scanners are not believed to be substantially similar to suffix trees, edit distances, and the like. Although those methods are known in bioinformatics, they appear not to be widely used for the purpose of classification or phylogeny model generation for malware.
On the one hand, sequence-based methods may work well for phylogeny model generation when sufficient numbers of sequences are preserved during evolution. Consider, for instance, the two worms named I-Worm.Lohack.{a,b} (the notation X.{y,z} is a shorthand for the sequence X.y,X.z) which we obtained from VX Heavens, the widely available malware collection. Both worms are 40,960 bytes long and differ on only some 700 bytes (less than 2%). While these two particular programs share large blocks of common bytes, it cannot be assumed that all related malware will. Nonetheless, if, in practice, related malware families maintain sufficient numbers of common sequences then phylogeny models generated based on the sequence commonalities may be satisfactory.
On the other hand, many sequence-based methods may not work well for malware if it has evolved through significant code shuffling and interleaving. Signature-based AV scanners have been known to identify malware by searching for particular sequences. This fact is likely to motivate malware authors to destroy easily identifiable sequences between releases so that they can avoid detection. The ability of AV scanners to detect these sequences is likely to have prompted the emergence of polymorphic and metamorphic malware. Some polymorphic and metamorphic malware—such as Win32.ZPerm and WM/Shuffle.A—permute their code during replication. Recognizing the self-constructed derivatives will be difficult if these permutations are not accounted for. It is reasonable to expect that permutation and reordering will continue to be one of the methods in the malware authors' toolbox.
A common technique in text processing is to use n-grams as features for searching, comparing, and machine learning. An n-gram is simply a string of n characters occurring in sequence. In using n-grams for malware analysis, the programs are broken down into sequences of n characters which, depending upon the granularity desired and definitions used, could be raw bytes, assembly statements, source lexemes, lines, and so on. As n decreases towards 1, the significance of sequence information is reduced.
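The relationship between n and ordering sensitivity can be illustrated with a minimal sketch that measures how much of the n-gram set survives when two blocks of a token sequence are swapped. The Jaccard overlap, the function names, and the six-token instruction sequence are assumptions made for illustration only.

```python
def ngram_set(tokens, n):
    """Set of contiguous n-grams (as tuples) drawn from a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Shared fraction of two sets; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def overlap_by_n(tokens_a, tokens_b, sizes):
    """Map each n in sizes to the Jaccard overlap of the two n-gram sets."""
    return {n: jaccard(ngram_set(tokens_a, n), ngram_set(tokens_b, n))
            for n in sizes}


# Hypothetical instruction mnemonics; the second list swaps the two halves.
ORIGINAL = ["push", "mov", "add", "call", "xor", "ret"]
REORDERED = ["call", "xor", "ret", "push", "mov", "add"]
```

For these two sequences, `overlap_by_n(ORIGINAL, REORDERED, [1, 2, 3])` gives an overlap of 1.0 for n = 1, 4/6 for n = 2, and 2/6 for n = 3. Smaller n is thus less affected by the swap, because fewer n-grams straddle the block boundaries.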
In addition to n-grams, other features have been used to generate heuristic classifiers. This collection of past research has demonstrated promising abilities for automatically generating heuristic classifiers that can perform the binary classification decision of separating malicious programs from benign ones. However, the record does not indicate how well these techniques would do at the finer-grained classifications needed for specimen identification (i.e., naming). While some of these methods may perform accurate classification, there is a concern as to whether the methods will generalize if packed or encrypted versions of both malicious and benign programs are used in training or test data. A packer will compress valid executables into a compressed segment and a short segment containing standard unpacking code. Both benign and malicious executables will have similar unpacking code, but will differ on the compressed portions. The compressed portions will have high entropy and, in fact, tend towards resembling random data. Any n-gram matches of bytes from such sections are likely to be accidental. Thus any comparisons or classification decisions made on the basis of n-gram matches are likely to be based primarily on matches to the decompressing segment, which will be common to both benign and malicious code, and will fail to properly distinguish the two classes.
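The notion of high entropy in compressed portions can be made concrete with a minimal sketch of the Shannon entropy of a byte string, expressed in bits per byte. The function name is hypothetical, and the use of entropy as a hint that a section is packed or encrypted is an illustrative assumption rather than part of the techniques described above.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0  # convention chosen here for empty input
    total = len(data)
    # Sum -p * log2(p) over the observed byte values only.
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(data).values())
```

A string consisting of a single repeated byte yields 0.0, while a string in which all 256 byte values occur equally often yields the maximum of 8.0. Sections whose entropy approaches that maximum offer few byte n-grams that recur for any reason other than chance.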