One of the most common techniques for detecting a computer virus in a given program is to scan the machine-level representation of that program (i.e., its representation as bytes) for patterns that are present in a set of known viruses and unlikely to be found in normal, uninfected programs. Typically, when a new virus appears, a human expert analyzes it and selects a pattern of bytes in the virus that is deemed unlikely to occur by chance in uninfected programs. (This pattern is referred to as a "signature" for the virus.) The new signature must then be distributed to the customers that use the anti-virus software. It can take several months before a substantial fraction of the customers receive the update. Thus, anti-virus software that relies on particular knowledge of specific viruses can lag substantially behind the discovery of a new virus.
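The signature-scanning technique described above can be illustrated with a minimal sketch. The function name `scan_for_signatures`, the signature names, and the byte patterns are all hypothetical and chosen only for illustration; they do not correspond to any real virus or to any particular anti-virus product.

```python
from typing import Dict, List


def scan_for_signatures(data: bytes, signatures: Dict[str, bytes]) -> List[str]:
    """Return the names of all signatures whose byte pattern occurs in data."""
    return [name for name, pattern in signatures.items() if pattern in data]


# Hypothetical signature table (made-up byte patterns, for illustration only).
SIGNATURES = {
    "example-virus-a": bytes.fromhex("deadbeef0102"),
    "example-virus-b": bytes.fromhex("cafebabe99"),
}

# A stand-in for an uninfected program: the byte values 0x00 through 0x3f.
clean_program = bytes(range(64))

# The same program with the first hypothetical signature spliced in.
infected_program = clean_program + SIGNATURES["example-virus-a"] + clean_program

if __name__ == "__main__":
    print(scan_for_signatures(clean_program, SIGNATURES))
    print(scan_for_signatures(infected_program, SIGNATURES))
```

Such a scanner can only report viruses whose signatures already appear in its table, which is why the delay in distributing new signatures matters.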
One solution to this problem is to construct a generic virus detector that flags programs that contain features that are deemed to be virus-like. The generic classification of a program as containing a virus (or not) is bound to be somewhat less certain than classification based on signatures for specific known viruses, but if the rate of false positives (the fraction of the time that the detector falsely accuses an uninfected program of being infected) is very low, such a detector can be an invaluable tool.
The current state of the art in generic virus detection is for a human expert to identify a number of features that are present, or conceivably could be present, in many viruses. The occurrence frequency of these features in executable data is taken as input to a classifier (also designed by the human expert) that classifies the executable data as "infected" or "not infected". If the executable data is classified as "infected", the classifier may attempt to make finer distinctions, placing the putative virus into one of a number of generic virus families.
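A hand-built generic detector of the kind just described might be sketched as follows. This is only an illustrative sketch under stated assumptions: the feature names, byte patterns, weights, and threshold are hypothetical values standing in for choices a human expert would make, and the weighted-sum decision rule is one simple possibility rather than the classifier used by any actual product.

```python
from typing import Dict


def feature_counts(data: bytes, features: Dict[str, bytes]) -> Dict[str, int]:
    """Count non-overlapping occurrences of each feature pattern in data."""
    return {name: data.count(pattern) for name, pattern in features.items()}


def classify(
    data: bytes,
    features: Dict[str, bytes],
    weights: Dict[str, float],
    threshold: float,
) -> str:
    """Classify data by comparing a weighted sum of feature counts to a threshold."""
    counts = feature_counts(data, features)
    score = sum(weights[name] * counts[name] for name in features)
    return "infected" if score >= threshold else "not infected"


# Hypothetical expert-chosen features (illustrative byte patterns only).
FEATURES = {
    "feature-1": b"\xb4\x40",
    "feature-2": b"\xcd\x21",
}
# Hypothetical expert-chosen weights and decision threshold.
WEIGHTS = {"feature-1": 1.0, "feature-2": 0.5}
THRESHOLD = 3.0

if __name__ == "__main__":
    print(classify(b"\x90" * 10, FEATURES, WEIGHTS, THRESHOLD))
    print(classify(b"\xb4\x40\xcd\x21" * 3, FEATURES, WEIGHTS, THRESHOLD))
```

In this sketch both the feature list and the weights are set by hand, so the quality of the detector depends entirely on the expert's choices.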
The method of constructing a generic virus detector purely through the use of human expertise is deficient for at least two reasons. First, the viruses in existence (in fact, just those that operate in a DOS environment) number in the thousands, making it extremely difficult for humans to develop a full list of characteristic viral features. Second, given a large number of features taken from a large number of viruses, it is very difficult for a human to construct a classifier that combines these features into an optimal decision as to whether a particular file contains a virus. Thus, there is presently a need for an automated method for constructing a generic virus detector that would either supplement or supplant human expertise. Heretofore, however, the development of such an automatic method has generally been thought to be either very difficult or impossible.
The field of generic virus detection is just one of many contexts in which it is desirable to construct automatically a classifier capable of distinguishing among two or more classes of data strings. Another example is in the field of reverse engineering of software. Often, software is written in a high-level language such as C, FORTRAN, etc., and then compiled into a machine-level binary representation. In some situations, e.g., checking for patent infringement or analyzing a virus, one would like to reverse this procedure, i.e., obtain source code from the machine code. In order to do this, one would need to know which compiler had been used in the original conversion from high-level source to machine-level binary code, and then use a de-compiler that had been specifically constructed for that particular compiler.
One essential part of this procedure is to determine from the machine code of a program the compiler that was used to generate it. This is feasible because each compiler typically generates machine code in a fairly idiosyncratic way that is in principle distinguishable from that of other compilers. In many situations, humans are able to determine the compiler simply by looking for text strings embedded in the machine code that identify the compiler. However, if the program's author were deliberately trying to hide illegal or immoral activity such as patent infringement or virus writing, he might intentionally modify or eliminate text strings that indicate the compiler. In this case, determination of the compiler must rely on identification of machine-code features that are specific to that compiler, but not to other compilers. It would be very difficult for a human to be familiar with the particular machine-code features that are peculiar to one compiler or another. Thus, there is a need for a method of automatically constructing a classifier capable of determining the compiler that was used to generate a given machine-code representation of a program.
These examples, taken from the fields of generic virus detection and reverse software engineering, are illustrative of the general need for an automatic method for constructing data-string classifiers.