1. Field of the Invention
The present invention generally relates to automatic analysis of computer viruses for the purpose of extracting from them information that is necessary for their detection and eradication and, more particularly, to a method of automatically deriving a virus' means for attaching to a host.
2. Description of the Prior Art
Whenever a new computer virus is discovered somewhere in the world, anti-virus software that checks for known viruses must be updated so as to detect the presence of the virus in infected programs and, possibly, to restore such programs to their original uninfected state. Traditionally, the only way to obtain information that permits detection and removal of the virus has been for human experts to analyze the viral code in minute detail, a procedure that is difficult and time-consuming.
The following description of how viruses typically infect host programs helps to explain what sort of information must be obtained in order to detect and remove computer viruses. Unlike biological viruses, which typically destroy their host cells, computer viruses have a vested interest in preserving the function of their host programs. Any computer virus that causes its host to malfunction would be likely to arouse a user's suspicion and thus bring about its own untimely demise. By far the easiest way for a virus author to ensure this, and the only way used in practice, is to keep the original code intact and add the virus code to it. More specifically, it is almost universal to have the virus code execute first, then pass control back to the victim program. (Because the victim code might terminate in a variety of places under a variety of conditions, it is more difficult to design a virus that runs after the victim, and we know of no cases where this has been done.) For this reason, an infected program usually contains the entire contents of the original file in some form. Almost universally, the infected program contains large contiguous blocks of code from the original host (perhaps with some rearrangement of the original order), interspersed with blocks of virus code. Some pieces of the original host may not appear explicitly but, instead, be encrypted and stored in data regions of the virus. Another important observation is that almost all viruses intersperse host and virus code very consistently, independent of the host, the operating environment, the virus' generation, etc.
Given these characteristics of typical viral infections, it is apparent that, in order to repair an infected program, one simply needs to know the locations of the pieces of the original host and how they ought to be joined to form the original. Additionally, in cases where portions of the host are imbedded, encrypted, in the virus, it is necessary to know where the imbedded bytes are, how they must be decrypted, and where in the reconstructed host they must be placed.
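By way of illustration only, the repair step described above can be sketched as follows. The description format (a list of offset and length pairs given in original host order) and the names `HostPiece` and `repair` are assumptions made for this sketch and do not appear in the original text. Decryption of embedded host bytes is omitted.

```python
from typing import List, Tuple

# Hypothetical description format: each entry means "copy `length` bytes
# starting at `offset` in the infected file"; entries are in host order.
HostPiece = Tuple[int, int]  # (offset_in_infected, length)


def repair(infected: bytes, pieces: List[HostPiece]) -> bytes:
    """Rebuild the host by joining its pieces in their original order."""
    out = bytearray()
    for offset, length in pieces:
        if offset < 0 or length < 0 or offset + length > len(infected):
            raise ValueError("piece lies outside the infected file")
        out += infected[offset:offset + length]
    return bytes(out)


# Toy example: a virus that simply prepends itself to the host.
host = b"ORIGINAL HOST PROGRAM"
virus = b"<VIRUS>"
infected = virus + host
restored = repair(infected, [(len(virus), len(host))])
```

Because the pieces are listed in host order rather than file order, the same routine would also cover an infection that rearranges blocks of the host.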
In order to recognize the presence of a particular virus in a program, one needs to know the locations of the one or more sections of viral code in the infected program, and what each section looks like. Describing the appearance of a viral section is more complicated than might first be supposed. For a variety of reasons, there are often regions within a virus that vary from one instance to another. Data regions are particularly volatile, as they may contain information specific to the particular time or environment in which they are created. A reasonable approach is simply to ignore such regions, and to base recognition solely on invariant regions of the virus.
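The idea of ignoring variable regions can be sketched, under simplifying assumptions, as a byte template with "don't care" positions. The sketch assumes that several instances of the same viral section are already aligned and of equal length; the function names and the toy byte strings are hypothetical.

```python
from typing import List, Optional, Sequence

Template = List[Optional[int]]  # None marks a variable ("don't care") byte


def invariant_template(samples: Sequence[bytes]) -> Template:
    """Keep a byte value where all samples agree; mark the rest variable."""
    if not samples:
        raise ValueError("need at least one sample")
    first = samples[0]
    if any(len(s) != len(first) for s in samples):
        raise ValueError("samples must be aligned and of equal length")
    return [first[i] if all(s[i] == first[i] for s in samples) else None
            for i in range(len(first))]


def matches(template: Template, data: bytes, offset: int = 0) -> bool:
    """Compare data against the template, skipping variable positions."""
    if offset < 0 or offset + len(template) > len(data):
        return False
    return all(t is None or data[offset + i] == t
               for i, t in enumerate(template))


# Toy instances: fixed code surrounding a four-byte date field that varies.
samples = [b"CODEA0101CODEB", b"CODEA0704CODEB", b"CODEA1231CODEB"]
template = invariant_template(samples)
```

A new instance with a different date field would still match this template, whereas a change to any byte of the fixed code would not.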
Another common source of variation is self-garbling; i.e., lightweight encryption techniques intended to avoid detection by virus scanners which use simple pattern matching. In this scheme, a large proportion of the virus is stored encrypted, its appearance governed by a variable key stored in a data region of the virus. The virus applies the appropriate decryption to its encrypted regions before those regions are themselves executed. The fact that the virus is able to transform this "variable" region back into an executable, presumably invariant form means that an invariant form exists, and can potentially be used to recognize that region of the virus. An "invariant" viral region can be described in terms of an invariant byte string, and the decryption procedure and key location (or key-independent invariant function) that produces it from the original, encrypted region.
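As one hedged example of a key-independent invariant function, suppose the garbling is a single-byte XOR with a variable key. This particular scheme is an assumption chosen for illustration; the text above does not name one. XORing each garbled byte with its neighbor cancels the key, so the result is the same for every instance of the virus.

```python
def garble(region: bytes, key: int) -> bytes:
    """Single-byte XOR garbling (assumed scheme); applying it twice undoes it."""
    return bytes(b ^ key for b in region)


def xor_invariant(region: bytes) -> bytes:
    """XOR of adjacent bytes: (p[i]^k) ^ (p[i+1]^k) == p[i] ^ p[i+1]."""
    return bytes(a ^ b for a, b in zip(region, region[1:]))


body = b"invariant viral body"        # stand-in for the decrypted virus code
instance_1 = garble(body, 0x5A)       # two instances with different keys
instance_2 = garble(body, 0xC3)

# The garbled instances differ byte for byte, yet share one invariant form.
same_invariant = xor_invariant(instance_1) == xor_invariant(instance_2)
```

Alternatively, when the key location is known, the key can be read from the instance and the region decrypted directly, which recovers the invariant byte string itself.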
In brief, a virus can be described with accuracy sufficient to permit its detection and removal by characterizing:
1. how it attaches itself to host programs,
2. the form and location of its "invariant" regions, and
3. the location and decryption of host bytes imbedded in the virus.
Heretofore, the only method for obtaining such an intimate knowledge of the nature of the virus has been manual, tedious labor by a human expert, who examines the virus' machine code and perhaps looks at one or more samples of it, and then manually records the required information in a form suitable for use by anti-virus software. Anti-virus researchers and developers are finding themselves just barely able to keep up with the influx of several new computer viruses that are written every day by virus authors working around the clock and around the world. An automated method for characterizing viruses as described above is currently very desirable. Given that virus writers are starting to automate the process of creating new viruses, it may soon become absolutely essential.
1. obtaining a set of "sample pairs", each sample pair consisting of a program infected with the virus and the corresponding original, uninfected program;
2. generating a description of how the virus attaches to host programs;
3. matching viral code across different samples to obtain a description of "invariant" regions of the virus; and
4. locating within the other, variable regions of the virus any host bytes that may have been embedded there, perhaps after encryption.
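The first two steps listed above can be sketched for a single sample pair as follows. This is a minimal illustration and not the method of the invention: it uses the standard-library `difflib.SequenceMatcher` to find blocks of the original host inside the infected program, and the function name, the `min_size` threshold and the toy byte strings are assumptions. Because `SequenceMatcher` reports matching blocks in increasing order in both inputs, this sketch does not recover host blocks that the virus has rearranged.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# (offset_in_host, offset_in_infected, length)
Block = Tuple[int, int, int]


def host_blocks(host: bytes, infected: bytes, min_size: int = 8) -> List[Block]:
    """List blocks of the host that appear unchanged in the infected file."""
    matcher = SequenceMatcher(None, host, infected, autojunk=False)
    # Short matches are dropped because they may be chance coincidences
    # between host bytes and virus bytes.
    return [(m.a, m.b, m.size)
            for m in matcher.get_matching_blocks()
            if m.size >= min_size]


# Toy sample pair: the virus prepends itself to the host.
host = b"The quick brown fox jumps over the lazy dog."
virus = b"\x90\x90\xeb\xfe"
infected = virus + host
blocks = host_blocks(host, infected)
```

Here the single block found places the whole host immediately after the viral code. Repeating this over several sample pairs would show whether the attachment pattern is consistent from host to host.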