There are several well-accepted methods for detecting computer viruses in memory, programs, documents or other potential hosts that might harbor them. One popular method, employed in most anti-virus products, is called "scanning".
A scanner searches potential hosts for a set of one or more (typically several thousand) specific patterns of code called "signatures" that are indicative of particular known viruses or virus families, or that are likely to be included in new viruses. A signature typically consists of a pattern to be matched, along with implicit or explicit auxiliary information about the nature of the match, and possibly transformations to be performed upon the input data prior to seeking a match to the pattern. The pattern could be a byte sequence to which an exact or inexact match is to be sought in the potential host. More generally, the pattern could be a regular expression. The auxiliary information might contain information about the number and/or location of allowable mismatched bytes. It might also restrict the match in various ways; for example the match might be restricted to input data representing computer programs in the .EXE format, and a further restriction might specify that matches only be declared if they occur in a region within one kilobyte on either side of the entry point. The auxiliary information may also specify transformations; for example, the input data might need to be transformed by XORing adjacent bytes together prior to scanning for the indicated byte sequence. (This permits patterns to be located in data that have been encrypted by XORing each byte with any one-byte key; a detailed description can be found in U.S. Pat. No. 5,442,699, entitled "Searching for patterns in encrypted data", issued to William C. Arnold et al. on Aug. 15, 1995.) Other examples of transformations that are in common usage today include running an emulator on the input program to encourage a polymorphic virus to (virtually) decrypt itself prior to scanning for patterns, and parsing a Microsoft Word document to unscramble the macro data prior to scanning for macro viruses.
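The XOR transformation mentioned above can be illustrated with a minimal sketch. It assumes the simplest case: every byte of the host has been XORed with the same unknown one-byte key. The function names `xor_adjacent` and `find_in_xor_encrypted` are hypothetical and chosen only for this illustration; they are not taken from the cited patent. The idea is that (a XOR k) XOR (b XOR k) equals a XOR b, so XORing adjacent bytes cancels the key, and the transformed pattern can be sought in the transformed data.

```python
def xor_adjacent(data: bytes) -> bytes:
    """XOR each byte with its successor; the result is one byte shorter."""
    return bytes(a ^ b for a, b in zip(data, data[1:]))


def find_in_xor_encrypted(host: bytes, plain_pattern: bytes) -> int:
    """Return the offset of plain_pattern in host, or -1 if it is absent.

    Assumption: host may have been encrypted by XORing every byte with a
    single one-byte key. Both inputs are transformed the same way, so the
    key never needs to be known. A plain (key 0) host also matches.
    """
    return xor_adjacent(host).find(xor_adjacent(plain_pattern))


if __name__ == "__main__":
    pattern = b"VIRUSBODY"          # hypothetical plaintext signature
    key = 0x5A                      # unknown to the scanner
    host = bytes(b ^ key for b in b"xxxx" + pattern + b"yyyy")
    print(find_in_xor_encrypted(host, pattern))
```

Because the transformed pattern is one byte shorter than the original, it is slightly less specific, so a real scanner would presumably confirm a hit by other means before declaring a match.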
Typically, a scanner operates by first loading signature data for one or more viruses into memory, and then examining a set of potential hosts for matches to one or more signatures. If any signature is found, further action may be taken to warn the user of the likely presence of a virus, and in some cases to eradicate the virus.
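The load-then-scan operation described above might be sketched as follows. This is a deliberately simplified illustration, not the method of any particular product: signatures are plain byte sequences matched exactly, and the `Signature` class, the `scan_hosts` function, and the virus and file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Signature:
    virus_name: str   # virus or virus family the pattern indicates
    pattern: bytes    # byte sequence to be matched exactly


def scan_hosts(signatures: List[Signature],
               hosts: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Return, for each host with at least one match, the matched virus names."""
    findings: Dict[str, List[str]] = {}
    for host_name, data in hosts.items():
        matched = [s.virus_name for s in signatures if s.pattern in data]
        if matched:
            findings[host_name] = matched
    return findings


if __name__ == "__main__":
    # Step 1: signature data is loaded into memory (hard-coded here).
    signatures = [
        Signature("ExampleVirusA", b"\xde\xad\xbe\xef"),
        Signature("ExampleVirusB", b"\xca\xfe\xba\xbe"),
    ]
    # Step 2: each potential host is examined for matches.
    hosts = {
        "clean.exe": b"\x00\x01\x02\x03",
        "infected.exe": b"\x90\x90\xde\xad\xbe\xef\x90",
    }
    for name, viruses in scan_hosts(signatures, hosts).items():
        print(f"warning: {name} may contain {', '.join(viruses)}")
```

Testing each signature separately against each host is the most direct formulation, but its running time grows with the number of signatures. It also keeps every signature resident in memory for the duration of the scan, which is the storage cost discussed in the following paragraphs.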
As the number of known computer viruses is rapidly growing beyond 10,000, the problem of storing signatures and associated information in memory is becoming increasingly acute. This is especially so for the DOS operating system, which normally allots just 640 kilobytes for all programs and data in system memory, including virus signatures, the anti-virus program code, and anti-viral Terminate and Stay Resident processes, as well as any other unrelated code and data that may have been loaded into the system memory.
One possible solution to the memory shortage in DOS-based systems is for the anti-virus program to use a DOS extender, which permits programs to use more than 640 kilobytes of memory if such "extended memory" is present on the computer. However, although many of today's PCs have extended memory, DOS extenders can slow down the operation of an anti-virus program significantly. In any case, even in operating systems other than DOS, it is still desirable to minimize memory usage, provided that this can be done without significantly degrading the speed of the scanner.
Detection of computer viruses is but one example of the general problem of determining whether a given data string possesses any of a given set of traits. A data string is a sequence of bytes that represent information in a computer, such as a program or document. For data strings that represent computer programs, one example of a trait is the property of being infected with the Jerusalem virus. The set of traits could pertain to all known viruses or some subset of them. A second example of a trait of computer programs is the property of having been compiled by a particular compiler. For data strings that represent text, examples of data traits include the property that the text is in a particular language, such as French, or that it contains information about a particular topic, such as "sports" or "performances of Handel oratorios".
A common method for detecting data traits in data strings is to search for a set of patterns that are indicative of those traits. For computer viruses, the patterns are typically computer virus signatures as described above. The compiler used to generate a given body of machine code is a trait that can also be identified using signatures. To identify languages or subject areas, a text can be scanned for sets of keywords, and the occurrence frequencies of those keywords or of approximate matches to them may then be used to infer which traits, if any, are present. Quite generally, there is a mapping from the located occurrences of the patterns to a (possibly empty) set of inferred data traits. The mapping may or may not take into account the locations of the occurrences within the data string. The mapping may be one-to-one, one-to-many, or many-to-one. For example, in computer virus applications, the mapping is approximately one-to-one, but some signatures occur in several viruses, and sometimes several signatures are used to identify a single virus.
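The mapping from located patterns to inferred traits can be made concrete with a small sketch. It assumes exact byte-sequence patterns and ignores the locations and frequencies of occurrences; the function name `infer_traits`, the pattern bytes, and the trait names are hypothetical. The table includes one pattern shared by two traits (one-to-many) and two patterns that indicate the same trait (many-to-one).

```python
from typing import Dict, Set


def infer_traits(data: bytes, pattern_traits: Dict[bytes, Set[str]]) -> Set[str]:
    """Return the (possibly empty) set of traits implied by patterns found in data."""
    traits: Set[str] = set()
    for pattern, implied in pattern_traits.items():
        if pattern in data:
            traits |= implied   # union of traits implied by every located pattern
    return traits


if __name__ == "__main__":
    pattern_traits = {
        b"\x11\x22\x33": {"VirusA"},             # one-to-one
        b"\x44\x55\x66": {"VirusB", "VirusC"},   # one pattern, several traits
        b"\x77\x88\x99": {"VirusA"},             # second pattern for the same trait
    }
    print(sorted(infer_traits(b"\x00\x44\x55\x66\x00", pattern_traits)))
    print(sorted(infer_traits(b"\x77\x88\x99", pattern_traits)))
    print(sorted(infer_traits(b"\x00\x00", pattern_traits)))
```

A mapping that depended on occurrence frequencies, as in the keyword-based language example, would need counts rather than simple presence tests. That variant is not shown here.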
Regardless of the particulars of the application, it is often the case that efficient detection of data traits within a data string requires a simultaneous search for a large number of patterns within that data string. In such cases, the memory required to store the patterns and any auxiliary data can be significant. It is generally desirable for any detector of data traits to be as efficient in all aspects as possible, and it is especially important for the detector to use as little computer memory as possible.