Most simple computer viruses work by copying exact duplicates of themselves to each executable program file they infect. When an infected program executes, the virus gains control of the computer and attempts to infect other files. If it locates a target executable file, it copies itself byte-for-byte into that file. Because this type of virus replicates identical copies of itself each time it infects a new file, it can be easily detected by searching files for a specific string of bytes (i.e., a "signature") that has been extracted from the virus.
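The signature search described above can be sketched in a few lines. This is a minimal illustration, not any product's scanner: the signature names and byte strings, and the function name `scan_bytes`, are invented for the example.

```python
from typing import Dict, List

# Hypothetical signature database: the names and byte strings are invented
# for illustration and do not correspond to any real virus.
SIGNATURES: Dict[str, bytes] = {
    "Example.A": bytes.fromhex("deadbeef0102"),
    "Example.B": bytes.fromhex("cafef00d0304"),
}


def scan_bytes(data: bytes, signatures: Dict[str, bytes]) -> List[str]:
    """Return the names of all signatures that occur anywhere in data."""
    return [name for name, sig in signatures.items() if sig in data]


clean = b"MZ" + bytes(64)                   # stand-in for a clean file
infected = clean + SIGNATURES["Example.A"]  # an identical copy was appended

print(scan_bytes(clean, SIGNATURES))     # []
print(scan_bytes(infected, SIGNATURES))  # ['Example.A']
```

Because the copy is byte-for-byte identical in every infected file, a plain substring test is sufficient for this class of virus.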
Simple (non-polymorphic) encrypted viruses comprise a decryption routine (also known as a decryption loop) and an encrypted viral body. When a program file infected with a simple encrypting virus executes, the decryption routine gains control of the computer and decrypts the encrypted viral body. The decryption routine then transfers control to the decrypted viral body, which is capable of spreading the virus. The virus is spread by copying the identical decryption routine and the encrypted viral body to the target executable file. Although the viral body is encrypted and thus hidden from view, these viruses can be detected by searching for a signature from the unchanging decryption routine.
Polymorphic encrypted viruses ("polymorphic viruses") comprise a decryption routine and an encrypted viral body which includes a static viral body and a machine-code generator often referred to as a "mutation engine." Initially, the operation of a polymorphic virus is similar to the operation of a simple (non-polymorphic) encrypted virus. When a program file infected with a polymorphic virus executes, the decryption routine gains control of the computer and decrypts the encrypted viral body. The decryption routine then transfers control of the computer to the decrypted viral body, which is capable of spreading the virus. However, the virus is spread by copying a newly generated decryption routine along with the encrypted viral body to the target executable file. The new decryption routine is generated on the fly by the mutation engine. In many polymorphic viruses, the mutation engine generates decryption routines that are functionally the same for all infected files but use different sequences of instructions to perform that function. Common mutation strategies employed by the mutation engine include reordering instructions, substituting equivalent instructions or equivalent sequences of instructions, and inserting instructions that have no effect on functionality. Because of these mutations, these viruses cannot be detected simply by searching for a signature from a decryption routine, since each decryption routine may have a different signature.
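Why a decryption-routine signature fails here can be seen with a toy model. The sketch below assumes a made-up single-accumulator instruction set (the opcode numbers, `encode`, and `run` are all hypothetical) and two hand-written variants that differ only by an equivalent substitution and by inserted no-effect instructions; it does not model a real mutation engine.

```python
# Toy single-accumulator instruction set; the opcode numbers are hypothetical.
OPCODES = {"NOP": 0x00, "XOR": 0x01, "ADD": 0x02, "SUB": 0x03}


def encode(program):
    """Serialize (op, arg) pairs to bytes, two bytes per instruction."""
    return bytes(b for op, arg in program for b in (OPCODES[op], arg))


def run(program, value):
    """Interpret the program on an 8-bit accumulator and return the result."""
    acc = value
    for op, arg in program:
        if op == "XOR":
            acc ^= arg
        elif op == "ADD":
            acc = (acc + arg) & 0xFF
        elif op == "SUB":
            acc = (acc - arg) & 0xFF
        # NOP changes nothing
    return acc


# Variant A transforms a byte with a single XOR.
variant_a = [("XOR", 0x5A)]
# Variant B does the same job: 0x0F ^ 0x55 == 0x5A (equivalent substitution),
# and the NOP and the ADD/SUB pair have no net effect (inserted instructions).
variant_b = [("NOP", 0), ("XOR", 0x0F), ("ADD", 7), ("SUB", 7), ("XOR", 0x55)]

same_function = all(run(variant_a, v) == run(variant_b, v) for v in range(256))
signature_a = encode(variant_a)

print(same_function)                     # True
print(signature_a in encode(variant_b))  # False
```

The two variants compute the same result for every input byte, yet the byte string taken from one does not occur in the other, which is the situation that defeats a fixed signature.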
In order to detect the growing number of polymorphic viruses, antivirus software companies are beginning to adopt emulator-based antivirus technology, also known as Generic Decryption (GD) technology. The GD scanner works in the following manner. Before a program suspected of being infected is executed on the actual CPU (central processing unit) of the computer, the GD scanner loads the program into a software-based CPU emulator which acts as a simulated virtual computer. The program is allowed to execute freely within this virtual computer. If the program does in fact contain a polymorphic encrypted virus, the decryption routine is allowed to decrypt the viral body. The GD scanner can then detect the virus by searching through the virtual memory of the virtual computer for a signature from the decrypted viral body.
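The emulate-then-search sequence can be illustrated with a very small virtual machine. Everything in this sketch is an assumption made for illustration: the five-instruction set, the names `emulate` and `gd_scan`, the XOR key, and the harmless placeholder text standing in for a viral body. A real GD scanner emulates actual CPU instructions.

```python
def emulate(program, memory, max_steps):
    """Run a toy program against a copy of memory; real memory is untouched."""
    mem = bytearray(memory)
    regs = {}
    pc = steps = 0
    while pc < len(program) and steps < max_steps:
        op, *args = program[pc]
        steps += 1
        pc += 1
        if op == "SET":        # SET reg, value
            regs[args[0]] = args[1]
        elif op == "XORM":     # XORM reg, key: mem[reg] ^= key
            mem[regs[args[0]]] ^= args[1]
        elif op == "INC":      # INC reg
            regs[args[0]] += 1
        elif op == "JLT":      # JLT reg, limit, target: jump if reg < limit
            if regs[args[0]] < args[1]:
                pc = args[2]
        elif op == "HALT":
            break
    return bytes(mem), steps


def gd_scan(program, memory, signature, max_steps):
    """Emulate first, then search the virtual memory for the signature."""
    virtual_memory, _ = emulate(program, memory, max_steps)
    return signature in virtual_memory


KEY = 0x5A
BODY = b"HELLO-BODY"   # harmless placeholder for a static body
SIGNATURE = b"BODY"    # signature extracted from the decrypted body
encrypted = bytes(b ^ KEY for b in BODY)

# Toy decryption loop: XOR every byte of memory with KEY.
decryptor = [
    ("SET", "i", 0),
    ("XORM", "i", KEY),
    ("INC", "i"),
    ("JLT", "i", len(encrypted), 1),
    ("HALT",),
]

print(SIGNATURE in encrypted)                         # False
print(gd_scan(decryptor, encrypted, SIGNATURE, 1000)) # True
print(gd_scan(decryptor, encrypted, SIGNATURE, 5))    # False
```

The last call shows the trade-off: if emulation stops before the loop has finished, the body is only partly decrypted and the signature is not yet visible.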
One problem encountered in implementing GD technology is reducing the number of instructions of a program that must be simulated before a determination of uninfected status can be reliably made. Generally, GD scanners use a set of rules to determine how long to simulate each program. For example, if during the initial stage of the emulation the program appears to contain a decryption routine, then the GD scanner should simulate the program longer to give the virus a sufficient number of instructions in which to decrypt itself. Conversely, if during the initial stage of the emulation the program appears strongly to be uninfected (a "clean" program), then the GD scanner should abort emulation almost immediately.
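A rule set of this kind might be sketched as follows. The two boolean observations and all three instruction budgets are hypothetical values chosen for illustration; the text does not specify how the initial-stage judgment is made or what limits a real scanner uses.

```python
# Hypothetical instruction budgets; real scanners tune these empirically.
MINIMAL_BUDGET = 100         # stop almost immediately
DEFAULT_BUDGET = 10_000      # no strong evidence either way
EXTENDED_BUDGET = 1_000_000  # give a possible decryptor time to finish


def emulation_budget(looks_like_decryptor: bool, looks_clean: bool) -> int:
    """Choose how many instructions to emulate from initial-stage observations."""
    if looks_like_decryptor:
        # A possible decryption loop outweighs any sign of cleanliness.
        return EXTENDED_BUDGET
    if looks_clean:
        return MINIMAL_BUDGET
    return DEFAULT_BUDGET


print(emulation_budget(looks_like_decryptor=True, looks_clean=False))   # 1000000
print(emulation_budget(looks_like_decryptor=False, looks_clean=True))   # 100
print(emulation_budget(looks_like_decryptor=False, looks_clean=False))  # 10000
```

Checking for a possible decryptor first is a deliberate choice in this sketch: a missed virus costs more than a slow scan, so the cautious rule takes priority.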
Unfortunately, some uninfected programs have machine language instructions that look like decryption loops. In addition, some data files contain binary data which may look like decryption loops, and in some operating systems, such as MS-DOS, data files cannot generally be distinguished from executable files. If the GD scanner detects a possible decryption loop in a program (or in a data file accessed by a program), then it should continue to simulate the program (or data file) until it reliably determines that the program is uninfected. This emulation may take many seconds and may substantially inconvenience the computer user.
Thus, one motivation for the present invention is to develop GD technology that simulates as few instructions of a program (or data file) as possible before being able to reliably determine that it is uninfected. This goal is difficult to attain because the polymorphic decryption routine may take many different forms and can therefore be difficult to identify without emulating a large number of instructions.
Another problem in implementing GD technology is avoiding redundant emulation of instructions for a program (or data file) that has previously been determined to be uninfected. Frequently, users or programs access the same file over and over again. For example, a user may run the same electronic mail or word processing program many times during a computing session. Furthermore, these programs tend to repeatedly access the same data files. For instance, when the commonly used Lotus cc:mail program for Windows is first launched by the user, it may open and close the configuration file named "CCMAIL.CFG" twenty-eight separate times. If a GD-based real-time antivirus scanner is also running, it will typically scan the CCMAIL.CFG file each of the twenty-eight times it is opened. In a typical case, each scan may take only several milliseconds, but it may take several seconds if the file contains data that looks like a decryption loop. Repeated for each of the twenty-eight opens, this multi-second delay would compound into an unacceptable delay of several minutes.
Thus, another motivation for the present invention is to develop GD technology that avoids the redundant emulation of instructions for those programs or data files that were previously determined to be uninfected.
Novell's NetWare software is a commonly used network operating system which identifies each file on the server by a unique identification number. The current version of the Norton Anti-Virus (NAV) software that is used in conjunction with NetWare utilizes a cache to store the identification numbers of those files on a server that have previously been determined by scanning to be virus-free. If the identification number of a target file is in the cache, the NAV software avoids redundantly rescanning the file.
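A cache keyed by file identification number can be sketched as a simple set. This is an illustration of the general idea only, not the NAV implementation: the class and function names, the file number 1234, and the stand-in scanner are all hypothetical, and handling of files that are modified after being cached is omitted.

```python
class CleanFileCache:
    """Remembers identification numbers of files already found virus-free."""

    def __init__(self):
        self._clean_ids = set()

    def is_known_clean(self, file_id):
        return file_id in self._clean_ids

    def mark_clean(self, file_id):
        self._clean_ids.add(file_id)


def scan_on_open(file_id, data, cache, scanner):
    """Return True if the file is considered clean, scanning only on a miss."""
    if cache.is_known_clean(file_id):
        return True            # cache hit: skip the redundant rescan
    is_clean = scanner(data)
    if is_clean:
        cache.mark_clean(file_id)
    return is_clean


scan_calls = []  # records each real scan so the saving can be counted


def counting_scanner(data):
    """Stand-in scanner (hypothetical signature check) that logs each call."""
    scan_calls.append(len(data))
    return b"\xde\xad\xbe\xef" not in data


cache = CleanFileCache()
results = [scan_on_open(1234, b"config data", cache, counting_scanner)
           for _ in range(28)]

print(all(results))     # True
print(len(scan_calls))  # 1
```

With the cache in place, twenty-eight opens of the same file trigger a single scan; only files found clean are cached, so an infected file is rescanned on every open.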
However, many operating systems, including Windows 3.1 and Windows 95, do not have unique numbers to identify each file. For such operating systems, filenames, instead of file identification numbers, may be stored in a cache. But filenames may be hundreds of bytes in length in modern operating systems, such as Windows 95, and indexing by such long filenames does not use storage space economically. Moreover, in order to maintain such a cache, the antivirus software must monitor all requests to modify the files whose filenames are currently in the cache. If a file whose filename is in the cache is modified, the filename must be removed from the cache. Such monitoring complicates and slows down the antivirus software.
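Both drawbacks of a filename-keyed cache show up even in a minimal sketch. The class below is hypothetical and assumes that some monitoring layer (not shown) calls `on_file_modified` for every write to a cached file; the example path is invented.

```python
class FilenameCleanCache:
    """Cache of filenames previously found clean, keyed by the full name."""

    def __init__(self):
        self._clean_names = set()

    def mark_clean(self, filename):
        self._clean_names.add(filename)

    def is_known_clean(self, filename):
        return filename in self._clean_names

    def on_file_modified(self, filename):
        """Must be called on every modification, or stale entries remain."""
        self._clean_names.discard(filename)

    def key_storage_bytes(self):
        """Bytes needed just to hold the cached names (UTF-8 assumed)."""
        return sum(len(name.encode("utf-8")) for name in self._clean_names)


cache = FilenameCleanCache()
long_name = "C:\\Documents\\" + "a" * 200 + ".cfg"  # invented long path

cache.mark_clean(long_name)
print(cache.is_known_clean(long_name))  # True
print(cache.key_storage_bytes())        # 217

cache.on_file_modified(long_name)       # file changed: entry must be dropped
print(cache.is_known_clean(long_name))  # False
```

A single entry here costs 217 bytes of key storage, and correctness depends entirely on the modification hook being invoked for every write, which is the monitoring burden noted above.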