1. Field of the Invention
The present invention relates to the field of computer security and specifically to the analysis of P-code and partially compiled computer programs of the type that execute within a run-time virtual environment, and more specifically to the detection of such programs that exhibit malicious or self-propagating behavior including computer viruses, network worms and Trojans.
2. Discussion of the Related Art
Detection of malicious programs has been a concern throughout the era of the personal computer. With the growth of communication networks such as the Internet and the increasing interchange of data, including the rapid growth in the use of e-mail for communications, the infection of computers through communications or file exchange is an increasingly significant consideration. Infections take various forms, but are typically related to computer viruses, Internet or other network worms, Trojan programs or other forms of malicious code. Recent incidents of e-mail-mediated attacks have been dramatic both for the speed of propagation and for the extent of damage, with Internet service providers (ISPs) and companies suffering service problems and a loss of e-mail capability. In many instances, attempts to adequately prevent file-exchange or e-mail-mediated infections significantly inconvenience computer users. Improved strategies for detecting and dealing with virus attacks are desired.
One conventional technique for detecting computer viruses (including Internet worms and Trojans) is signature scanning. Signature scanning systems use sample code patterns extracted from known malicious code and scan for the occurrence of these patterns in other program code. In some cases program code that is scanned is first decrypted through emulation, and the resulting code is scanned for signatures or function signatures (footprints). A primary limitation of this signature scanning method is that only known malicious code is detected, that is, only code that matches the stored sample signatures of known malicious code is identified as being infected. All viruses or malicious code not previously identified and all viruses or malicious code created after the last update to the signature database will not be detected. Thus, newly created viruses are not detected by this method; neither is malicious code in which the signature, previously extracted and contained in the signature database, has been overwritten.
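The limitation described above can be illustrated with a minimal sketch of a signature scanner. The signature names and byte patterns below are hypothetical placeholders, not patterns from any real signature database, and the scanner is a simplified illustration rather than any particular product's method.

```python
from typing import Dict, List

# Hypothetical signature database: name -> byte pattern extracted from
# known malicious code. The bytes are arbitrary illustration values.
SIGNATURES: Dict[str, bytes] = {
    "Example.A": bytes.fromhex("deadbeef0102"),
    "Example.B": b"\x90\x90\xeb\xfe",
}


def scan_for_signatures(code: bytes, signatures: Dict[str, bytes]) -> List[str]:
    """Return the sorted names of all signatures whose pattern occurs in code."""
    return sorted(name for name, pattern in signatures.items() if pattern in code)


# Code containing a stored pattern is reported.
known_sample = b"\x00\x01" + bytes.fromhex("deadbeef0102") + b"\xff"
# Code whose pattern was never added to the database is not reported.
new_sample = b"\x00\x01" + bytes.fromhex("feedface0304") + b"\xff"

print(scan_for_signatures(known_sample, SIGNATURES))  # ['Example.A']
print(scan_for_signatures(new_sample, SIGNATURES))    # []
```

The second call returns an empty list even though the sample could be malicious, because detection depends entirely on the pattern already being present in the database.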
In addition, the signature analysis technique fails to identify the presence of a virus if the signature is not aligned in the code in the expected fashion. Alternatively, the authors of a virus may obscure the identity of the virus by opcode substitution or by inserting dummy or random code into virus functions. Nonsense code can be inserted that alters the signature of the virus sufficiently to make the virus undetectable by a signature-scanning program, without diminishing the ability of the virus to propagate and deliver its payload. In addition, signature scanning fails where malicious programs have a code structure similar to that of benign application programs. In such a case, the signature scanner will generate large numbers of false positives, or will fail to detect the malicious code if the signature is abandoned to avoid them.
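The effect of inserted dummy code on a byte-pattern match can be shown with a short sketch. The signature bytes are arbitrary and hypothetical, and the filler byte (0x90, the x86 NOP opcode) is used only as a stand-in for any inserted code that does not change program behavior.

```python
# Hypothetical signature; the bytes are arbitrary illustration values.
SIGNATURE = bytes.fromhex("a1b2c3d4e5f6")


def contains_signature(code: bytes, signature: bytes = SIGNATURE) -> bool:
    """Naive scan: report a match only if the exact byte sequence is present."""
    return signature in code


def insert_filler(code: bytes, offset: int, filler: bytes = b"\x90") -> bytes:
    """Return a copy of code with filler bytes inserted at the given offset."""
    return code[:offset] + filler + code[offset:]


original = b"\x55" + SIGNATURE + b"\xc3"
# Inserting a single filler byte inside the matched region breaks the pattern.
obscured = insert_filler(original, offset=4)

print(contains_signature(original))  # True
print(contains_signature(obscured))  # False
```

A single inserted byte is enough to defeat the exact match in this sketch, which is why scanners that rely on fixed byte sequences are sensitive to this kind of obfuscation.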
An example of the signature scanner technique generating large numbers of false positives involves the analysis of malicious or potentially malicious code produced by a compiler that produces P-code or N-code. P-code or pseudocode is compiled and executable within a virtual machine environment. P-code is used in such languages as Java and is compiled to a form that is executable within an appropriate virtual machine in a host computer. N-code is partially compiled native code that requires a run-time environment for execution. Both P-code and N-code are executable within a virtual machine environment and the event procedures constructed by these compilers have a high degree of similarity whether the code is malicious or ordinary. Consequently, signature scanning tends to identify a large number of false positives for P-code and N-code programs.
Another virus detection strategy is integrity checking. Integrity checking systems extract a code sample from known, benign application program code. The code sample is stored, together with information from the program file such as the executable program header and the file length, as well as the date and time of the sample. The program file is checked at regular intervals against this database to ensure that the program file has not been modified. Integrity checking programs generate long lists of modified files when a user upgrades the operating system of the computer or installs or upgrades application software. A major disadvantage of a virus detection system based on integrity checking is that a great many warnings of virus activity issue when any modification of an application program is performed. It is difficult for a user to determine when a warning represents a genuine attack on the computer system. Another drawback of the integrity checking method is that malicious code must modify other files to be detectable, and the method therefore only works with computer viruses, not other forms of malicious code such as Internet worms and Trojan programs, which do not alter other program files. Yet another disadvantage of the integrity checking method is that the virus has to be activated on the target system, that is, running in memory and performing its infection function on the target computer's files, in order to be detectable, since changes to files only occur after the virus is activated.
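The following sketch illustrates the integrity checking scheme described above under simplifying assumptions: files are modeled as in-memory byte strings, the stored sample is the first 16 bytes of the file, and all file names and contents are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

HEADER_BYTES = 16  # assumed size of the stored code sample


@dataclass(frozen=True)
class IntegrityRecord:
    header: bytes       # sample taken from the start of the program file
    length: int         # file length at the time of sampling
    recorded_at: float  # time of the sample (seconds since the epoch)


def make_record(contents: bytes, recorded_at: float) -> IntegrityRecord:
    """Build the database entry for one known, benign program file."""
    return IntegrityRecord(contents[:HEADER_BYTES], len(contents), recorded_at)


def modified_files(database: Dict[str, IntegrityRecord],
                   current: Dict[str, bytes]) -> List[str]:
    """Return the sorted names of files that no longer match their record."""
    changed = []
    for name, record in database.items():
        contents = current.get(name)
        if contents is None:
            continue  # removed files are ignored in this sketch
        if contents[:HEADER_BYTES] != record.header or len(contents) != record.length:
            changed.append(name)
    return sorted(changed)


baseline = {"editor.exe": b"MZ" + b"\x00" * 30, "tool.exe": b"MZ" + b"\x01" * 30}
database = {name: make_record(data, recorded_at=0.0) for name, data in baseline.items()}

# A legitimate upgrade and an infection both appear simply as "modified".
after_upgrade = dict(baseline, **{"editor.exe": b"MZ" + b"\x02" * 40})
after_infection = dict(baseline, **{"tool.exe": baseline["tool.exe"] + b"\xcc" * 8})

print(modified_files(database, after_upgrade))
print(modified_files(database, after_infection))
```

Both calls report one modified file, and nothing in the record distinguishes the upgrade from the infection, which is the source of the excessive warnings noted above.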
Checksum monitoring systems detect viruses by generating a cyclic redundancy check (CRC) value for each program file. Modification of the program file changes the CRC value for that file and it is that change that indicates infection of the program file. Checksum monitors improve on integrity check systems in that it is more difficult for malicious code to defeat the monitoring. On the other hand, checksum monitors exhibit the same limitations as integrity checking in that the method generates many false positives.
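A checksum monitor of the kind described above can be sketched with the CRC-32 function in Python's standard `zlib` module. The file names and contents are hypothetical, and files are modeled as in-memory byte strings for brevity.

```python
import zlib
from typing import Dict, List


def crc_table(files: Dict[str, bytes]) -> Dict[str, int]:
    """Compute a CRC-32 value for each program file."""
    return {name: zlib.crc32(data) for name, data in files.items()}


def changed_files(stored: Dict[str, int], files: Dict[str, bytes]) -> List[str]:
    """Return the sorted names of files whose current CRC differs from the stored one."""
    return sorted(
        name for name, data in files.items()
        if name in stored and zlib.crc32(data) != stored[name]
    )


files = {"app.exe": b"MZ" + bytes(64), "lib.dll": b"MZ" + bytes(range(64))}
stored = crc_table(files)

# Flip the bits of a single byte in one file to simulate a modification.
tampered = bytearray(files["app.exe"])
tampered[10] ^= 0xFF
current = dict(files, **{"app.exe": bytes(tampered)})

print(changed_files(stored, files))
print(changed_files(stored, current))
```

As with integrity checking, the monitor reports only that a file changed; a software upgrade would produce the same result as an infection.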
Behavior interception systems detect virus activity by interacting with the operating system of the target computer and monitoring for potentially malicious behavior. When such behavior is detected, the action is blocked and the user is informed that a potentially dangerous action is about to take place. The user can then allow the potentially malicious code to perform the action. This makes the behavior interception system somewhat unreliable, because the effectiveness of the system depends on user input. In addition, resident behavior interception systems are sometimes detected and disabled by malicious code.
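The dependence on user input can be made concrete with a small sketch. The class, the action names and the `ask_user` callback are all hypothetical; a real behavior interception system hooks operating system services rather than receiving requests through a method call.

```python
from typing import Callable, List

# Hypothetical set of actions the monitor treats as potentially malicious.
SUSPICIOUS = {"modify_executable", "write_boot_sector"}


class BehaviorMonitor:
    def __init__(self, ask_user: Callable[[str], bool]) -> None:
        self.ask_user = ask_user   # returns True if the user allows the action
        self.log: List[str] = []

    def request(self, action: str, perform: Callable[[], None]) -> bool:
        """Block a suspicious action unless the user explicitly allows it."""
        if action in SUSPICIOUS and not self.ask_user(action):
            self.log.append(f"blocked: {action}")
            return False
        perform()
        self.log.append(f"allowed: {action}")
        return True


effects: List[str] = []

# A cautious user declines the prompt, so the action never runs.
cautious = BehaviorMonitor(ask_user=lambda action: False)
cautious.request("modify_executable", lambda: effects.append("infected"))

# A user who accepts the prompt lets the same action through.
careless = BehaviorMonitor(ask_user=lambda action: True)
careless.request("modify_executable", lambda: effects.append("infected"))

print(cautious.log, careless.log, effects)
```

The two monitors are identical; only the user's answer differs, and that answer alone decides whether the action is performed.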
Another conventional strategy for detecting infections is the use of bait files. This strategy is typically used in combination with other virus detection strategies to detect an existing and active infection. This means that the malicious code is presently running on the target computer and is modifying files. The virus is detected when the bait file is modified. Many viruses are aware of bait files and do not modify files that are too small, that are obviously bait files because of their structure, or that have predetermined content in the file name.
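The bait file strategy and its weakness can be sketched as follows. The file names, the size threshold and both infector functions are hypothetical illustrations; files are modeled as in-memory byte strings.

```python
from typing import Dict, List

MIN_TARGET_SIZE = 1024  # assumed size below which a bait-aware virus skips a file


def make_bait(size: int) -> bytes:
    """Create bait file contents of the requested size."""
    return b"MZ" + bytes(size - 2)


def modified_baits(baits: Dict[str, bytes], current: Dict[str, bytes]) -> List[str]:
    """Return the sorted names of bait files whose contents have changed."""
    return sorted(name for name, data in baits.items() if current.get(name) != data)


def naive_infector(files: Dict[str, bytes]) -> Dict[str, bytes]:
    """Model of a virus that appends code to every file it finds."""
    return {name: data + b"\xcc" * 8 for name, data in files.items()}


def bait_aware_infector(files: Dict[str, bytes]) -> Dict[str, bytes]:
    """Model of a virus that skips small files and files named like bait."""
    return {
        name: data if len(data) < MIN_TARGET_SIZE or "bait" in name
        else data + b"\xcc" * 8
        for name, data in files.items()
    }


baits = {"bait0001.com": make_bait(64)}
files = dict(baits, **{"report.exe": b"MZ" + bytes(4096)})

print(modified_baits(baits, naive_infector(files)))
print(modified_baits(baits, bait_aware_infector(files)))
```

In this sketch the bait-aware infector still modifies `report.exe`, yet the bait file is untouched and no infection is reported.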
Another virus detection method is known as sand-boxing. This method is based on the fact that normal programs interact with the operating system through a set of predefined entry points referred to as API calls (application program interface calls). The API calls are made to procedures located in memory whose entry points are maintained by the operating system and stored in an API table. Such an API table is present in each program space created under the operating system. In the sand-boxing method, the API table is replaced (in the program's process space only) with an API table that consists of pointers to the anti-virus protection shell, which then monitors each API call before passing the call to the real operating system API address. This method also has the drawback that the malicious code has to be activated on the target computer's platform before detection can take place. Another drawback of this method is that it works only for those programs that employ the documented manner of calling the system's APIs. Many programs containing malicious code, including viruses, Internet worms and Trojans, do not follow the standard convention and directly call the operating system at an address determined by scanning the operating system memory for an export table contained within kernel32 and other standard system DLLs. Such programs are capable of immediately infecting the target computer during the sand-box examination process.
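The API table replacement described above, and the way a direct call bypasses it, can be modeled in a few lines. This is a conceptual sketch only: the table is an ordinary dictionary, and `open_file` and `delete_file` are hypothetical stand-ins for operating system procedures, not real API bindings.

```python
from typing import Callable, Dict, List, Tuple

CallLog = List[Tuple[str, tuple]]


def real_open_file(path: str) -> str:
    """Stand-in for an operating system procedure."""
    return f"handle:{path}"


def real_delete_file(path: str) -> str:
    """Stand-in for an operating system procedure."""
    return f"deleted:{path}"


def build_api_table() -> Dict[str, Callable[..., str]]:
    """Model of the per-process table of API entry points."""
    return {"open_file": real_open_file, "delete_file": real_delete_file}


def sandbox(table: Dict[str, Callable[..., str]],
            call_log: CallLog) -> Dict[str, Callable[..., str]]:
    """Return a replacement table whose entries log each call, then forward it."""
    def wrap(name: str, target: Callable[..., str]) -> Callable[..., str]:
        def monitored(*args: object) -> str:
            call_log.append((name, args))  # the protection shell sees the call
            return target(*args)           # then passes it to the real entry point
        return monitored
    return {name: wrap(name, target) for name, target in table.items()}


log: CallLog = []
table = sandbox(build_api_table(), log)

# A program that uses the table in the documented manner is observed.
table["open_file"]("a.txt")
# A program that locates the real procedure itself is not.
real_delete_file("b.txt")

print(log)
```

Only the call made through the table appears in the log; the direct call to `real_delete_file` executes without the monitor seeing it, mirroring the bypass described above.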
It is apparent that improved techniques for detecting viruses and other malicious types of code are desirable.