A computer virus has been defined by Frederick B. Cohen as a program that can infect other programs by modifying them to include a, possibly evolved, version of itself (A Short Course on Computer Viruses, page 11).
As employed herein, a computer virus is considered to include an executable assemblage of computer instructions or code that is capable of attaching itself to a computer program. The subsequent execution of the viral code may have detrimental effects upon the operation of the computer that hosts the virus. Some viruses have an ability to modify their constituent code, thereby complicating the task of identifying and removing the virus.
Another type of undesirable software entity is known as a "Trojan Horse". A Trojan Horse is a block of undesired code that is intentionally hidden within a block of desirable code.
A widely used method for the detection of computer viruses and other undesirable software entities is known as a scanner. A scanner searches through executable files, boot records, memory, and any other areas that might harbor executable code, for the presence of known undesirable software entities. Typically, a human expert examines a particular undesirable software entity in detail and then uses the acquired information to create a method for detecting it wherever it might occur. In the case of computer viruses, Trojan Horses, and certain other types of undesirable software entities, the detection method that is typically used is to search for the presence of one or more short sequences of bytes, referred to as signatures, which occur in that entity. The signature(s) must be chosen with care such that, when used in conjunction with a suitable scanner, they are highly likely to discover the entity if it is present, but seldom give a false alarm, known as a false positive. The requirement of a low false positive rate amounts to requiring that the signature(s) be unlikely to appear in programs that are normally executed on the computer. Typically, if the entity is in the form of binary machine code, a human expert selects signatures by transforming the binary machine code into a human-readable format, such as assembler code, and then analyzing the human-readable code. In the case where that entity is a computer virus, the expert typically discards portions of the code which have a reasonable likelihood of varying substantially from one instance of the virus to another. Then, the expert selects one or more sections of the entity's code which appear to be unlikely to appear in normal, legitimate programs, and identifies the corresponding bytes in the binary machine code so as to produce the signature(s).
The expert may also be influenced in his or her choice by sequences of instructions that appear to be typical of the type of entity in question, be it a computer virus, Trojan Horse, or some other type of undesirable software entity.
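By way of illustration only, the signature scanning described above may be sketched as follows. The sketch assumes that each signature is a fixed sequence of bytes and that the data to be scanned is already held in memory. The signature table, the entity names, and the functions scan and false_positive_rate are hypothetical and are introduced solely for this example; they correspond to no actual virus or scanner product. The second function gives a crude empirical estimate of how often a candidate signature would raise a false alarm on a collection of programs known to be uninfected.

```python
from typing import Dict, List, Tuple

# Hypothetical signature table: entity name -> byte sequence.
# The names and bytes are invented for illustration and match no real entity.
SIGNATURES: Dict[str, bytes] = {
    "ExampleVirus.A": bytes.fromhex("deadbeef0badf00d"),
    "ExampleTrojan.B": bytes.fromhex("cafebabe8badf00d"),
}


def scan(data: bytes, signatures: Dict[str, bytes] = SIGNATURES) -> List[Tuple[str, int]]:
    """Return (name, offset) for every occurrence of each signature in data."""
    hits: List[Tuple[str, int]] = []
    for name, sig in signatures.items():
        start = data.find(sig)
        while start != -1:
            hits.append((name, start))
            # Resume one byte later so overlapping occurrences are also found.
            start = data.find(sig, start + 1)
    return hits


def false_positive_rate(signature: bytes, clean_programs: List[bytes]) -> float:
    """Fraction of known-clean programs that contain the signature."""
    if not clean_programs:
        raise ValueError("need at least one clean program")
    flagged = sum(1 for program in clean_programs if signature in program)
    return flagged / len(clean_programs)
```

A signature with a non-zero rate on such a collection would be a poor choice, since it appears in programs that are normally executed on the computer. A rate of zero on a small collection does not, however, establish that the signature is safe in general.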
However, the accelerating rate at which new viruses, and new variations on previously-known viruses, are appearing creates a heavy burden for human experts. Furthermore, the efficacy of virus scanning is impaired by the time delay between when a virus is first introduced into the world's computer population and when a signature capable of recognizing the virus is distributed to an appreciable fraction of that population.
It is thus an object of this invention to provide an automatic, computer-implemented procedure for extracting and evaluating computer virus signatures.
It is a further object of this invention to provide a statistical, computer-implemented technique for automatically extracting signatures from the machine code of a virus and for evaluating the probable effectiveness of the extracted signatures for identifying a subsequent instance of the virus.
It is another object of this invention to provide a statistical, computer-implemented technique for automatically evaluating computer virus signatures that have been preselected by a manual or other procedure.