As the use of computers and networks continues to escalate, and as the Internet grows exponentially, malicious virus attacks on connected computers also appear to be growing at an alarming rate. A serious virus attack can easily and quickly destroy or corrupt all the files on an infected computer.
The Internet has created unprecedented opportunities for accessing and sharing information. Information exchange spans a wide range of content and is carried out continuously and ubiquitously. The primary method of sharing information on the Internet is email. While originally intended as a convenient tool for text messages, email has evolved into the backbone of the Internet. It has become the primary medium not only for communicating ideas, opinions, advertising and scheduling, but also for unauthorized access and malicious attacks. For example, a malicious executable program attached to what appears to be a benign email can be sent to millions of recipients. With only one or two mouse clicks, the program can severely damage computer systems and associated networks.
The actions of these unwanted programs include improperly obtaining access privileges (known as a Trapdoor), obtaining private or sensitive information (known as a Covert Channel), exhausting system resources (known as a Worm), and infecting resident programs (known as a Virus). Some malicious programs perform all of the preceding actions.
There are several ways to determine whether a program can perform malicious functions. In one method, a program being screened is compared with a known “clean” copy of the program. Known malicious code can be detected by virus scanners or compared against a set of verification rules serving as malicious code filters. In another method, dynamic analysis combines the concepts of testing and debugging to detect malicious activities by running a program in a clean-room environment.
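By way of illustration only, the comparison of a screened program with a known “clean” copy can be sketched as a digest comparison. The function names, the sample byte strings, and the choice of SHA-256 are assumptions made for this sketch; the methods described above do not specify how the comparison is carried out.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hexadecimal SHA-256 digest of a program image."""
    return hashlib.sha256(data).hexdigest()


def matches_clean_copy(candidate: bytes, clean_digest: str) -> bool:
    """Report whether the screened program is byte-identical to the clean copy.

    Assumption: a digest of the clean copy was recorded in advance.
    """
    return sha256_of(candidate) == clean_digest


# Hypothetical program images, used only to exercise the sketch.
clean = b"\x4d\x5a\x90\x00 original program bytes"
tampered = clean + b"\xeb\xfe"  # two bytes appended to the clean image

reference = sha256_of(clean)
print(matches_clean_copy(clean, reference))
print(matches_clean_copy(tampered, reference))
```

Such a check can only flag that a program differs from its reference copy; it cannot, by itself, indicate whether the difference is malicious.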
However, some malicious programs may elude detection because they do not match any known signatures. A program's signature may deviate from known signatures, or the program may contain new signatures that have not been previously encountered. As hackers continuously create and modify malicious programs, it becomes necessary not only to detect malicious programs that exactly match known signatures, but also to detect new signatures having features similar to the known signatures.
In a previous attempt to address this problem, Kephart and his colleagues at the High Integrity Computing Laboratory at IBM® used statistical methods to automatically extract computer virus signatures. Their process identified the presence of a virus in an executable file, a boot record, or memory using short identifiers, or signatures, which consist of sequences of bytes in the machine code. A good signature is one that is found in every object infected by the virus, but is unlikely to be found if the virus is not present. Later, researchers from that same group successfully developed a neural-network-based anti-virus scanner to detect boot sector viruses. However, due to system limitations at the time, it was difficult to extend the neural network classifier to detect viruses other than boot sector viruses, which make up a relatively small, but critical, percentage of all computer viruses.
Others have used data mining techniques to analyze a large set of malicious executables instead of only boot-sector or Win32 binaries. The features extracted from the executables were system resource information, embedded text strings, and byte sequences. Three learning algorithms were used to classify the data: (1) an inductive rule-based learner that generates Boolean rules based on feature attributes; (2) a Naïve Bayes classifier that estimates the posterior probability that a testing file is in a class given a set of features; and (3) a multi-classifier system that combines the outputs from several Naïve Bayes classifiers to generate an overall prediction. The results showed that two classifiers, the Naïve Bayes classifier (using text strings as features) and the multi-Naïve Bayes classifier (using only byte sequences), outperformed all the other methods in terms of overall performance as measured by detection rate and false positive rate.
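As a minimal sketch of the second algorithm above, a Naïve Bayes classifier can score a file by combining a class prior with per-feature likelihoods. The class name `ByteNaiveBayes`, the use of single byte values as features, the Laplace smoothing constant, and the toy training data are all assumptions made for illustration; they are not taken from the study described above.

```python
import math
from collections import Counter


class ByteNaiveBayes:
    """Multinomial Naive Bayes over single byte values (illustrative only)."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha        # Laplace smoothing constant (assumed)
        self.counts = {}          # label -> Counter of byte values
        self.totals = {}          # label -> total number of bytes seen
        self.docs = Counter()     # label -> number of training samples

    def fit(self, samples):
        for data, label in samples:
            self.counts.setdefault(label, Counter()).update(data)
            self.totals[label] = self.totals.get(label, 0) + len(data)
            self.docs[label] += 1
        return self

    def log_scores(self, data: bytes) -> dict:
        """Unnormalized log posterior of each class given the bytes."""
        n_docs = sum(self.docs.values())
        scores = {}
        for label in self.docs:
            score = math.log(self.docs[label] / n_docs)  # class prior
            denom = self.totals[label] + self.alpha * 256
            for byte in data:
                score += math.log((self.counts[label][byte] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, data: bytes) -> str:
        scores = self.log_scores(data)
        return max(scores, key=scores.get)


# Hypothetical toy training set; real studies use large labeled corpora.
training = [
    (b"\x00\x01\x02\x03" * 8, "benign"),
    (b"\x01\x02\x03\x04" * 8, "benign"),
    (b"\x90\x90\xeb\xfe" * 8, "malicious"),
    (b"\xeb\xfe\x90\xcc" * 8, "malicious"),
]
model = ByteNaiveBayes().fit(training)
print(model.predict(b"\x90\xeb\xfe\x90"))
print(model.predict(b"\x01\x02\x03"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps byte values unseen in training from driving a class score to negative infinity.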
A recent study conducted by the applicants herein compared the performance of six different feature classification methods against Naïve Bayes, entropy, and product classifiers in distinguishing benign and malicious binaries. Afterwards, an investigation was conducted as to whether extending the features to multiple-byte sequences could improve the classification accuracy. Finally, Support Vector Machine (SVM) classifiers with different kernel functions were applied and their performance was compared. The rationale for choosing byte sequences as candidate features is that these byte patterns are the most accessible and reliable information representing the machine code in an executable. By contrast, using embedded text strings as features, such as header information, program names, authors' names, or comments, is not robust, since they can be easily changed. Some malicious executables intentionally camouflage these signatures by randomly generating these fields to deceive virus scanners.
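For illustration, byte sequences of a chosen length can be extracted from an executable and converted to relative frequencies as sketched below. The function name `ngram_frequencies`, the sliding-window extraction, and the sample bytes are assumptions of this sketch and are not details of the study described above.

```python
from collections import Counter


def ngram_frequencies(data: bytes, n: int = 2) -> dict[bytes, float]:
    """Return the relative frequency of each n-byte sequence in `data`.

    Sequences are taken with a sliding window of width n and step 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    total = len(data) - n + 1
    if total <= 0:
        return {}
    counts = Counter(data[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical five-byte fragment, used only to exercise the sketch.
sample = bytes([0x90, 0x90, 0x90, 0xEB, 0xFE])
freqs = ngram_frequencies(sample, n=2)
for gram, freq in sorted(freqs.items()):
    print(gram.hex(), freq)
```

Setting `n` to 1 yields single-byte frequencies, while larger values give the multiple-byte sequences mentioned above, at the cost of a much larger feature space.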
The majority of existing virus detection technologies can be categorized into the following three types based on their detection methods. The first is Signature Scanning. Programs using this approach examine executable files on a computer for known virus code fragments within their contents. The main disadvantage of this approach is that the scanner cannot detect a new virus if the database does not contain the virus definition. The second is Heuristic Analysis. This method checks objects by analyzing the instruction sequences in the objects' contents. If an instruction sequence matches an instruction sequence of a known virus, an alarm is raised. Presently, this approach produces a large number of false alarms and, therefore, is not practically applicable to operational environments. Virus writers have responded to this method by implementing various techniques, such as encryption and polymorphism, that allow viruses to deceive these heuristic analyzers. The third is Behavior Modeling, a new research direction followed by the applicants herein. This method seeks to establish a general profile for separating benign and malicious code. In comparison with the aforementioned two methods, this method has the capability to detect not only known viruses, but also their variants and unknown viruses. Such programs can achieve high protection against attacks from new viruses. However, the unresolved challenge is to lower their false positive rate to an acceptable level.
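The Signature Scanning approach, and its main disadvantage, can be illustrated with the following minimal sketch. The signature database, the signature names, and the sample byte strings are hypothetical and serve only to show the matching step; no actual scanner or virus definition is represented.

```python
# Hypothetical signature database: name -> byte fragment (illustrative only).
SIGNATURES = {
    "demo-loop": bytes.fromhex("ebfe9090"),
    "demo-marker": b"HYPOTHETICAL-MARKER",
}


def scan(data: bytes, signatures: dict[str, bytes]) -> list[str]:
    """Return the names of all signatures found anywhere in `data`."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


# A file containing a known fragment, and a variant with one byte changed.
infected = b"MZ" + bytes(16) + bytes.fromhex("ebfe9090") + bytes(8)
variant = b"MZ" + bytes(16) + bytes.fromhex("ebfe9190") + bytes(8)

print(scan(infected, SIGNATURES))
print(scan(variant, SIGNATURES))
```

In this sketch the one-byte variant is not reported, because an exact-match scanner recognizes only the fragments already present in its database.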
The present invention solves these problems of the prior art by providing reliable malicious code detection using byte sequence frequencies. It outperforms existing scanning technologies by achieving high accuracy in detecting known viruses and their unknown variants while maintaining a very low false positive rate.