The present invention relates to categorizing computer files as malicious (malware) or benign (whiteware), and more particularly to the use of Bayesian statistical methods to classify such computer files.
In response to the demand for convenient communication and data exchange, the number of personal computers and the frequency of internet usage have increased steadily. Unfortunately, this growth has also increased the surface area exposed to sponsored and unsponsored hackers seeking to exploit vulnerabilities known to exist in applications and operating systems.
Normally, the motivation for exploitation is persistent access to, and control of, a personal computer by implanting and hiding malicious software, known as malware. Once this is accomplished, the malware is typically programmed to propagate throughout the sub-network which connects the initially infected computer to others. At any stage in this process, the malware almost always performs malicious actions, such as accepting commands and instructing computers to carry out various tasks, for example returning lists that enumerate processes, files, services, registry keys, etc. Other tasks include modifying and deleting files and sending files to a controller at a remote address. There are many such functions performed by malware. A typical modern malware sample is replete with enough functionality to accomplish almost any task normally associated with the job duties of a professional network administrator.
The networks of government organizations and large companies are vast. Such networks are composed of an extremely large number of computers running thousands of software packages. As time goes on and needs evolve, more and more software is introduced across hosts in the network. The surface area of exposure in the collective network increases in a controlled and easily measurable way in proportion to the number of servers added that host services and web sites accessible to users in numerous sub-nets. However, the surface area grows unpredictably in proportion to the number of individual computer users on the network, given their tendencies to respond to suspicious emails, their ill-advised use of the internet, and their decisions to download or install files and software from unscreened, questionable websites. Whether it grows through planned, calculated risk or through the wide variance in human activity, the surface area of exposure to cyber threats keeps increasing, making it a harsh fact of life that malware finds its way onto even the most carefully protected networks.
Despite the pervasiveness of malware, vast controlled networks are also useful as they collectively constitute a very well controlled and precisely defined baseline. Taken together, the files across operating systems form a “whiteware” repository, providing enough files to generate very strong statistics for hundreds of properties associated with files of interest. These properties are called observables. Equally strong statistics may be computed from readily accessible malware repositories existing in both the public and the private domains.
Parent application Ser. No. 13/007,265, now U.S. Pat. No. 8,549,647, primarily discussed portable executable files. The methods described in that application are equally applicable to any computer file, both executable and otherwise. For example, malware may be found inside Adobe Acrobat and compressed Zip files.
Because the majority of computers are Windows based, most malware targets Windows Operating Systems. Although malware must adhere somewhat to Windows file formats, malware samples in the wild are frequently found to contain anomalies in fields and code/data buffers. These aberrations can be used to classify computer files as malware or whiteware.
Traditionally, a primary method for identifying malware has been the signature-based malware scanner. Signature-based scanners rely on the “signatures” (unique features or data strings) of known malware so that future examples of the same malware can be correctly classified. However, such detectors face a scaling problem in the form of signature proliferation, given the millions of malware variants. Consequently, signature-based detectors necessarily operate with high-maintenance databases requiring updates on a regular basis. These databases are normally extremely bloated, even when signatures are aged out. Moreover, signature-based detectors miss almost all malware files which are not already included in the signature base. This means that almost all malware will evade a signature-based scanner on the day it is released.
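By way of illustration only, the limitation described above can be sketched in a few lines of Python. The signature names, byte strings, and file contents below are hypothetical and are not drawn from any real scanner or malware sample; the sketch merely assumes that a signature is a byte string searched for in the contents of a file.

```python
# Hypothetical signature database: byte strings assumed to be unique to known malware.
SIGNATURES = {
    "ExampleFamily.A": b"\xde\xad\xbe\xef\x13\x37",
    "ExampleFamily.B": b"EVIL_PAYLOAD_V1",
}


def scan(data: bytes) -> list:
    """Return the name of every known signature found in the file contents."""
    return [name for name, sig in SIGNATURES.items() if sig in data]


# A file containing a known signature is detected.
known_sample = b"MZ\x90\x00" + b"EVIL_PAYLOAD_V1" + b"\x00" * 8

# A variant differing by one byte is missed until the database is updated.
new_variant = b"MZ\x90\x00" + b"EVIL_PAYLOAD_V2" + b"\x00" * 8

print(scan(known_sample))
print(scan(new_variant))
```

In this sketch the known sample matches one signature, while the variant matches none, which is the behavior that leaves such a scanner out of date for newly released malware.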
As just described, a very serious problem with signature-based detectors is that they are inherently almost always out-of-date for new malware threats, and always out-of-date for so-called zero-day malware threats. This is generally not a serious problem for individuals and commercial enterprises, as it usually takes time for new malware threats to propagate throughout the Internet. Many organizations, however, such as the military, are often specifically targeted and subject to zero-day attacks. Such organizations need to identify malware before the specific malware threat has been recognized, a signature determined, the signature added to a malware signature database, and the new database propagated to computers across the Internet.
Accordingly, there exists a need for an approach devoid of signatures. One prior art method which aims to identify malware is the anomaly detector. Anomaly detectors are built from statistics based on observables belonging only to a repository of whiteware computer files. Anomaly detectors use the statistics from whiteware files in an attempt to identify malware by differences between the malware observables and whiteware statistics. However, when statistics are generated from a repository of uncontaminated whiteware, without any reference to malware, the approach has a high failure rate because there is too much overlap between properties and behaviors of whiteware and malware files.
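The overlap problem can be illustrated with a minimal sketch of a whiteware-only anomaly detector. Everything in it is an assumption made for illustration: the single numeric observable, the made-up sample values, and the rule of flagging any value more than two standard deviations from the whiteware mean. It is not the method of any particular prior art detector.

```python
import statistics

# Made-up values of one hypothetical numeric observable, measured on whiteware only.
whiteware_values = [4.1, 5.0, 5.6, 6.2, 6.8, 7.1, 5.9, 6.4]

mean = statistics.mean(whiteware_values)
stdev = statistics.stdev(whiteware_values)


def is_anomalous(value: float, threshold: float = 2.0) -> bool:
    """Flag a value lying more than `threshold` standard deviations from the whiteware mean."""
    return abs(value - mean) / stdev > threshold


# Made-up values of the same observable for four malware files.
malware_values = [6.5, 7.0, 8.5, 6.1]

flags = [is_anomalous(v) for v in malware_values]
print(flags)
```

Three of the four hypothetical malware values fall inside the range spanned by the whiteware samples, so only one file is flagged. Because no malware statistics are consulted, the detector cannot tell whether a value that is common in whiteware is also common in malware.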
As such, there exists a need for an approach that uses whiteware statistics like an anomaly detector, but factors in malware statistics as well. Unfortunately, prior art attempts have fallen short in quickly and accurately classifying files using statistical methods. Most prior art is limited by the tradeoff between speed of evaluation and accuracy of results. Some methods sacrifice accuracy for the sake of speed by evaluating only a few computer file observable features (or, simply, observables), and thus have high false positive or false negative rates. Other methods sacrifice speed for the sake of accuracy by evaluating many observables, which causes the evaluation of each file to take a substantial amount of time.
Additionally, while prior art has employed statistical methods based on Bayes' Theorem, prior art employs only simple Naïve Bayes calculations or Multi-Naïve Bayes calculations. These calculations consider each observable separately from every other observable and assume that each observable is independent of all other observables. The Naïve Bayes implementation therefore discards information that could be used if the true dependent relationships between observables and the full power of Bayes' Theorem were harnessed. Accordingly, the Naïve Bayes and Multi-Naïve Bayes approaches yield less accurate results.
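The information lost under the independence assumption can be shown with a small worked sketch. The two binary observables, the training counts, and the equal prior below are all made up for illustration, and the joint-table calculation is only one simple way of accounting for dependence; it is not presented as the inference engine of the invention.

```python
from collections import Counter

# Made-up training counts of (observable_a, observable_b) pairs for each class.
# In malware the two observables tend to agree; in whiteware they tend to differ.
malware = Counter({(1, 1): 45, (0, 0): 45, (1, 0): 5, (0, 1): 5})
whiteware = Counter({(1, 1): 5, (0, 0): 5, (1, 0): 45, (0, 1): 45})


def joint_likelihood(counts: Counter, obs: tuple) -> float:
    """P(obs | class) estimated from the joint table, keeping the dependence."""
    return counts[obs] / sum(counts.values())


def naive_likelihood(counts: Counter, obs: tuple) -> float:
    """P(obs | class) as a product of per-observable marginals (independence assumed)."""
    total = sum(counts.values())
    p = 1.0
    for i, value in enumerate(obs):
        marginal = sum(n for key, n in counts.items() if key[i] == value)
        p *= marginal / total
    return p


def posterior_malware(likelihood, obs: tuple, prior: float = 0.5) -> float:
    """Bayes' Theorem: P(malware | obs) for the given likelihood estimate."""
    pm = likelihood(malware, obs) * prior
    pw = likelihood(whiteware, obs) * (1.0 - prior)
    return pm / (pm + pw)


obs = (1, 1)
naive = posterior_malware(naive_likelihood, obs)
joint = posterior_malware(joint_likelihood, obs)
print(naive, joint)
```

With these counts each observable taken alone is equally common in both classes, so the Naïve Bayes posterior stays at 0.5 and carries no evidence. The joint calculation gives a posterior of about 0.9, because the evidence lies entirely in how the two observables occur together.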
Accordingly, there exists a need for a computer file classification method and apparatus based on a fully-functional Bayesian Inference Engine which employs statistics based on both malware and whiteware observables and does not use signatures as a method of classifying files.