Field
The present disclosure generally relates to computer security, and in particular to generating generic file signatures for detecting malicious software.
Description of the Related Art
Malicious software, sometimes called “malware,” is generally defined as software that executes on a computing system surreptitiously, or that has some surreptitious functionality. Malware can take many forms, such as parasitic viruses that attach to legitimate files, worms that exploit weaknesses in the computer's security in order to infect the computer and spread to other computers, Trojan horse programs that appear legitimate but actually contain hidden malicious code, and spyware that monitors keystrokes and/or other actions on the computer in order to capture sensitive information or display advertisements. A wide variety of malicious software (malware) can attack modern computers. Malicious entities sometimes attack servers that store sensitive or confidential data that can be used to the malicious entity's own advantage. Similarly, other computers, including home computers, must be constantly protected from malicious software that can be transmitted when a user communicates with others via electronic mail, when a user downloads new programs or program updates, and in many other situations.
Conventional techniques for detecting malware, such as signature string scanning, are part of an overall computer security protection regime, but are less effective against today's malware. Modern malware is often targeted and delivered to only a relative handful of computers. For example, a Trojan horse program can be designed to target computers in a particular department of a particular enterprise. Such malware might never be encountered by security analysts, and thus the security software might never be configured with signatures for detecting such malware. Mass-distributed malware, in turn, can contain polymorphisms that make every instance of the malicious software unique. As a result, it is difficult to develop signature strings that reliably detect all instances of the malware.
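The limitation described above can be illustrated with a minimal sketch of signature string scanning. The signature name, the byte pattern, and the one-byte XOR re-encoding used as a stand-in for polymorphism are all hypothetical and chosen only for illustration; real scanners and real polymorphic engines are considerably more elaborate.

```python
# Hypothetical signature database: name -> byte pattern (illustrative values only).
SIGNATURES = {
    "Example.Trojan.A": b"\xde\xad\xbe\xef\x13\x37",
}


def scan(data: bytes, signatures=SIGNATURES):
    """Return the names of all signatures whose byte pattern occurs in data."""
    return [name for name, pattern in signatures.items() if pattern in data]


def xor_encode(payload: bytes, key: int) -> bytes:
    """Toy stand-in for polymorphism: re-encode the payload with a one-byte XOR key."""
    return bytes(b ^ key for b in payload)


payload = b"\xde\xad\xbe\xef\x13\x37"

# The original instance contains the exact byte pattern and is detected.
original = b"header" + payload + b"trailer"
hits_original = scan(original)

# A re-encoded instance carries the same logic in a different byte form,
# so the fixed signature string no longer matches.
variant = b"header" + xor_encode(payload, 0x5A) + b"trailer"
hits_variant = scan(variant)
```

In this sketch the scanner flags the original sample but misses the re-encoded variant, which is the difficulty the paragraph above attributes to polymorphic, mass-distributed malware.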
Newer techniques for detecting malware apply rules that make an inference about whether a target computer file is malicious by examining dynamic attributes of the target file, code, or software. This type of malware detection uses a set of heuristics to make the inference based on dynamic file attributes and then generates signatures (sometimes called behavioral signatures) to identify malware. It should be noted that the terms “heuristic” and “heuristic algorithm,” as used herein, generally refer to any type or form of algorithm, formula, model, or tool that may be used to classify or make decisions with respect to an object or sample.
The signatures are typically derived from decision trees developed using decision tree induction algorithms. Decision trees and other heuristics may be trained and refined using a corpus of known samples. As an example for detecting malware, a security-software vendor may train a malware detection heuristic by applying the heuristic to a corpus of samples containing known-malicious files and known-legitimate files. Known-legitimate files refer to software known to be non-malicious, and are sometimes referred to as “goodware.” Goodware can include common and/or popular software programs that are frequently present on a computer system.
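As a rough illustration of decision tree induction over a labeled corpus, the following sketch uses an ID3-style, information-gain split on boolean attributes. It is not the specific induction algorithm of any vendor; the attribute names (injects_into_process, opens_network_socket, writes_to_system_dir) and the six-sample corpus are hypothetical and invented for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_feature(rows, labels, features):
    """Pick the boolean feature with the highest information gain."""
    base = entropy(labels)

    def gain(feature):
        remainder = 0.0
        for value in (True, False):
            subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
            if subset:
                remainder += len(subset) / len(labels) * entropy(subset)
        return base - remainder

    return max(features, key=gain)


def induce(rows, labels, features):
    """Return a nested-dict decision tree, or a label string at a leaf."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    feature = best_feature(rows, labels, features)
    rest = [f for f in features if f != feature]
    tree = {"feature": feature}
    for value in (True, False):
        idx = [i for i, row in enumerate(rows) if row[feature] == value]
        if idx:
            tree[value] = induce([rows[i] for i in idx], [labels[i] for i in idx], rest)
        else:
            tree[value] = majority  # no samples on this branch: fall back to majority
    return tree


def classify(tree, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree[row[tree["feature"]]]
    return tree


# Hypothetical corpus of known-malicious files and goodware.
FEATURES = ["injects_into_process", "opens_network_socket", "writes_to_system_dir"]
corpus = [
    ((True, True, False), "malicious"),
    ((True, False, True), "malicious"),
    ((False, True, True), "malicious"),
    ((False, True, False), "goodware"),
    ((False, False, True), "goodware"),
    ((False, False, False), "goodware"),
]
rows = [dict(zip(FEATURES, values)) for values, _ in corpus]
labels = [label for _, label in corpus]

tree = induce(rows, labels, FEATURES)
predictions = [classify(tree, row) for row in rows]
```

Each root-to-leaf path in the resulting tree can be read as a conjunction of attribute tests, which is one way a behavioral signature might be derived from such a tree.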
The accuracy of a heuristic is often limited by the size of the corpus of samples used to train the heuristic. As such, heuristics may generate false negatives and/or false positives upon being deployed and used in the real world. The term “false positive” may represent an error made in rejecting a null hypothesis when the null hypothesis is actually true. For example, a malware-detection heuristic may produce a false positive by incorrectly determining that a legitimate file or software application is malicious. In order to improve the accuracy of a heuristic, heuristic providers typically: 1) add misclassified samples gathered from the field to the corpus of samples used to train the heuristic, 2) re-train the heuristic using the modified corpus of samples, and then 3) redeploy the re-trained heuristic. However, even if a heuristic is re-trained using a corpus of samples that includes misclassified samples gathered from the field, re-trained heuristics may produce new false positives upon being redeployed in the field. Because of this, heuristic providers may have to constantly redeploy and retest a heuristic until satisfactory performance is obtained.
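The error terms discussed above can be made concrete with a small sketch that computes false positive and false negative rates from a heuristic's verdicts and collects the misclassified samples that would be added to the training corpus. The function names, sample identifiers, and verdicts are hypothetical, and the null hypothesis is taken to be "the file is legitimate."

```python
def error_rates(predictions, truths, positive="malicious"):
    """Compute false positive and false negative rates for a two-class heuristic."""
    fp = sum(1 for p, t in zip(predictions, truths) if p == positive and t != positive)
    fn = sum(1 for p, t in zip(predictions, truths) if p != positive and t == positive)
    negatives = sum(1 for t in truths if t != positive)
    positives = len(truths) - negatives
    return {
        "false_positive_rate": fp / negatives if negatives else 0.0,
        "false_negative_rate": fn / positives if positives else 0.0,
    }


def misclassified(sample_ids, predictions, truths):
    """Return (sample_id, true_label) pairs the heuristic got wrong."""
    return [(s, t) for s, p, t in zip(sample_ids, predictions, truths) if p != t]


# Hypothetical field results: verdicts from a deployed heuristic versus ground truth.
sample_ids = ["f1", "f2", "f3", "f4", "f5", "f6"]
truths = ["malicious", "malicious", "goodware", "goodware", "goodware", "goodware"]
predictions = ["malicious", "goodware", "malicious", "goodware", "goodware", "goodware"]

rates = error_rates(predictions, truths)

# Step 1 of the re-training cycle: add misclassified field samples to the corpus.
training_corpus = [("t1", "malicious"), ("t2", "goodware")]
training_corpus.extend(misclassified(sample_ids, predictions, truths))
```

Here one of four legitimate files is flagged (a false positive rate of 0.25) and one of two malicious files is missed (a false negative rate of 0.5). Re-training and redeploying on the extended corpus would follow, with no guarantee that the re-trained heuristic avoids new false positives.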