Systems exist to detect (and thus eliminate) malware (e.g., viruses, worms, Trojan horses, and spyware). Such malware detection systems typically work by using static bit signatures, heuristics, or both to identify malware. Static bit signature based malware detection involves identifying a specific bit-level pattern (signature) in known malware. Files are then scanned to determine whether they contain this signature. When malware is identified using static file signatures, the certainty of the conviction is high. However, signature based detection can be circumvented by changing the content of the malware so that it no longer contains the known pattern. Signatures have become less useful, as malware authors have become more sophisticated at manipulating their malware to avoid signature based detection.
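The signature scanning described above, and the ease with which it can be circumvented, can be illustrated with a minimal sketch. The signature names and byte patterns below are invented for illustration and do not correspond to any real malware or any particular detection product.

```python
# Hypothetical signature database: name -> byte pattern (illustrative values only).
SIGNATURES = {
    "Example.Worm.A": bytes.fromhex("deadbeef4d5a9000"),
    "Example.Trojan.B": b"\x13\x37evil_payload\x00",
}


def scan(data):
    """Return the names of all signatures whose byte pattern occurs in data."""
    return [name for name, pattern in SIGNATURES.items() if pattern in data]


# A file that contains a known pattern verbatim.
original = b"header" + bytes.fromhex("deadbeef4d5a9000") + b"trailer"

# Changing a single byte inside the pattern defeats the exact match.
mutated = original.replace(bytes.fromhex("deadbeef"), bytes.fromhex("deadbeee"))

print(scan(original))  # ['Example.Worm.A']
print(scan(mutated))   # []
```

A match is a high-certainty conviction because the pattern was taken from known malware, but the one-byte change in the second file is enough to escape detection.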
Heuristic malware detection involves determining the likelihood of a given file being malware by applying various decision-based rules or weighting methods. Heuristic analysis can produce a useful result in many circumstances, but there is no mathematical proof of its correctness. In static file heuristics, the contents of the file are heuristically analyzed. In behavior based heuristics, the behavior of the program is heuristically analyzed. Both methods involve training a heuristic analyzer with a sample set of malware and clean files, so that it can make generalizations about the types of content or behaviors associated with each. Identifications of suspected malware using heuristic analysis can never, by definition, be entirely certain, as heuristic analysis only determines a likelihood of a file being clean or malicious. The confidence in heuristic based file convictions further suffers from the fact that the training set is difficult to define, and is always different from the real world set.
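As a rough sketch of a weighting method of this kind, the following assigns each rule a weight and sums the weights of the rules that fire. The rule names and weights are hypothetical and hand-picked for illustration; in practice they would be derived from a training set of malware and clean files.

```python
# Hypothetical rules and weights (assumptions for illustration, not learned values).
RULE_WEIGHTS = {
    "packed_executable": 35,
    "modifies_startup_entries": 40,
    "unsigned": 25,
}


def suspicion_score(observed_traits):
    """Sum the weights of the rules that fired.

    0 means no rule fired; 100 means every rule fired. The result is a
    likelihood estimate only, not a certain conviction.
    """
    return sum(w for rule, w in RULE_WEIGHTS.items() if rule in observed_traits)


print(suspicion_score({"unsigned"}))                                       # 25
print(suspicion_score({"packed_executable", "modifies_startup_entries"}))  # 75
```

The same scoring function applies whether the traits come from the contents of a file (static file heuristics) or from the observed behavior of a program (behavior based heuristics).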
One chief drawback of behavior based malware detection is false positives. Due to the inherent uncertainty in heuristic analysis, the potential exists to convict a non-malicious file that appears to be acting in a malicious manner. Falsely classifying clean files as malicious is problematic, because it often results in legitimate, potentially important content being blocked. To address this problem, the aggressiveness of the heuristics used is often turned down, so as to lower the false positive rate. Unfortunately, dialing down the aggressiveness of the heuristics concomitantly causes the true positive rate to fall as well. In other words, by using weaker heuristics, malicious files are more likely to be falsely classified as being clean and passed through to users.
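The trade-off described above can be made concrete with a small sketch. The scores below are made-up outputs of a hypothetical heuristic analyzer, and raising the conviction threshold stands in for turning down the aggressiveness of the heuristics.

```python
# (score, is_malware) pairs; the scores are invented for illustration.
SAMPLES = [
    (0.95, True), (0.80, True), (0.65, True), (0.40, True),
    (0.70, False), (0.30, False), (0.20, False), (0.10, False),
]


def rates(threshold):
    """Return (true_positive_rate, false_positive_rate).

    A file is convicted when its score is at or above the threshold.
    """
    malware = [s for s, bad in SAMPLES if bad]
    clean = [s for s, bad in SAMPLES if not bad]
    tpr = sum(s >= threshold for s in malware) / len(malware)
    fpr = sum(s >= threshold for s in clean) / len(clean)
    return tpr, fpr


for t in (0.5, 0.75):
    tpr, fpr = rates(t)
    print(f"threshold={t}: TPR={tpr:.2f}, FPR={fpr:.2f}")
# threshold=0.5: TPR=0.75, FPR=0.25
# threshold=0.75: TPR=0.50, FPR=0.00
```

In this toy data set, raising the threshold from 0.5 to 0.75 removes the single false positive but also lets one additional malicious file pass as clean.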
Tracking the reputations of sources from which electronic data originates is another technique used to identify malicious files. For example, the reputations of email addresses and domains can be tracked to identify trustworthy versus potentially malicious email senders and file signatures. Reputation based file classification can be effective when the source of a given file is well known. Where a lot of electronic content originates from a source over time, the reputation of that source can be confidently evaluated and used to screen or pass through content. Unfortunately, reputation based file classification has difficulty confidently evaluating sources in the low prevalence range, that is, sources from which little content has been observed.
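A minimal sketch of reputation tracking, and of why low prevalence sources are hard to evaluate, follows. The class name, the observation cutoff, the trust ratio, and the example domains are all assumptions chosen for illustration.

```python
from collections import defaultdict

MIN_OBSERVATIONS = 20      # assumed cutoff below which a source is low prevalence
TRUSTED_CLEAN_RATIO = 0.95  # assumed fraction of clean content required for trust


class ReputationTracker:
    """Hypothetical tracker that counts clean and malicious items per source."""

    def __init__(self):
        self._counts = defaultdict(lambda: [0, 0])  # source -> [clean, malicious]

    def record(self, source, malicious):
        self._counts[source][1 if malicious else 0] += 1

    def classify(self, source):
        clean, bad = self._counts.get(source, (0, 0))
        total = clean + bad
        if total < MIN_OBSERVATIONS:
            return "unknown"  # too little history to evaluate confidently
        return "trusted" if clean / total >= TRUSTED_CLEAN_RATIO else "suspicious"


tracker = ReputationTracker()
for _ in range(100):
    tracker.record("newsletter.example", malicious=False)
for i in range(30):
    tracker.record("spammy.example", malicious=(i < 10))
for _ in range(2):
    tracker.record("rare.example", malicious=False)

print(tracker.classify("newsletter.example"))  # trusted
print(tracker.classify("spammy.example"))      # suspicious
print(tracker.classify("rare.example"))        # unknown
```

The third source has sent only clean content, yet with two observations there is not enough evidence to either trust or distrust it.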
It would be desirable to address these issues.