It is known in the art that each day, many tens of thousands of new malicious software programs are discovered. These programs can compromise the security of general computing devices. Possible security violations include, but are not limited to, the theft of data from the system, the usurping of the system for other nefarious purposes (like sending spam email), and, in general, the remote control of the system (by someone other than its owner) for other malicious actions.
One popular technique in the art for detecting malicious software comprises the following steps:
a. Establishing through some independent means that the application is malicious (e.g., by having a human being manually analyze it and pinpoint the presence of one or more malicious behaviors).
b. Computing a hash or fingerprint of this software. A hash is a mathematical transformation that takes the underlying binary contents of a software application and produces a relatively short string, with the idea being that two different applications will, with overwhelmingly high probability, have distinct fingerprint values. Common functions for performing this fingerprinting or hashing step include, but are not limited to, SHA-256, SHA-1, MD5, and others. Besides hash and fingerprint, another term used in the art to describe this transformation is a signature. For the purposes of this invention, the terms hash, fingerprint and signature will be used interchangeably. These terms are not synonymous with each other, but for the purposes of the invention described, the differences are immaterial.
c. Publishing this hash so that it is accessible to end-users operating a general purpose computing device (for example, the hash can be posted to a blacklist of known malicious applications).
d. Having the device compare this published fingerprint with the fingerprint of any new software applications that have arrived on the system.
e. Applying a set of steps based on a given policy if the fingerprints match (e.g., blocking the installation of the application).
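The steps above can be illustrated with a minimal sketch, assuming SHA-256 as the fingerprint function and an in-memory set as the published blacklist. The function names (`fingerprint`, `build_blacklist`, `evaluate`), the "block"/"allow" policy, and the sample byte strings are hypothetical and chosen only for illustration; they are not taken from any particular anti-malware product.

```python
import hashlib


def fingerprint(binary_contents: bytes) -> str:
    """Step b: hash the binary contents into a short hex string (SHA-256)."""
    return hashlib.sha256(binary_contents).hexdigest()


def build_blacklist(known_malicious: list[bytes]) -> set[str]:
    """Steps a and c: fingerprints of samples already judged malicious.

    How a sample was judged malicious (step a) is outside this sketch.
    """
    return {fingerprint(sample) for sample in known_malicious}


def evaluate(binary_contents: bytes, blacklist: set[str]) -> str:
    """Steps d and e: compare fingerprints, then apply a simple policy."""
    if fingerprint(binary_contents) in blacklist:
        return "block"
    return "allow"


# Hypothetical byte strings standing in for real application binaries.
malicious_sample = b"placeholder bytes of a known malicious application"
benign_sample = b"placeholder bytes of an unrelated benign application"

blacklist = build_blacklist([malicious_sample])
verdict_malicious = evaluate(malicious_sample, blacklist)
verdict_benign = evaluate(benign_sample, blacklist)
```

In this sketch an application is blocked only when its exact fingerprint has already been published, which is what makes the approach reactive.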
The technique just described suffers from the drawback that it only works when an application is determined to be malicious ahead of time. Put differently, it is a reactive approach. It is understood in the art that superficial changes to a malicious application will often cause it to have a different fingerprint even though the underlying actions of the application continue to be malicious. In other words, the application will look different from the outside, but underneath its operations will be identical (analogous to how a criminal can put on different disguises involving wigs and sunglasses, even though underneath it is the same person). If the file is modified, even slightly, then the corresponding fingerprint will almost certainly change. Once the fingerprint changes, it no longer matches the one that was initially established for the application, and consequently the application can potentially evade detection by any anti-malware technology that uses a reactive signature-based approach.
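The evasion described above can be seen in a few lines, assuming SHA-256 as the fingerprint function. The byte strings are hypothetical placeholders, and the single appended byte stands in for any superficial modification that leaves the application's behavior unchanged.

```python
import hashlib

# Hypothetical stand-in for the binary contents of a known malicious application.
original = b"placeholder bytes of a known malicious application"
# Superficial change: one padding byte appended; behavior is assumed unchanged.
variant = original + b"\x00"

original_fp = hashlib.sha256(original).hexdigest()
variant_fp = hashlib.sha256(variant).hexdigest()

# Blacklist containing only the fingerprint established for the original.
blacklist = {original_fp}

original_detected = original_fp in blacklist  # the known sample matches
variant_detected = variant_fp in blacklist    # the modified copy does not
```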
The recent explosion in malware instances appears to be a result of malware authors making frequent, but innocuous, changes to a smaller number of applications rather than creating entirely new applications.
To address this issue, one technique in the art involves developing what are known as generic signatures. These signatures are designed to be invariant to superficial changes in the underlying binary contents of a software application. If a malicious party performs only a restricted set of superficial changes to the binary, then the resulting hash value will not change. For example, one way to construct a generic signature would be to do the following. First, extract structural properties of the file (such as the sizes of the different sections, the number of symbols, the entropy of the various sections). Second, normalize these values or put them in buckets. For example, if the size is between 0 bytes and 100 bytes, then it would belong in bucket one. If the size is between 100 and 200 bytes, it would belong in bucket two, and so on. Now, rather than using the original file to construct a signature, one could use the normalized structural features as the basis of the signature. The idea is that superficial changes to the file would likely yield little to no change in the underlying structure of the file, and that after normalization or bucketing no change would remain at all.
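One possible reading of this construction is sketched below. It is an illustration under stated assumptions, not a definitive implementation: the file is modeled as a list of raw section byte strings plus a symbol count, buckets are treated as half-open ranges (so 100 bytes falls in bucket two), and the bucket widths for symbol count and entropy are hypothetical. A real implementation would first parse the executable format to obtain these values.

```python
import hashlib
import math
from collections import Counter


def entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def bucket(value: float, width: float) -> int:
    """Bucket one covers [0, width), bucket two covers [width, 2 * width), etc."""
    return int(value // width) + 1


def generic_signature(sections: list[bytes], symbol_count: int) -> str:
    """Hash the bucketed structural features instead of the raw file bytes."""
    features = [bucket(symbol_count, 10)]         # assumed width: 10 symbols
    for section in sections:
        features.append(bucket(len(section), 100))        # 100-byte size buckets
        features.append(bucket(entropy(section), 1.0))    # assumed width: 1 bit
    encoded = ",".join(str(f) for f in features).encode("ascii")
    return hashlib.sha256(encoded).hexdigest()


# Hypothetical base threat and a superficially modified variant of it.
base_sections = [b"\x90" * 150, b"AAAB" * 100]
variant_sections = [b"\x90" * 163, b"AAAB" * 100 + b"A" * 8]
# Hypothetical unrelated application with a different structure.
other_sections = [b"Z" * 950, bytes(range(256))]

base_sig = generic_signature(base_sections, symbol_count=42)
variant_sig = generic_signature(variant_sections, symbol_count=44)
other_sig = generic_signature(other_sections, symbol_count=7)
```

Here the base threat and its variant differ byte for byte, yet every feature lands in the same bucket, so the two generic signatures are identical, while the structurally different application yields a different signature. Values that sit close to a bucket boundary can still shift buckets after a small change, which is one reason the choice of features and bucket widths matters.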
Consequently, a single generic signature can be used not only to detect a given base threat, but also to detect minor variations of that threat. To give a physical analogy that might help make the concept of a signature clearer, imagine you are trying to describe a criminal. You could do so by identifying very specific characteristics (such as hair color, eye color, what the criminal was wearing when last seen, etc.). However, if the criminal wore a wig or had colored contact lenses on, then characteristics like hair or eye color would not be useful. If instead, one were to focus on structural attributes, such as the criminal's height, weight, build, race, etc., then even in the presence of disguises these attributes would be constant. Furthermore, if one were to normalize these attributes (e.g., saying that the criminal is approximately 6 feet tall rather than exactly 6 feet and 2 inches, or saying that the criminal is heavyset rather than specifying a very specific build), you could potentially identify the criminal even if the criminal wore platform shoes and baggy clothing.
However, it is known in the art that even generic signatures have shortcomings. These shortcomings include, but are not limited to, the following:
a. Creating generic signatures might require manual intervention. (For example, a human computer virus analyst may have to directly examine the binary contents of the software application and determine how a signature should be computed so that it is invariant to innocuous changes in the applications.) In the context of the human criminal analogy listed above, one might have to identify exactly which attributes are interesting, and what range of values they should take.
b. Generic signatures are prone to false positives (i.e., a situation in which they incorrectly identify an application as malicious, even though it is in fact benign). Since generic signatures are designed to identify not just a single base software application, but also other applications that are related to it, there is a risk that a legitimate application might inadvertently be identified as malicious because its underlying binary contents bear some similarity to the malicious application on which the signature was based. In the context of the human criminal analogy given above, if the description were too vague, then every 6-foot-tall heavyset person might fit the description of the criminal.
There is, accordingly, a need in the art to develop methods, components, and systems for detecting malicious software in a way that addresses the above limitations. The present invention addresses these needs by providing a) an improved method for using generic signatures by using automation to reduce the amount of manual analysis and the risk of false positives in the system, b) a method of using contextual information, such as the presence of other recent (malicious) activity on a system, to formulate a more accurate picture regarding whether or not a particular software application running on the system might be malicious, c) a method of using machine learning technologies to train on a corpus in order to develop a machine learning model for the evaluation of applications of interest, and d) methods including two or more of methods (a) through (c).