The easy and routine distribution of computer code in source and binary form, and the importance of those distributions, have engendered strongly felt needs to identify code theft, code provenance, and the presence of malware. Each of these needs may be met, at least in part, by the ability to rapidly compare test code samples against a large library of reference code samples to detect reference samples similar to the test samples. In particular, there is a strong desire to recognize when incoming code binaries are variants of known examples of malware.
The most promising approaches to recognizing code similarity typically transform the code samples into streams of tokens. These tokens can represent source characters, words, function names, op-codes, calls, or other features, or can represent parallel features in code binaries. For example, some methods produce token streams of characters, similar to DNA streams, drawn from sections of code binaries. Other methods of code similarity detection may be based on token streams of op-codes. Still other methods operate on streams of tokens drawn from source code, in which structural elements are reduced to single-letter tokens. Yet other methods may be based on tokens representing calls made to system routines during runtime.
Concurrently, the internet has offered an explosion of text documents, leading to a strongly felt need to recognize similar passages of text for the purposes of detecting plagiarism in academic environments, establishing provenance, and reducing duplication. The most successful approaches to bulk detection of document similarity have also been based on converting document samples to token streams, with those tokens representing words or characters in the documents.
In some examples, each sample is converted to a token stream, from which n-grams are extracted to form a signature of the sample. A library of references is formed by recording the signatures of reference samples, together with identifying information. To examine a test sample, its signature is constructed in a parallel manner, and the signature is compared to those in the library. References whose signatures are sufficiently similar to the signature of the test sample are reported as similar.
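The signature-and-library flow described above can be sketched as follows. This is a minimal illustration only: the function names, the use of a set of n-grams as the signature, the Jaccard measure of similarity, and the threshold value are all assumptions made for the example and are not specified by the approaches described.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

Signature = FrozenSet[Tuple[str, ...]]


def signature(tokens: Sequence[str], n: int) -> Signature:
    """Signature of a sample: the set of n-grams in its token stream (assumed form)."""
    return frozenset(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def build_library(references: Dict[str, Sequence[str]], n: int) -> Dict[str, Signature]:
    """Record the signature of each reference sample under its identifying name."""
    return {name: signature(tokens, n) for name, tokens in references.items()}


def jaccard(a: Signature, b: Signature) -> float:
    """Assumed similarity measure: shared n-grams divided by all distinct n-grams."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_similar(test_tokens: Sequence[str], library: Dict[str, Signature],
                 n: int, threshold: float) -> List[Tuple[str, float]]:
    """Report references whose signatures are sufficiently similar to the test sample's."""
    test_sig = signature(test_tokens, n)  # constructed in a parallel manner
    scored = [(name, jaccard(test_sig, sig)) for name, sig in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical op-code token streams, used purely for illustration.
library = build_library(
    {
        "ref_a": "push mov call add ret".split(),
        "ref_b": "xor jmp nop nop ret".split(),
    },
    n=2,
)
matches = find_similar("push mov call sub ret".split(), library, n=2, threshold=0.3)
```

In this sketch the test sample shares two of six distinct 2-grams with `ref_a` and none with `ref_b`, so only `ref_a` is reported. Comparing the test signature against every library entry, as done here, scales linearly with library size.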
An n-gram is an n-long sequence of consecutive tokens drawn from a stream of tokens. Representing a token stream by its constituent n-grams provides a uniform basis for comparing token streams, tolerates small differences between token streams, and permits rapid computation. One can also easily represent n-grams by their numeric hash values, thereby saving space and providing a numeric index or key into tables for recording and look-up purposes. Accordingly, one can construct signatures of token streams using n-gram hash values, rather than using n-grams directly.
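As a brief illustration of representing n-grams by numeric hash values and using those values as table keys, consider the following sketch. The choice of hash function (BLAKE2b from the Python standard library), the 32-bit width, the separator byte, and the names `ngram_hashes` and `index_reference` are assumptions made for the example rather than features of any particular method.

```python
import hashlib
from collections import defaultdict
from typing import DefaultDict, Sequence, Set


def ngram_hashes(tokens: Sequence[str], n: int, bits: int = 32) -> Set[int]:
    """Return the set of numeric hash values of the n-grams in a token stream.

    `bits` may be at most 64 here because an 8-byte digest is used.
    """
    mask = (1 << bits) - 1
    hashes: Set[int] = set()
    for i in range(len(tokens) - n + 1):
        # Join the n consecutive tokens with a separator that is assumed
        # not to occur inside any token, so distinct n-grams stay distinct.
        gram = "\x1f".join(tokens[i:i + n]).encode("utf-8")
        digest = hashlib.blake2b(gram, digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big") & mask)
    return hashes


def index_reference(index: DefaultDict[int, Set[str]], ref_id: str,
                    tokens: Sequence[str], n: int) -> None:
    """Record a reference in a table keyed by n-gram hash value."""
    for h in ngram_hashes(tokens, n):
        index[h].add(ref_id)


# Hypothetical usage: one reference is recorded, then a single 2-gram is looked up.
index: DefaultDict[int, Set[str]] = defaultdict(set)
index_reference(index, "ref_a", ["push", "mov", "call"], n=2)
probe = ngram_hashes(["mov", "call"], n=2)
```

Each value in `probe` can serve directly as a key into `index`. A narrower hash saves space at the cost of occasional collisions between unrelated n-grams.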
Despite these strongly felt needs and considerable work in this area, conventional methods generally do not achieve the processing speed and library capacity required to address the anticipated need for rapid bulk processing of input samples against a voluminous library of references.
Accordingly, it may be desirable to continue to develop improved and/or more efficient mechanisms by which protection against malware may be provided. Moreover, in some cases, the detection of related code variants in binaries outside the context of malware detection may also be useful.