The easy and routine distribution of computer code in source and binary form, and the importance of those distributions, have engendered strongly felt needs to identify code theft, establish code provenance, and detect the presence of malware. Each of these needs may be met, at least in part, by the ability to rapidly compare test code samples against a large library of reference code samples to detect reference samples similar to the test samples. In particular, there is a strong desire to recognize when incoming code binaries are variants of known examples of malware.
The most promising approaches to recognizing code similarity typically transform the code samples into streams of tokens. These tokens can represent source characters, words, functional names, op-codes, calls, or other features, or can represent parallel features in code binaries. For example, some methods produce token streams of characters, similar to DNA sequences, drawn from sections of code binaries. Other methods of code similarity detection are based on token streams of op-codes. Still other methods operate on streams of tokens drawn from source code, in which structural elements are reduced to single-letter tokens. Yet other methods are based on tokens representing calls to system routines made during runtime.
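As an illustration only, reducing source code to a stream of single-letter tokens might be sketched as follows. The token alphabet, the `TOKEN_MAP` table, and the `to_token_stream` function are hypothetical names chosen for this example and are not taken from any particular prior method.

```python
import re

# Hypothetical mapping from structural elements to single-letter tokens.
TOKEN_MAP = {
    "if": "I", "else": "E", "for": "F", "while": "W", "return": "R",
    "{": "B", "}": "b", "==": "Q", "=": "A",
}

# Lexemes recognized by this sketch: identifiers/keywords, integers,
# and a few operators. Everything else (spaces, parentheses,
# semicolons) is skipped.
_LEXEME = re.compile(r"[A-Za-z_]\w*|\d+|==|[{}=]")


def to_token_stream(source: str) -> str:
    """Reduce source text to a string of single-letter tokens."""
    out = []
    for lexeme in _LEXEME.findall(source):
        if lexeme in TOKEN_MAP:
            out.append(TOKEN_MAP[lexeme])   # structural element
        elif lexeme[0].isdigit():
            out.append("N")                 # any numeric literal
        else:
            out.append("V")                 # any identifier
    return "".join(out)


stream_a = to_token_stream("if (x == 1) { return y; }")
stream_b = to_token_stream("if (count == 42) { return total; }")
```

Under these assumptions, both snippets reduce to the same stream, because identifiers and literal values are collapsed into the generic tokens `V` and `N`. Renaming variables therefore does not hide the structural similarity.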
Concurrently, the internet has produced an explosion of text documents, leading to a strongly felt need to recognize similar passages of text for the purposes of detecting plagiarism in academic environments, establishing provenance, and reducing duplication. The most successful approaches to bulk detection of document similarity have also been based on converting document samples to token streams, with those tokens representing words or characters in the documents.
Despite these strongly felt needs and considerable work in this area, conventional methods generally do not achieve the processing speed and library capacity required to address the anticipated need for rapid bulk processing of input samples against a voluminous library of references. Moreover, conventional methods do not efficiently find matches that appear in a different order in the respective sequences.
Accordingly, it may be desirable to continue to develop improved and/or more efficient mechanisms by which alignment between token sequences with block permutations may be provided.
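The effect of a block permutation on an order-preserving comparison can be shown with a small sketch. It contrasts a textbook longest-common-subsequence (LCS) length with a simple k-gram index lookup. Both functions, their names, and the sample sequences are assumptions made for illustration. They do not represent the mechanism contemplated here or any specific conventional method.

```python
from collections import defaultdict


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (order-preserving)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def shared_kgrams(ref: str, test: str, k: int = 4):
    """Return (ref_pos, test_pos) pairs where a k-token block matches,
    regardless of where the block sits in either sequence."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    matches = []
    for j in range(len(test) - k + 1):
        for i in index.get(test[j:j + k], ()):
            matches.append((i, j))
    return matches


ref = "ABCDEFGH"
test = "EFGHABCD"   # the two halves of ref, swapped

ordered_score = lcs_length(ref, test)
block_matches = shared_kgrams(ref, test, k=4)
```

In this example the order-preserving comparison credits only four of the eight tokens, since it can keep one block but not both. The k-gram lookup reports both blocks, at `(4, 0)` and `(0, 4)`, even though they are out of order.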