For many multiuser environments, such as High Performance Computing (HPC) centers and cloud platforms, there is an increasing security-related need to know how those resources are being used. From preventing inefficient use of a capability to detecting unwanted or illegal codes, there is a spectrum of desired and undesired code that system maintainers should be cognizant of. The science of reliably developing and identifying signatures for diverse cyber datasets, such as an executable software corpus, is increasingly challenged by the rate, volume, and complexity of the software that is available. New applications are coming online at an increasing rate as computational capability, network bandwidth, and available compute cycles continue to increase, in keeping with Moore's Law. In particular, the challenge of software identity verification, or identifying which binaries are executing on a system at a given time, is increasingly difficult as the number and complexity of applications continue to increase, along with the number of variants of any given application. Here, a binary is an executable file or code. Some binaries are functional without an installer.
Clone detection is an existing software analysis approach that could potentially be used to recognize highly similar variants of a binary family. Clone detection is generally applied to large code bases in order to 1) find and eliminate cut-and-paste segments in large software projects, because these are especially prone to introducing complexity and bugs; 2) identify instances of software plagiarism; or 3) ensure that licensed code is free of open source code fragments or other software that would jeopardize a commercial license.
Clone detection is typically done either by analyzing source code or by operating on the disassembled binary (i.e., the assembly instructions).
Detecting similar binaries directly is the target of many commercial offerings and research projects, most of which are based on code signatures. Typically, these signatures are built from checksums or other transformations of the binary sequence into numerical representations, so that finding a match is equivalent to finding equal checksums. There are many variations on this theme, including simplistic approaches in which a single checksum is calculated for each binary. Such exact-matching methods are not suitable for recognizing binaries in a development environment, because successive builds of a binary are not expected to be exact matches. Likewise, in cloud environments there may be so many near-identical variants that exhaustively characterizing them beforehand is not practical. Exact-match approaches fail because adding a single nonsense instruction or changing a single data field (such as an internal author name or timestamp) results in an entirely unrelated checksum value. This is because, in general, checksums do not preserve similarity.
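The following minimal sketch illustrates why a whole-file checksum does not preserve similarity. It is an illustration only, not the method of any particular product: SHA-256 is assumed as the checksum, and `original` and `variant` are hypothetical byte strings standing in for two builds of a binary that differ in a single bit (for example, inside an embedded timestamp).

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as a hex string."""
    return hashlib.sha256(data).hexdigest()


def differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions at which two equal-length byte strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical stand-in for the contents of a binary (not a real executable).
original = bytes(range(256)) * 16

# A near-identical variant: flip one bit, as a changed timestamp field might.
variant = bytearray(original)
variant[100] ^= 0x01
variant = bytes(variant)

digest_a = checksum(original)
digest_b = checksum(variant)

print("input bytes changed:", differing_bytes(original, variant), "of", len(original))
print("digests equal:", digest_a == digest_b)
print(digest_a)
print(digest_b)
```

Although the two inputs agree in all but one byte, the two digests share no useful structure, so comparing them reveals only that the inputs are not identical, not how close they are.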
The main limitation of hash-based methods is that, because hashing yields either an exact match or no match (rather than a near match) for the segment being hashed, they have an inherent tradeoff between sensitivity and specificity. A hash of an entire binary will match the hash of another binary only if the two are exact matches. Introducing a single meaningless instruction into one binary will change its hash, making it appear to be an entirely distinct artifact. Hashing at the section level instead would, for the same example, produce a series of hashes that are the same and one that is different. However, a single trivial addition to each section would make all of the sections look distinct, again confounding the method. At the other end of the spectrum, graph similarity approaches either are computationally costly (and therefore not practical for line-speed identification of clones) or sacrifice sensitivity for speed.
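The section-level tradeoff described above can be sketched as follows. This is a toy model under stated assumptions: the section names and contents are invented for illustration, each binary is represented as a hypothetical dictionary mapping section names to bytes, and SHA-256 is assumed as the hash. No real executable format is parsed.

```python
import hashlib
from typing import Dict


def section_hashes(sections: Dict[str, bytes]) -> Dict[str, str]:
    """Hash each section independently (SHA-256 is an assumed choice)."""
    return {name: hashlib.sha256(body).hexdigest() for name, body in sections.items()}


def matching_sections(a: Dict[str, bytes], b: Dict[str, bytes]) -> int:
    """Count sections whose hashes are identical in both binaries."""
    ha, hb = section_hashes(a), section_hashes(b)
    return sum(1 for name in ha if hb.get(name) == ha[name])


# Hypothetical section contents standing in for a real binary.
original = {
    ".text": b"\x55\x48\x89\xe5" * 64,
    ".data": b"config" * 32,
    ".rodata": b"hello world\x00" * 16,
}

# Variant 1: one meaningless byte appended to a single section.
one_change = dict(original)
one_change[".text"] = original[".text"] + b"\x90"

# Variant 2: one trivial byte appended to every section.
all_changed = {name: body + b"\x00" for name, body in original.items()}

print(matching_sections(original, one_change))   # 2 of 3 sections still match
print(matching_sections(original, all_changed))  # 0 of 3 sections match
```

Finer-grained hashing tolerates a localized change, but one trivial edit per hashed unit is still enough to remove every match, which is the sensitivity and specificity tradeoff noted above.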