U.S. Pat. No. 6,792,545, assigned to the Assignee of the present application, discloses a system and method for performing secure investigations of networked devices over a computer network. Part of the investigation might entail locating a document in a target device that is identical to a reference document. This might be accomplished, for example, by computing a hash value of the reference document and then comparing the computed hash value against the hash values of the documents in the target device. Two documents with matching hash values are deemed identical to one another.
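By way of illustration only, the hash comparison described above might be sketched as follows. The choice of SHA-256, the function names `file_digest` and `find_identical`, and the in-memory dictionary standing in for the documents of the target device are all assumptions made for this example; the referenced patent does not prescribe them.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Return the SHA-256 digest of a document's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def find_identical(reference: bytes, candidates: dict[str, bytes]) -> list[str]:
    """Return the names of candidate documents whose digest equals the
    digest of the reference document.

    `candidates` is a hypothetical stand-in for the documents found on a
    target device, keyed by name.
    """
    reference_digest = file_digest(reference)
    return [
        name
        for name, content in candidates.items()
        if file_digest(content) == reference_digest
    ]


# Hypothetical documents on a target device.
target_documents = {
    "copy.txt": b"contents of the reference document",
    "other.txt": b"contents of an unrelated document",
}
matches = find_identical(b"contents of the reference document", target_documents)
```

In this sketch only `copy.txt` is reported, because its bytes are exactly those of the reference document.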
One way in which a malicious user of the target device may frustrate the use of cryptographic hashing to locate files of interest is by making minor alterations to the files. Changing even a single bit of a file changes its cryptographic hash. Thus, a forensic investigation system that uses a set of known cryptographic hashes for locating matching files is unsuccessful if data has been inserted into, modified in, or deleted from an otherwise identical file.
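The sensitivity of a cryptographic hash to a single-bit change might be demonstrated with the following minimal sketch. SHA-256, the helper `flip_bit`, and the sample file contents are assumptions chosen for illustration.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted."""
    altered = bytearray(data)
    altered[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(altered)


# Hypothetical file contents; only the lowest bit of the first byte is changed.
original = b"quarterly report: confidential"
altered = flip_bit(original, 0)

original_digest = hashlib.sha256(original).hexdigest()
altered_digest = hashlib.sha256(altered).hexdigest()

# The two files differ in a single bit, yet their digests no longer match,
# so a lookup against a set of known digests would miss the altered file.
digests_match = original_digest == altered_digest
```

Here `digests_match` is `False` even though the two inputs have the same length and differ in only one bit.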
One way to address this problem is to use a fuzzy hashing algorithm such as, for example, the algorithm known as “ssdeep.” In general terms, fuzzy hashing constructs hash signatures of chunks of data whose boundaries are determined by the context of the input. The signatures are then used to compute a numerical difference, usually expressed as a percentage, between the two files to which the fuzzy hashing algorithm was applied.
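The general idea of context-determined chunk boundaries might be sketched as follows. This is a simplified illustration and is not the ssdeep algorithm: the window length, the block size, the boundary rule based on a sum of recent bytes, the one-character-per-chunk encoding, and the use of `difflib.SequenceMatcher` to score two signatures are all assumptions made for this example. The sketch reports a similarity percentage, which is the complement of a difference percentage.

```python
import hashlib
from difflib import SequenceMatcher

WINDOW = 7       # bytes of trailing context used to pick boundaries (assumed)
BLOCK_SIZE = 16  # controls the average chunk length (assumed)
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def _chunk_char(chunk: bytes) -> str:
    """Reduce one chunk to a single character of the 64-symbol alphabet."""
    return ALPHABET[hashlib.sha256(chunk).digest()[0] % 64]


def signature(data: bytes) -> str:
    """Build a signature with one character per content-defined chunk."""
    chars = []
    start = 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1): i + 1]
        # A boundary depends only on the last few bytes, so an insertion
        # elsewhere in the file leaves most other boundaries unchanged.
        if sum(window) % BLOCK_SIZE == BLOCK_SIZE - 1:
            chars.append(_chunk_char(data[start: i + 1]))
            start = i + 1
    if start < len(data):
        chars.append(_chunk_char(data[start:]))  # trailing partial chunk
    return "".join(chars)


def similarity(a: bytes, b: bytes) -> int:
    """Score two inputs from 0 to 100 by comparing their signatures."""
    ratio = SequenceMatcher(None, signature(a), signature(b)).ratio()
    return round(100 * ratio)
```

Note that the length of the signature produced by this sketch depends on how many boundaries the input happens to contain, so signatures of different inputs generally differ in length.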
One drawback of using fuzzy hashing to locate almost identical files is that the hash values that are returned are not fixed in size. This makes storing the values in, and retrieving them from, a database inefficient. Furthermore, the returned numerical difference is not always proportional to the actual differences that exist between the two files being compared. In addition, computing an ssdeep hash is slow relative to existing hash functions.
Accordingly, what is desired is a system and method for locating almost identical files via mechanisms other than fuzzy hashing.