Systems and methods developed to characterize digital data (or byte) streams are known. Such systems and methods are often used to detect computer viruses, worms and the like. More specifically, intrusion detection and antivirus systems typically use “signatures” to detect specific patterns of characters or digital bytes. Hashes, checksums and other numeric calculations are frequently used to characterize digital files and byte streams, including legitimate software files and malware. These techniques are used to identify items that are identical to the source of the signature. Generally speaking, they are neither intended to detect, nor even capable of detecting, similar, but non-identical, items.
There is, however, a known approach, as described in Todd Heberlein, Worm Detection and Prevention: Concept, Approach, and Experience, 14 Aug. 2002, NetSquared, Inc. (2002, unpublished) (“Heberlein”), that is capable of detecting similarity among selected sets of data. As explained by Heberlein, it is possible to characterize a selected portion of data using a “thumbprint.” In this case, the thumbprint is represented by the result of a hash function applied to the selected portion of data.
FIG. 6 shows the basic approach according to Heberlein. The original content, “The quick brown fox jumped over the lazy dog.”, is sent through a hash function that generates a number. The original content consists of 360 bits (8 bits per character times 45 characters) and the result is a single 32-bit number (a typical unsigned integer on most computers).
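Heberlein does not specify which hash function is used; as an illustrative sketch only, CRC-32 is used below as a stand-in, since it likewise maps arbitrary-length input to a single 32-bit unsigned integer:

```python
import zlib

sentence = "The quick brown fox jumped over the lazy dog."
data = sentence.encode("ascii")

# The original content: 45 characters times 8 bits = 360 bits.
assert len(data) * 8 == 360

# CRC-32 stands in for the unspecified hash function; it condenses
# the 360-bit input into one 32-bit unsigned integer (the "thumbprint").
thumbprint = zlib.crc32(data) & 0xFFFFFFFF
print(f"thumbprint = {thumbprint:#010x}")
```

Any function with a fixed-size numeric output could play the same role; the 32-bit width matches the typical unsigned integer mentioned above.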
This number can serve as a type of compact representation (i.e., the “thumbprint”) of the original content. For example, suppose a document is processed by this technique. A hash number is computed for each sentence in the document, and then the computed hash numbers are stored together in a hash table. Later, if a user provides a sample sentence and asks if that sentence is in the document, the following algorithm can be used to very quickly determine the answer. First, the hash value of the sample sentence is computed. Second, the hash table is queried to see if that number exists in the table. If it is not in the table, then the sample sentence is not in the document. Third, if there is a match, then the sentence (or sentences) in the original document that produced the hash value is examined to determine whether it, indeed, matches the sample sentence.
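The three-step lookup described above can be sketched as follows. This is a minimal illustration, not Heberlein's implementation; CRC-32 again stands in for the unspecified hash function, and the helper names are hypothetical:

```python
import zlib

def thumbprint(sentence: str) -> int:
    # CRC-32 stands in for the unspecified hash function.
    return zlib.crc32(sentence.encode()) & 0xFFFFFFFF

def build_table(document: list[str]) -> dict[int, list[str]]:
    # Hash each sentence of the document and store the hashes together
    # in a hash table, keyed by thumbprint.
    table: dict[int, list[str]] = {}
    for s in document:
        table.setdefault(thumbprint(s), []).append(s)
    return table

def sentence_in_document(table: dict[int, list[str]], sample: str) -> bool:
    # First: compute the hash value of the sample sentence.
    h = thumbprint(sample)
    # Second: if that number is not in the table, the sentence
    # is not in the document.
    if h not in table:
        return False
    # Third: on a match, examine the stored sentence(s) that produced
    # the hash value to rule out a hash collision.
    return sample in table[h]

document = ["The quick brown fox jumped over the lazy dog.",
            "Pack my box with five dozen liquor jugs."]
table = build_table(document)
```

Storing the original sentences alongside each hash supports the third (verification) step, since two different sentences can in principle collide on the same 32-bit number.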
As further explained by Heberlein, traditional hash functions do not work well in certain scenarios. Specifically, most hash functions are designed to produce a completely different hash number even if the content only varies by a single byte. For example, referring again to FIG. 6, if the original sentence is only slightly modified by changing the word “dog” to “dogs,” then a completely different hash number may be generated. In fact, using traditional hashing functions, a review of the resulting numbers for each string would not indicate that the two sentences were very similar at all.
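This behavior can be demonstrated with the same illustrative CRC-32 stand-in (again, an assumption, not the hash function Heberlein used):

```python
import zlib

a = "The quick brown fox jumped over the lazy dog."
b = "The quick brown fox jumped over the lazy dogs."  # "dog" -> "dogs"

ha = zlib.crc32(a.encode()) & 0xFFFFFFFF
hb = zlib.crc32(b.encode()) & 0xFFFFFFFF

# Although the inputs differ by a single character, the resulting
# 32-bit numbers bear no obvious relationship to one another.
print(f"{ha:#010x} vs {hb:#010x}")
```

Comparing the two numbers alone gives no indication that the underlying sentences are nearly identical.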
Heberlein goes on to explain that gross discrepancies between the representations of seemingly similar collections of data can be significantly diminished by applying a multivariate statistical analysis technique called principal component analysis (PCA) to the selected data.
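Heberlein gives no implementation details for this step. As one hedged sketch of the idea, each sentence might be represented by a feature vector (here, assumed to be letter counts), with every vector projected onto the first principal component; similar inputs then land near one another on that axis, unlike the wildly differing values a traditional hash would produce:

```python
import string

def letter_counts(sentence: str) -> list[float]:
    # Assumed feature vector: counts of each letter a-z (case-folded).
    s = sentence.lower()
    return [float(s.count(c)) for c in string.ascii_lowercase]

def first_pc_projection(vectors: list[list[float]]) -> list[float]:
    # Project each vector onto the first principal component, found by
    # power iteration on the covariance matrix of the centered data.
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    cov = [[sum(x[i] * x[j] for x in centered) / n for j in range(d)]
           for i in range(d)]
    w = [1.0] * d
    for _ in range(100):  # power iteration toward the top eigenvector
        w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
    return [sum(w[j] * x[j] for j in range(d)) for x in centered]

sentences = [
    "The quick brown fox jumped over the lazy dog.",
    "The quick brown fox jumped over the lazy dogs.",  # near-duplicate
    "Pack my box with five dozen liquor jugs.",        # unrelated
]
proj = first_pc_projection([letter_counts(s) for s in sentences])
```

Under these assumptions, the two near-duplicate sentences project to nearby values while the unrelated sentence lands far away, so a single-character edit no longer produces a grossly different characterization.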
Despite the advances described by Heberlein, there remains a desire to provide improved systems and methods for detecting computer viruses, worms, other computer attacks and/or any other data that may repeatedly pass over a network.