Numerous methodologies are used in document management systems and search systems for efficient data comparison and data management. A typical data-comparison methodology does not actually determine a degree of similarity between two collections of data, such as two documents, but rather determines whether two documents are identical or “sufficiently identical.” Such a methodology typically compares the involved documents character by character. There are several problems with this methodology. First, access to an involved document is required each time that document is used in a comparison. Second, the time required to perform the document comparison is necessarily proportional to the length of the longest document. Third, any change to any one of the documents, however minimal, potentially causes the documents to be considered different. There are existing solutions designed to get around the first and third problems.
One way around the first problem is to compute a one-way function of the document. The result is a value that is typically much smaller than the document itself. Thus, the one-way function values of two documents can be compared without the need to access the documents themselves. This enables storage of the information needed for comparisons between many documents in a small amount of space and allows mutually-distrusting entities to determine whether they have documents in common without disclosing the content of other documents not in common. However, because the one-way function value is smaller than the document that it represents, it also contains less information, and theoretically many documents can map onto the same one-way function value. If two documents differ in their values for the function, then the documents are confirmed to be different. Conversely, if two documents have the same one-way function value, the most that can be concluded is that the two documents may be identical.
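The comparison described above can be sketched as follows. This is only an illustrative sketch: the choice of SHA-256 from Python's standard `hashlib` module, the function names `fingerprint` and `compare`, and the sample documents in `index` are assumptions made for the example and are not taken from the text.

```python
import hashlib


def fingerprint(document: bytes) -> str:
    """Return a short, fixed-size digest that stands in for the document."""
    # SHA-256 is an assumed choice; the text does not name a function here.
    return hashlib.sha256(document).hexdigest()


def compare(fp_a: str, fp_b: str) -> str:
    """Interpret two stored digests without touching the documents."""
    if fp_a != fp_b:
        return "different"        # differing values confirm a difference
    return "possibly identical"   # equal values only suggest identity


# Hypothetical index: only the digests are kept, not the documents.
index = {
    "report_v1": fingerprint(b"Quarterly results were strong."),
    "report_v2": fingerprint(b"Quarterly results were weak."),
    "report_copy": fingerprint(b"Quarterly results were strong."),
}

print(compare(index["report_v1"], index["report_copy"]))
print(compare(index["report_v1"], index["report_v2"]))
```

Each stored value is 64 hexadecimal characters regardless of document length, which is what allows many documents to be compared using little space.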
A simple one-way function is a “checksum.” A checksum of a document is computed by simply adding up all of the bytes (or words) in the document, ignoring overflow. This is simple to compute but potentially causes accidental collisions, wherein documents that are not identical can have identical checksums. This is because the checksum is insensitive to the order in which the bytes or words occur; that is, it has no notion of context. A better one-way function is a “cryptographic hash.” Such a function, if well-designed, has the following properties: (1) any one-bit change to the input results in, on average, half of the bits of the hash changing from zero to one or vice versa, and (2) the bits changed are uniformly distributed. From these two properties, it follows that the hash values of documents are uniformly distributed over the range of the hash function, that it is essentially impossible to determine anything about the content of the document from its hash value, and that for reasonably-sized hash values the probability of two documents accidentally having the same hash value is so small that it can be effectively ignored. With an n-bit hash value, if two documents have the same hash, the probability of this being an accidental collision is about 1 in 2^n (two to the nth power). Most cryptographic hashes are at least 128 bits, often larger, which means that it is exceedingly unlikely that any collision is accidental. Common examples of cryptographic hash functions are the various Secure Hash Algorithms (SHA, SHA-1, etc.) and the various Message Digest algorithms (MD4, MD5).
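The contrast between a checksum and a cryptographic hash can be illustrated with a short sketch. The 16-bit checksum width, the helper names `additive_checksum` and `differing_bits`, the sample documents, and the use of SHA-256 are all assumptions chosen for the example.

```python
import hashlib


def additive_checksum(data: bytes, bits: int = 16) -> int:
    """Sum all bytes, discarding overflow beyond `bits` bits."""
    return sum(data) % (1 << bits)


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


doc_a = b"pay Alice 10, pay Bob 90"
doc_b = b"pay Alice 90, pay Bob 10"  # same bytes as doc_a, in a different order

# The checksum ignores byte order, so these two documents collide.
same_checksum = additive_checksum(doc_a) == additive_checksum(doc_b)
# The cryptographic hash is sensitive to order, so the digests differ.
same_hash = hashlib.sha256(doc_a).digest() == hashlib.sha256(doc_b).digest()

# Flip a single input bit and count how many of the 256 digest bits change.
flipped_input = bytes([doc_a[0] ^ 0x01]) + doc_a[1:]
changed = differing_bits(
    hashlib.sha256(doc_a).digest(),
    hashlib.sha256(flipped_input).digest(),
)

print(same_checksum, same_hash, changed)
```

Here `same_checksum` is true while `same_hash` is false, and `changed` should land near half of the 256 digest bits, consistent with property (1) above.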
One way around the third problem associated with character-by-character comparison is to predefine “differences that make no difference” and normalize the involved documents before (or during) the comparison. In one example, it is predetermined that line breaks are not important and are not to be considered in the document comparison. In another example, it is predetermined that all whitespace characters (spaces and tabs) are to be considered equivalent. In still another example, it is predetermined that the document comparison is not to be case-sensitive; that is, case is not important, and letters in the involved documents are all converted to lower case (or upper case). Such normalization can be combined with the aforementioned techniques for the computation of a one-way function. While normalization is often helpful, it has the drawback that the normalization routine must be defined in advance.
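A minimal sketch of normalization applied before a one-way function is shown below. The specific rules (lower-casing, and collapsing every run of spaces, tabs, and line breaks into a single space) are one assumed set of “differences that make no difference”; the names `normalize` and `normalized_fingerprint` and the use of SHA-256 are likewise hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Remove differences that are predefined to make no difference."""
    text = text.lower()               # comparison is not case-sensitive
    text = re.sub(r"\s+", " ", text)  # spaces, tabs, line breaks are equivalent
    return text.strip()               # ignore leading and trailing whitespace


def normalized_fingerprint(text: str) -> str:
    """Hash the normalized form so that trivial edits give the same value."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


a = "Hello\tWorld\n"
b = "hello   world"
c = "hello word"

print(normalized_fingerprint(a) == normalized_fingerprint(b))
print(normalized_fingerprint(a) == normalized_fingerprint(c))
```

The first comparison treats the two texts as matching because they differ only in case and whitespace; the second does not, because a change outside the predefined rules still produces a different value.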
As noted above, the conventional character-by-character comparison allows only discovery of documents that are identical.