A. Field of the Invention
This invention is in the field of computerized processing of documents, and more particularly it relates to creating a measure of similarity so that duplicate and near-duplicate documents can be identified.
With regard to the present invention, the term “document” is interpreted broadly and covers Web pages, text files, multimedia files, DNA sequence files, etc.
B. Description of the Related Art
In many computer systems, it is useful to determine the resemblance between documents stored in the system. One example is cataloging the large number of documents that are identical or nearly identical. Another example is the removal of near-identical documents, which are created by copying documents or by making gradual small changes to them (producing different versions). Storage space can be reduced significantly by storing only one version of a set of similar documents.
From the search engine's perspective, there is the problem of serving search results that contain a large number of identical or nearly identical documents. It is desirable for the search engine to identify such documents in order to remove duplication from search results.
Therefore, it is desirable to provide a method that can determine the resemblance of documents.
A naive solution would be to compare all pairs of documents, but the number of comparisons grows quadratically with the number of documents, which makes this approach prohibitively expensive on large datasets.
Existing techniques for detecting duplicate and near-duplicate documents involve generating so-called “fingerprints” of documents; two documents are considered to be near-duplicates if they share more than a predetermined number of fingerprints. Broder et al. (U.S. Pat. No. 6,349,296, (February 2002)) used such a technique to find near-duplicate Web pages.
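The fingerprint-sharing test described above can be illustrated with a minimal sketch. It is not the method of the cited patent: the whitespace tokenization, the shingle length k, the use of SHA-1, and the names shingles, fingerprints, shared_fingerprints and is_near_duplicate are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-token shingles (contiguous token windows) of text."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def fingerprints(text, k=3):
    """Hash every shingle to a 64-bit integer fingerprint (hash choice is an assumption)."""
    return {
        int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles(text, k)
    }


def shared_fingerprints(doc_a, doc_b, k=3):
    """Count the fingerprints that the two documents have in common."""
    return len(fingerprints(doc_a, k) & fingerprints(doc_b, k))


def is_near_duplicate(doc_a, doc_b, min_shared=3, k=3):
    """Near-duplicates share more than a predetermined number of fingerprints."""
    return shared_fingerprints(doc_a, doc_b, k) > min_shared
```

For example, under these assumptions is_near_duplicate("the quick brown fox jumps over the lazy dog", "the quick brown fox jumps over the sleepy dog") evaluates to True, because the two sentences have five of their seven three-token shingles in common.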
Hoad and Zobel (T. C. Hoad and J. Zobel, “Methods for identifying versioned and plagiarised documents,” Journal of the American Society for Information Science and Technology, 54(3), pp. 203-215 (2003)) compared existing methods for identifying versioned and plagiarized documents and proposed their own approach.
Pugh et al. (U.S. Pat. No. 6,658,423, (December 2003)) developed a technique that assigns a number of fingerprints to a given document and considers two documents to be near-duplicates if any one of their fingerprints matches.
Another approach was developed by Charikar (U.S. Pat. No. 7,158,961, (January 2007)), which employs a similarity engine generating compact representations of objects called sketches. Sketches of different objects can be compared to determine the similarity between the two objects.
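A compact sketch of this general kind can be illustrated as follows. The code is a minimal sketch in the style commonly called simhash and does not reproduce the construction of the cited patent; the 64-bit width, the SHA-1 token hash, the whitespace tokenization, and the names token_hash, sketch and sketch_similarity are assumptions made for illustration.

```python
import hashlib
from collections import Counter


def token_hash(token, bits=64):
    """Map a token to a bits-wide integer (hash choice is an assumption)."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:bits // 8], "big")


def sketch(text, bits=64):
    """Build a bit-string sketch: each token votes on every bit, weighted by its frequency."""
    weights = Counter(text.lower().split())  # token order is discarded here
    totals = [0] * bits
    for token, weight in weights.items():
        h = token_hash(token, bits)
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    value = 0
    for i, total in enumerate(totals):
        if total > 0:  # a bit is set only when the positive votes win
            value |= 1 << i
    return value


def sketch_similarity(a, b, bits=64):
    """Fraction of bit positions on which two sketches agree."""
    differing = bin(a ^ b).count("1")
    return 1.0 - differing / bits
```

In this illustration two documents containing the same tokens in a different order receive the same sketch, while repeating a token increases its influence on every bit.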
In both the Broder and Charikar algorithms, each document (HTML page) is converted into a token sequence. Each algorithm then generates a bit string from the token sequence of a page and uses it to determine the near-duplicates of that page, which is consistent with the present invention.
The Charikar algorithm ignores the order of tokens but accounts for the frequency of terms. The Broder algorithm accounts for the order of the tokens; however, it ignores the frequency of shingles. Both algorithms can produce false positives (non-near-duplicate pairs returned as near-duplicates) as well as false negatives (near-duplicate pairs not returned as near-duplicates). Henzinger studied the Broder and Charikar algorithms and proposed a solution for large datasets (U.S. Pat. No. 8,015,162, (September 2011)) that combines both.
Shen's “Document near-duplicate detection” method (U.S. Pat. No. 7,962,491, (June 2011)) also includes generation of a compact “fingerprint” that describes the contents of the document. The similarity detection component compares multiple fingerprints based on the Hamming distance between the fingerprints. When the Hamming distance is below a threshold, the documents can be said to be near-duplicates of one another.
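The Hamming-distance comparison described above can be illustrated with a minimal sketch. It assumes that fingerprints are represented as non-negative integers of equal bit width; the threshold value and the names hamming_distance and near_duplicate_by_hamming are hypothetical and are not taken from the cited patent.

```python
def hamming_distance(fp_a, fp_b):
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(fp_a ^ fp_b).count("1")


def near_duplicate_by_hamming(fp_a, fp_b, threshold=3):
    """Treat two documents as near-duplicates when the distance is below the threshold."""
    return hamming_distance(fp_a, fp_b) < threshold
```

For example, the fingerprints 0b1011 and 0b1001 differ in one bit position, so with a threshold of 3 the corresponding documents would be treated as near-duplicates, whereas 0b1111 and 0b0000 differ in four positions and would not.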