1. Field of the Invention
This invention relates to a computer-assisted method and apparatus for identifying duplicate and near-duplicate documents or text spans in a collection of documents or text spans, respectively.
2. Description of the Prior Art
The current art includes inventions that compare a single pair of known-to-be-similar documents to identify the differences between the documents. For example, the Unix “diff” program uses an efficient algorithm for finding the longest common sub-sequence (LCS) between two sequences, such as the lines in two documents. Aho, Hopcroft, and Ullman, Data Structures and Algorithms, Addison-Wesley Publishing Company, April 1987, pages 189–192. The lines that are left when the LCS is removed represent the changes needed to transform one document into the other. Additionally, U.S. Pat. No. 4,807,182 uses anchor points (points in common between two files) to identify differences between an original and a modified version of a document. There are also programs for comparing a pair of files, such as the Unix “cmp” program.
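By way of illustration only, the LCS-based comparison described above may be sketched as follows. This sketch uses a plain dynamic-programming table rather than the more efficient algorithm used by “diff”, and the function name diff_lines is hypothetical.

```python
def diff_lines(old, new):
    """Split two line sequences into (common, deleted, inserted) using an LCS.

    Illustrative only: a textbook O(m*n) dynamic program, not the
    algorithm actually implemented by the Unix diff program.
    """
    m, n = len(old), len(new)
    # table[i][j] holds the LCS length of old[i:] and new[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table: matching lines belong to the LCS, the rest are changes.
    common, deleted, inserted = [], [], []
    i = j = 0
    while i < m and j < n:
        if old[i] == new[j]:
            common.append(old[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            deleted.append(old[i])
            i += 1
        else:
            inserted.append(new[j])
            j += 1
    deleted.extend(old[i:])
    inserted.extend(new[j:])
    return common, deleted, inserted
```

Given the line sequences a, b, c, d and a, c, d, e, the sketch reports a, c, d as common, b as deleted, and e as inserted. Such a comparison presupposes a single pair of documents whose shared lines appear in the same order.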
Another approach for comparing documents is to compute a checksum for each document. If two documents have the same checksum, they are likely to be identical. But comparing documents using checksums is an extremely fragile method, since even a single character change in a document yields a different checksum. Thus, checksums are good for identifying exact duplicates, but not for identifying near-duplicates. U.S. Pat. No. 5,680,611 teaches the use of checksums to identify duplicate records. U.S. Pat. No. 5,898,836 discloses the use of checksums to identify whether a region of a document has changed by comparing checksums for sub-document passages, for example, the text between HTML tags.
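The fragility of checksum comparison may be illustrated by the following minimal sketch. The choice of SHA-256 and the function names checksum and changed_passages are assumptions made for illustration; the cited patents are not represented here as using any particular checksum function, and the passages are assumed to have been extracted and aligned beforehand.

```python
import hashlib


def checksum(text):
    """Return a hex digest of the text (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_passages(old_passages, new_passages):
    """Return indices of aligned passages whose checksums differ.

    Assumes both lists hold the same passages in the same order,
    for example the text found between successive HTML tags.
    """
    return [
        k
        for k, (old, new) in enumerate(zip(old_passages, new_passages))
        if checksum(old) != checksum(new)
    ]


# A single character change yields a different checksum.
same = checksum("The quick brown fox.") == checksum("The quick brown fox.")
near = checksum("The quick brown fox.") == checksum("The quick brown fox!")
```

In this sketch same is true and near is false, so the two near-duplicate strings are indistinguishable from unrelated strings. Passage-level checksums narrow down where a change occurred but still give no measure of how similar the changed passages are.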
Patrick Juola's method (Juola, Patrick, What Can We Do With Small Corpora? Document Categorization via Cross-Entropy, Proceedings of Workshop on Similarity and Categorization, 1997) uses the average length of matching character n-grams (an n-gram is a string of characters that may comprise all or part of a word) to identify similar documents. For each window of consecutive characters in the source document, the average length of the longest matching sub-sequence at each position in the target document is computed. This effectively computes the average length of match at each position within the target document (counting the number of consecutive matching characters starting from the first character of the n-gram) for every possible character n-gram within the source document. This technique depends on the frequency of the n-grams within the document, because it requires the n-grams and all sub-parts (at least the prefix sub-parts) to be of high frequency. The Juola method focuses on applications involving very small training corpora, and has been applied to a variety of areas, including language identification, determining authorship of a document, and text classification. The method does not provide a measure of distinctiveness.
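One simplified reading of the average match length described above may be sketched as follows. This is an illustration under stated assumptions and not Juola's published estimator: for each position in the target, it takes the longest run of consecutive characters starting at that position that also occurs somewhere in the source, and averages those lengths. The function names are hypothetical.

```python
def longest_match_at(source, target, j):
    """Length of the longest prefix of target[j:] that occurs anywhere in source."""
    length = 0
    while j + length < len(target) and target[j:j + length + 1] in source:
        length += 1
    return length


def mean_match_length(source, target):
    """Average, over all target positions, of the longest match found in source.

    Simplified illustration only; a naive search that is slow on long texts.
    """
    if not target:
        return 0.0
    total = sum(longest_match_at(source, target, j) for j in range(len(target)))
    return total / len(target)
```

For the source abcde and the target abxde, the per-position match lengths are 2, 1, 0, 2, 1, giving an average of 1.2. A higher average suggests greater similarity, but every character window contributes to the score regardless of how distinctive it is.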
The prior art does not compare more than two documents, does not allow text fragments in each document to appear in a different or arbitrary order, is not selective in the choice of n-grams used to compare the documents, does not use the frequency of the n-grams across documents for selecting n-grams used to compare the documents, and does not permit a mixture of very low frequency and very high frequency components in the n-grams.