Large collections of documents typically include many documents that are identical to or nearly identical to one another. Determining whether two digitally-encoded documents are bit-for-bit identical is easy (using hashing techniques, for example). Quickly identifying documents that are roughly or effectively identical, however, is a more challenging and, in many contexts, a more useful task. The World Wide Web is an extremely large set of documents. The Web having grown exponentially since its birth, Web indexes currently include approximately five billion web pages (the static Web being estimated at twenty billion pages), a significant portion of which are duplicates and near-duplicates. Applications such as web crawlers and search engines benefit from the capacity to detect near-duplicates. For example, it may be desirable to have such applications ignore most duplicates and near-duplicates, or to filter the results of a query so that similar documents are grouped together.
“Shingling” or “shingleprinting” techniques have been developed to address the problem of finding similar objects in large collections. Various aspects of such techniques are described in the following patent references: U.S. Pat. No. 5,909,677, Broder et al., “Method for Determining the Resemblance of Documents,” filed on Jun. 18, 1996; U.S. Pat. No. 5,974,481, Broder, “Method for Estimating the Probability of Collisions of Fingerprints,” filed on Sep. 15, 1997; U.S. Pat. No. 6,269,362, Broder et al., “System and Method for Monitoring Web Pages by Comparing Generated Abstracts,” filed on Dec. 19, 1997, in which the inventor of the present application is a co-inventor; U.S. Pat. No. 6,119,124, Broder et al., “Method for Clustering Closely Resembling Data Objects,” filed on Mar. 26, 1998, in which the inventor of the present application is a co-inventor; U.S. Pat. No. 6,349,296, Broder et al., “Method for Clustering Closely Resembling Data Objects,” filed on Aug. 21, 2000, in which the inventor of the present application is a co-inventor; and U.S. patent application Ser. No. 09/960,583, Manasse et al., “System and Method for Determining Likely Identity in a Biometric Database,” filed on Sep. 21, 2001 and published on Mar. 27, 2003, in which the inventor of the present application is a co-inventor. See also Broder, “On the Resemblance and Containment of Documents,” 1997 Proc. Compression & Complexity of Sequences 21-29 (IEEE 1998); Broder, Glassman, Manasse, and Zweig, “Syntactic Clustering of the Web,” Proc. 6th Intl. World Wide Web Conf. 391-404 (April 1997); Manasse, “Finding Similar Things Quickly in Large Collections,” <http://research.microsoft.com/research/sv/PageTurner/similarity.htm> (2004). Each of these patent and non-patent references is incorporated herein by reference.
In the shingling approach, a document is reduced to a set of features that are sufficiently representative of the document, so that two very similar documents will share a large number of features. For a text-content document, it has proved useful to extract as features the set of overlapping contiguous w-word subphrases (its “w-shingling”), where w is a fixed number. Letting D1 and D2 be documents, and F1 and F2 their respective sets of features, we define the similarity of D1 and D2 to be the Jaccard coefficient of the feature sets,
$$\mathrm{Sim}(D_1, D_2) = \frac{|F_1 \cap F_2|}{|F_1 \cup F_2|}$$
(that is, the number of common features in the two documents, divided by the total number of distinct features in the two documents). This gives a number between 0 and 1; the similarity of two essentially-equivalent documents will be a number close to one, while the similarity for most pairs of dissimilar documents will be a number close to zero. It should be noted that shingling techniques for detecting effectively-identical items in large collections are not restricted to text corpora. Shingling may be applied to collections of any sort of data object, such as sound recordings or visual images, for which it is possible to extract a set of representative features.
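The w-shingling and the Jaccard coefficient described above can be illustrated with a minimal Python sketch. The function names `shingles` and `jaccard`, the whitespace tokenization, the lower-casing, and the treatment of two empty feature sets as identical are all assumptions made for illustration; they are not taken from the referenced work.

```python
def shingles(text, w=4):
    """Return the set of overlapping contiguous w-word subphrases of text.

    Assumption: words are separated by whitespace and compared
    case-insensitively; real systems normalize text more carefully.
    """
    words = text.lower().split()
    if len(words) < w:
        # Assumption: a short document contributes one feature (all its words).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard(f1, f2):
    """Jaccard coefficient |F1 & F2| / |F1 | F2| of two feature sets."""
    union = f1 | f2
    if not union:
        return 1.0  # assumption: two empty feature sets count as identical
    return len(f1 & f2) / len(union)


if __name__ == "__main__":
    d1 = "a rose is a rose is a rose"
    d2 = "a rose is a flower which is a rose"
    # The two 4-shinglings share one feature out of eight distinct features.
    print(jaccard(shingles(d1), shingles(d2)))  # prints 0.125
```

Because the features form a set, a shingle that occurs several times in a document is counted once.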
The number of features extracted from each document is potentially quite large (as large as the number of words in the document). If it is assumed that the document collection is itself very large (perhaps billions, as in the case of the Web), computing similarity values exactly and performing pairwise comparison is quadratic in the size of the collection, which is prohibitively expensive. Similarity is therefore approximated in order to reduce the problem to one of manageable size.
The approximation involves sampling the feature set of each document in a way that preserves document similarity. In principle, one can use a random one-to-one function from the space of features to a well-ordered set larger than the set of all features. By well-orderedness, there is a smallest element of the image of the feature set under the random function. The pre-image of the smallest element is taken as the chosen sample feature. This works because all functions are equally probable. Every element of a set is equally likely to be mapped to the smallest element, and, when choosing from two sets, the smallest element is chosen uniformly from their union. The two sets therefore select the same sample exactly when that element lies in their intersection, which happens with probability equal to the Jaccard coefficient.
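The idealized scheme can be simulated directly on small sets. The sketch below is an illustration only: it stands in for the "random one-to-one function" with a uniformly random ordering of the union of the two feature sets, and it measures how often both sets select the same sample. The function names, the trial count, and the seeding are hypothetical choices. Both feature sets are assumed to be non-empty and to hold mutually sortable features.

```python
import random


def min_sample(features, rank):
    """Return the feature whose image under the random ranking is smallest."""
    return min(features, key=rank.__getitem__)


def estimate_by_random_orderings(f1, f2, trials=2000, seed=0):
    """Fraction of random one-to-one rankings under which f1 and f2
    select the same sample; this approximates the Jaccard coefficient."""
    rng = random.Random(seed)
    universe = sorted(f1 | f2)  # sorted only to make runs reproducible
    matches = 0
    for _ in range(trials):
        order = universe[:]
        rng.shuffle(order)  # a uniformly random one-to-one ranking
        rank = {feature: position for position, feature in enumerate(order)}
        if min_sample(f1, rank) == min_sample(f2, rank):
            matches += 1
    return matches / trials


if __name__ == "__main__":
    f1 = set(range(0, 8))
    f2 = set(range(4, 12))  # 4 shared features out of 12, so Sim = 1/3
    print(estimate_by_random_orderings(f1, f2))
```

A fresh ranking per trial is affordable only for toy inputs, which is why a practical implementation replaces it with an easily parameterized family of functions.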
The foregoing scheme must be altered in order for it to be practically implementable. First, to pick uniformly, it is convenient to make the image set a finite set of integers. If the feature set is unbounded, it is difficult to get a one-to-one function to a finite set. Using a well-selected hash function, preferably Rabin fingerprints, to hash each feature into a number with a fixed number of bits, a set can be chosen that is large enough that the probability of collisions across the set is vanishingly small. Second, instead of picking a truly random function, the function is chosen from a smaller, easily parameterized set of functions, where the chosen function is provably good enough to get arbitrarily close to the correct probability. Typically, a combination of linear congruential permutations is used along with Rabin fingerprints, although this is not provably correct.
The technique provides a mechanism for selecting one feature $f_i$ from each feature set $F_i$ such that $\mathrm{Prob}(f_i = f_j) = \mathrm{Sim}(D_i, D_j)$. This selection mechanism provides unbiased point estimators for similarity. Some number $r$ of selectors is chosen. For each document $D_i$, the samples $f_{i,1}, \ldots, f_{i,r}$ are computed, using each selector once on $D_i$. At the cost of some preprocessing, this reduces the data storage for each item to a constant, and reduces comparison of sets to matching terms in vectors.
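A practical variant with r selectors might look like the following sketch. Several choices here are assumptions rather than the method of the referenced work: a 64-bit BLAKE2b digest from Python's `hashlib` stands in for Rabin fingerprints, each selector is a single linear congruential permutation x → (a·x + b) mod 2^64 with odd a, and the vector stores the smallest permuted value rather than its pre-image (equivalent for matching purposes, because the permutation is one-to-one). All function names are hypothetical.

```python
import hashlib
import random

MASK64 = (1 << 64) - 1


def fingerprint64(feature):
    """Hash a text feature to a 64-bit integer.
    Assumption: BLAKE2b is used here as a stand-in for Rabin fingerprints."""
    digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_selectors(r, seed=0):
    """Draw r (a, b) pairs; an odd multiplier makes x -> a*x + b (mod 2**64)
    a permutation of the 64-bit integers."""
    rng = random.Random(seed)
    return [(rng.getrandbits(64) | 1, rng.getrandbits(64)) for _ in range(r)]


def sketch(features, selectors):
    """Apply each selector once to a non-empty feature set, keeping the
    smallest permuted fingerprint as that selector's sample."""
    prints = [fingerprint64(f) for f in features]
    return [min((a * x + b) & MASK64 for x in prints) for a, b in selectors]


def match_fraction(sketch1, sketch2):
    """Fraction of positions at which two sketches agree (similarity estimate)."""
    return sum(x == y for x, y in zip(sketch1, sketch2)) / len(sketch1)


if __name__ == "__main__":
    selectors = make_selectors(84)
    a = {f"w{i}" for i in range(0, 300)}
    b = {f"w{i}" for i in range(100, 400)}  # true similarity 200/400 = 0.5
    print(match_fraction(sketch(a, selectors), sketch(b, selectors)))
```

Every document must be sketched with the same list of selectors, since only samples produced by the same selector are comparable.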
By running multiple independent selection mechanisms, an estimate of the percentage of similarity of two documents is obtained by counting matches in the vectors of selections. If $p = \mathrm{Sim}(D, E)$, then each term in the vectors for D and E matches with probability $p$. The probability of matching $k$ terms in a row is $p^k$. The vectors can be compressed by hashing non-overlapping runs of $k$ items to single integers chosen from a large enough space that the probability of collisions in the hash values is negligible, while reducing storage needs by a factor of $k$. If there are $s$ groups (“supersamples”) of length $k$, the probability of one or more supersamples matching is $1 - (1 - p^k)^s$, and the probability of two or more supersamples matching is $1 - (1 - p^k)^s - s\,p^k (1 - p^k)^{s-1}$.
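The compression of a sample vector into supersamples can be sketched as follows. This is an illustration under stated assumptions: samples are taken to be unsigned 64-bit integers, and a 64-bit BLAKE2b digest is an arbitrary choice of hash for each run; the function names are hypothetical.

```python
import hashlib
import struct


def supersamples(samples, k):
    """Hash each non-overlapping run of k 64-bit samples to one 64-bit integer."""
    if k <= 0 or len(samples) % k != 0:
        raise ValueError("number of samples must be a positive multiple of k")
    result = []
    for start in range(0, len(samples), k):
        run = samples[start:start + k]
        packed = struct.pack(f">{k}Q", *run)  # k big-endian unsigned 64-bit values
        digest = hashlib.blake2b(packed, digest_size=8).digest()
        result.append(int.from_bytes(digest, "big"))
    return result


def count_matches(supers1, supers2):
    """Number of positions at which two supersample vectors agree."""
    return sum(x == y for x, y in zip(supers1, supers2))


if __name__ == "__main__":
    a = list(range(84))
    b = list(a)
    b[0] = 10**6  # one differing sample spoils only the first run of 14
    print(count_matches(supersamples(a, 14), supersamples(b, 14)))
```

A supersample matches only if all k samples in its run match, which is what makes the match probability $p^k$.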
In previous work relating to the Alta Vista search engine, the 6-shingling of a normalized version of the text of each document was extracted as the feature set. Features were represented as 64-bit integers. The technique of using linear congruential permutations was applied to each 64-bit integer, producing a new set of 64-bit integers, and the pre-image of the smallest value in the new set was chosen as a sample. 84 samples were taken, divided into six supersamples combining fourteen samples each. Thus, for the parameters k and s, the values k=14 and s=6 were used. These parameter choices were made because the desired similarity threshold for near-duplicate documents was 0.95. The probability of fourteen samples matching between two documents is equal to the similarity of the documents raised to the fourteenth power, so that if the documents are near-duplicates, the probability will be $0.95^{14}$, which is approximately one-half. With six groups of fourteen samples, it is therefore likely that at least two groups out of the six will match, and it is unlikely that fewer than two groups will be a match. Thus, to decide that the documents were probably near-duplicates, two out of six supersamples were required to match. The previous work was effective in practice in identifying near-duplicate items in accordance with the desired threshold.
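The effect of the parameter choices k=14 and s=6 can be checked numerically. The sketch below assumes that the supersamples match independently, so that the number of matching supersamples follows a binomial distribution with success probability $p^k$; the function names are hypothetical.

```python
from math import comb


def prob_at_least(m, s, q):
    """Probability that at least m of s independent supersamples match,
    when each one matches with probability q."""
    return sum(comb(s, j) * q**j * (1 - q) ** (s - j) for j in range(m, s + 1))


def near_duplicate_probability(p, k=14, s=6, m=2):
    """Probability that two documents of similarity p share at least m of
    their s supersamples of k samples each."""
    return prob_at_least(m, s, p**k)


if __name__ == "__main__":
    for p in (0.99, 0.95, 0.90, 0.80):
        print(p, round(p**14, 4), round(near_duplicate_probability(p), 4))
```

Under these assumptions the acceptance probability is roughly 0.88 at a similarity of 0.95 and falls below 0.03 at a similarity of 0.80, so the two-out-of-six rule separates near-duplicates from merely similar documents fairly sharply.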
In the previous work it was found that the matching process could be simplified to a small number of hash table lookups per item. The k samples in a group are compressed into a 64-bit integer. As is explained further below, each supersample is recorded with 64-bit precision in order to avoid accidental agreement with dissimilar documents. All $\binom{s}{2}$ of the possible pairs of the s 64-bit integers are then inserted into hash tables. If s=6, for example, finding items that match at least two runs requires only $\binom{6}{2} = 15$ lookups, so 15 hash tables are used.
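The pair-based lookup can be sketched with one hash table per pair of supersample positions. The class name `PairIndex` and its methods are hypothetical, and in-memory Python dictionaries stand in for whatever hash table layout the previous work actually used.

```python
from collections import defaultdict
from itertools import combinations


class PairIndex:
    """One hash table per pair of supersample positions (15 tables when s=6)."""

    def __init__(self, s=6):
        self.pairs = list(combinations(range(s), 2))
        self.tables = [defaultdict(list) for _ in self.pairs]

    def add(self, doc_id, supers):
        """Insert every pair of the document's supersamples into its table."""
        for table, (i, j) in zip(self.tables, self.pairs):
            table[(supers[i], supers[j])].append(doc_id)

    def candidates(self, supers):
        """Documents agreeing with supers in at least two positions,
        found with one lookup per table."""
        found = set()
        for table, (i, j) in zip(self.tables, self.pairs):
            found.update(table.get((supers[i], supers[j]), ()))
        return found


if __name__ == "__main__":
    index = PairIndex(s=6)
    index.add("A", [1, 2, 3, 4, 5, 6])
    print(index.candidates([1, 2, 9, 9, 9, 9]))  # shares positions 0 and 1 with A
    print(index.candidates([1, 9, 9, 9, 9, 9]))  # shares only one position
```

Keying each table by a pair of positions means that a document sharing only one supersample with the query is never returned.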
The previous work was focused on the use of an offline process, so economizing main memory was not a primary concern. The hashing optimization described in the previous paragraph, for example, is well-suited to offline processing. The per-document storage requirements of this technique are unacceptable, however, for a search engine that performs it “on the fly” for all the documents. In a five-billion-document collection, this would entail a memory footprint of 240 gigabytes to store the six 64-bit supersample values per document, and an additional 520 gigabytes to store the hash tables. This would be unwieldy at search execution and would impose constraints on index construction. For example, a search engine may perform no preprocessing pass on the full document collection and may incrementally build its index. It may be desirable for the search engine to determine, in an online process, which query results about to be reported are near-duplicates so that the reporting can be reduced to a single document per near-duplicate cluster, using a ranking function to choose that document, which allows the most responsive document to be chosen dynamically.
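The 240-gigabyte figure follows from simple arithmetic, shown in the sketch below. It assumes decimal gigabytes (10^9 bytes) and eight bytes per 64-bit supersample; the function name is hypothetical. The 520-gigabyte figure for the hash tables depends on table layout details that are not given here, so it is not reproduced.

```python
def supersample_storage_bytes(num_docs, s=6, bits=64):
    """Bytes needed to store s supersamples of the given width per document."""
    return num_docs * s * (bits // 8)


if __name__ == "__main__":
    total = supersample_storage_bytes(5 * 10**9)
    print(total / 10**9, "gigabytes")  # 5e9 documents * 6 values * 8 bytes
```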