A search engine may periodically update itself by using a tool called a web crawler. The web crawler may continuously crawl a network to examine network documents, such as web pages, to determine which of the network documents are linked to others of the network documents and to determine what has changed in the network documents since the web crawler previously crawled them. Typically, web crawlers store the content of network documents, as well as information concerning the links within the network documents. Usually, network documents do not change very often. When a network document does change, much of the network document remains unchanged.
One technique that was developed to determine whether changes occurred in documents is MinHashing. MinHashing picks a consistent sample from a set. Using the MinHashing technique to determine whether documents are similar, each document may be viewed as a set of elements. The elements may be, for example, words, numbers, links, and/or other items included in the documents. Each of the elements of each of the sets may be hashed multiple times, using different hashes, to produce multiple groups of hash values, which are consistent, uniformly distributed, non-negative random numbers for each of the sets. One may then compute a minimum among the hash values in each of the multiple groups. When a predetermined number of the computed minima of a first set match the corresponding computed minima of a second set, the documents corresponding to the sets may be considered to be duplicates or near-duplicates. The MinHashing technique determines duplicate, or near-duplicate, documents in O(N) time for N documents.
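The MinHashing steps described above can be illustrated with a minimal Python sketch. It is not a definitive implementation: the function names (`hash_value`, `minhash_signature`, `matching_minima`, `near_duplicates`), the use of seeded BLAKE2b as the family of "different hashes", the choice of whitespace-separated words as elements, and the defaults of 64 hashes and 48 required matches are all assumptions made for illustration.

```python
import hashlib


def hash_value(element: str, seed: int) -> int:
    """Map (seed, element) to a 64-bit non-negative integer.

    Assumption: each seed stands in for one of the 'different hashes'.
    """
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(elements: set[str], num_hashes: int = 64) -> list[int]:
    """Return one minimum per hash function (one per group of hash values)."""
    if not elements:
        raise ValueError("cannot sample from an empty set")
    return [
        min(hash_value(element, seed) for element in elements)
        for seed in range(num_hashes)
    ]


def matching_minima(sig_a: list[int], sig_b: list[int]) -> int:
    """Count positions at which the two signatures hold the same minimum."""
    return sum(a == b for a, b in zip(sig_a, sig_b))


def near_duplicates(doc_a: str, doc_b: str,
                    num_hashes: int = 64, required_matches: int = 48) -> bool:
    """Treat documents as sets of words and compare their minima.

    required_matches plays the role of the 'predetermined number';
    its value here is arbitrary.
    """
    sig_a = minhash_signature(set(doc_a.split()), num_hashes)
    sig_b = minhash_signature(set(doc_b.split()), num_hashes)
    return matching_minima(sig_a, sig_b) >= required_matches
```

Because the hash of an element depends only on the element and the seed, two documents containing the same set of words produce identical signatures regardless of word order, which is the sense in which the sample is consistent.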
A disadvantage of the MinHashing technique is that the MinHashing technique treats all portions of documents equally. Because there may be overlap in unimportant portions of documents, differences in more important portions of the documents may be difficult, if not impossible, to detect. As a result, a weighted consistent sampling technique was developed.
Using the weighted consistent sampling technique, each of the elements has an associated weight, which is a positive integer value. Additional elements may be injected into a set based on the weights associated with the elements of the set. For example, if a set includes elements {“the”, “of”, “conflagration”} having respective weights of {1, 1, 1000}, then additional elements are inserted into the set such that the number of elements representing the element “conflagration” is equal to the associated weight. Thus, for example, “conflagration 1”, “conflagration 2”, . . . “conflagration 999” may be inserted as elements into the set, so that, together with the original element, 1000 elements represent “conflagration”. A single hash may then be applied to each of the elements of each of the sets to produce multiple groups of hash values, which are consistent, uniformly distributed random numbers for each of the sets. One may then compute a minimum among the hash values in each of the multiple groups. When a predetermined number of the computed minima of a first set match the corresponding computed minima of a second set, the documents corresponding to the sets may be considered to be duplicates or near-duplicates. The weighted consistent sampling technique described above determines duplicate, or near-duplicate, documents in a time period that is linear in the number of elements of the sets, including injected elements, and therefore exponential with respect to the size of the input, because a weight represented with b bits can be as large as 2^b − 1. That is, the time to process the elements, x, of a set S, in which each of the elements has an associated weight, w(x), is proportional to $\sum_{x \in S} w(x)$.
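The element-injection scheme above, and the reason its cost grows with the sum of the weights, can be sketched as follows. This is an illustrative sketch only: the names `expand_by_weight` and `weighted_sample`, the seeded BLAKE2b hash, and the `seed` parameter are assumptions, and the naming of injected elements (“conflagration 1”, “conflagration 2”, and so on) simply follows the example in the text.

```python
import hashlib


def hash_value(element: str, seed: int = 0) -> int:
    """Map (seed, element) to a 64-bit non-negative integer (assumed hash)."""
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def expand_by_weight(weights: dict[str, int]) -> list[tuple[str, str]]:
    """Inject replicas so each element is represented w(x) times.

    Returns (replica_label, original_element) pairs. The list has
    sum(weights.values()) entries, which is what makes this approach costly.
    """
    expanded = []
    for element, weight in weights.items():
        if not isinstance(weight, int) or weight < 1:
            raise ValueError("weights must be positive integers")
        expanded.append((element, element))
        expanded.extend((f"{element} {i}", element) for i in range(1, weight))
    return expanded


def weighted_sample(weights: dict[str, int], seed: int = 0) -> str:
    """Return the original element whose replica has the minimum hash value."""
    expanded = expand_by_weight(weights)
    _, source = min(expanded, key=lambda pair: hash_value(pair[0], seed))
    return source


weights = {"the": 1, "of": 1, "conflagration": 1000}
expanded = expand_by_weight(weights)
# One hash evaluation is needed per expanded element, per hash function.
hash_evaluations = len(expanded)
```

In this sketch a heavily weighted element such as “conflagration” contributes most of the replicas and so is far more likely to supply the minimum, but every replica must still be hashed, so the work per sample equals the sum of the weights rather than the number of distinct elements.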