§1.1 Field of the Invention
The present invention concerns information management and retrieval in general. More specifically, the present invention concerns detecting, and optionally removing, duplicate and near-duplicate information or content, such as in a repository of documents to be searched for example.
§1.2 Background Information
In the following, the term “document(s)” should be broadly interpreted and may include content such as Web pages, text files, multimedia files, object features, link structure, etc. Also, it should be noted that when near-duplicate documents are detected, exact duplicate documents will also be detected as a consequence (though such exact duplicates might not necessarily be distinguished from near-duplicates).
Detecting near-duplicate documents has many potential applications. For example, duplicate or near-duplicate documents may indicate plagiarism or copyright infringement. One important application of near-duplicate document detection is in the context of information storage and retrieval.
Efficient techniques to detect documents that are exact duplicates exist. Detecting whether or not documents are near-duplicates is more difficult, particularly in large collections of documents. For example, the Internet, collectively, includes literally billions of “Web site” documents.
Sources of duplicate and near-duplicate documents on the Internet are introduced in §1.2.1 below. Then, problems that these duplicate and near-duplicate documents raise, both for end-users and for entities assisting end-users, are introduced in §1.2.2 below. Finally, previous techniques for detecting duplicate and near-duplicate documents in the context of large document collections, as well as perceived shortcomings of such techniques, are introduced in §1.2.3 below.
§1.2.1 Sources of Duplicate and Near-Duplicate Documents on the Internet
On the Internet, the World Wide Web (referred to as “the Web”) may include the same document duplicated in different forms or at different places. (Naturally, other networks, or even stand alone systems, may have duplicate documents.) Sources of such duplication are introduced here.
First, some documents are “mirrored” at different sites on the Web. Such mirroring is used to alleviate potential delays when many users attempt to request the same document at the same time, and/or to minimize network latency (e.g., by caching Web pages locally).
Second, some documents will have different versions with different formatting. For example, a given document may have plain text and HTML (hyper-text markup language) versions so that users can render or download the content in a form that they prefer. As more and more different devices (e.g., computers, mobile phones, personal digital assistants, etc.) are used to access the Internet, a given document may have more and more different versions with different formatting (text only, text plus other media, etc.).
Third, documents are often prepended or appended with information related to their location on the Web, the date, the date they were last modified, a version, a title, a hierarchical classification path (e.g., a Web page may be classified under more than one class within the hierarchy of a Web site), etc.
Fourth, in some instances a new document is generated from an existing document using a consistent word replacement. For example, a Web site may be “re-branded” for different audiences by using word replacement.
Finally, some Web pages aggregate or incorporate content available from another source on the Web.
§1.2.2 Problems Raised by Duplicate and Near-Duplicate Documents
Duplicate and near-duplicate documents raise potential problems for both people accessing information (e.g., from the Web) and entities helping people to access desired information (e.g., search engine companies). These potential problems are introduced below.
Although people continue to use computers to enter, manipulate and store information, in view of developments in data storage, internetworking (e.g., the Internet), and interlinking and cross referencing of information (e.g., using hyper-text links), people are using computers (or more generally, information access machines) to access information to an ever increasing extent.
Search engines have been employed to help users find desired information. Search engines typically search databased content or “Websites” or “Web pages” pursuant to a user query. In response to a user's query, a rank-ordered list, which typically includes brief descriptions of the uncovered content, as well as hyper-text links (i.e., text, having associated URLs) to the uncovered content, is returned. The rank-ordering of the list is typically based on a match between words appearing in the query and words appearing in the content.
From the perspective of users, duplicate and near-duplicate documents raise problems. More specifically, when users submit a query to a search engine, most do not want links to (and descriptions of) Web pages which have largely redundant information. For example, search engines typically respond to search queries by providing groups of ten results. If pages with duplicate content were returned, many of the results in one group may include the same content. Thus, there is a need for techniques to avoid providing search results associated with (e.g., having links to) Web pages having duplicate content.
From the perspective of entities hosting search engines, duplicate and near-duplicate documents also raise problems, one of which is giving end-users what they want. To appreciate some of the other potential problems raised by duplicate and near-duplicate documents, some search engine technology is introduced first.
Most search engines perform three main functions: (i) crawling the Web; (ii) indexing the content of the Web; and (iii) responding to a search query using the index to generate search results. Given the large amount of information available, these three main functions are automated to a large extent. While the crawl operation will associate words or phrases with a document (e.g., a Web page), the indexing operation will associate document(s) (e.g., Web page(s)) with words or phrases. The search operation then (i) uses that index to find documents (e.g., Web pages) containing various words of a search query, and (ii) ranks or orders the documents found in accordance with some heuristic(s).
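The indexing and search operations described above can be illustrated with a minimal sketch of an inverted index. The function names `build_index` and `search` are hypothetical, whitespace-separated lowercase words stand in for real term extraction, and the sketch assumes that a query matches only documents containing every query word. Ranking is omitted.

```python
from collections import defaultdict


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Associate each word with the ids of the documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of documents containing every query word (assumed AND semantics)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    "a": "near duplicate pages",
    "b": "duplicate content",
    "c": "web pages",
}
index = build_index(docs)
print(search(index, "duplicate pages"))
```

If documents "a" and "b" were near-duplicates of one another, both would occupy space in such an index and both could be returned for the same query, which is the storage and result-quality problem discussed in this section.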
Recall that the Web may include the same documents duplicated in different forms or at different places on the Web. For example, as introduced in §1.2.1 above, documents may be “mirrored” at different sites on the Web, documents may have a number of different formats so that users can render or download the content in a form that they prefer, documents may have different versions with different information prepended or appended, some documents may have been generated from others using consistent word replacement, and some documents may aggregate or incorporate documents available from another source on the Web. It would be desirable to eliminate such duplicates or near-duplicates. Besides eliminating duplicate or near-duplicate documents to meet user expectations and wishes, eliminating duplicate or near-duplicate documents is desirable to search engine hosting entities to (i) reduce storage requirements (e.g., for the index and data structures derived from the index), and (ii) reduce the time and/or computational resources needed to process indexes, queries, etc.
In view of the foregoing, techniques to detect (and eliminate) near-duplicate documents are needed.
§1.2.3 Known Techniques for Detecting Duplicate and Near-Duplicate Documents and their Perceived Limitations
A naive solution would be to compare all pairs of documents. Since this is prohibitively expensive on large datasets, Manber (U. Manber, “Finding similar files in a large file system,” Proc. of the USENIX Winter 1994 Technical Conference (January 1994)) and Heintze (N. Heintze, “Scalable Document Fingerprinting,” Proc. of the 2nd USENIX Workshop on Electronic Commerce (November 1996)) proposed algorithms for detecting near-duplicate documents that reduced the number of comparisons. Both algorithms work on sequences of adjacent characters. Brin et al. (S. Brin, J. Davis, and H. Garcia-Molina, “Copy Detection Mechanisms for Digital Documents,” 1995 ACM SIGMOD International Conference on Management of Data, pp. 398-409 (May 1995)) started to use word sequences to detect copyright violations. Broder et al. (A. Broder, S. Glassman, M. Manasse, and G. Zweig, “Syntactic Clustering of the Web,” 6th International World Wide Web Conference, pp. 93-404 (April 1997), incorporated herein by reference) also used word sequences to efficiently find near-duplicate Web pages. Later, Charikar (M. S. Charikar, “Similarity Estimation Techniques from Rounding Algorithms,” 34th Annual ACM Symposium on Theory of Computing (May 2002), incorporated herein by reference. See also U.S. Patent Application Publication 2006/0101060, also incorporated herein by reference.) developed an approach based on random projections of the words in a document. Recently, Hoad and Zobel (T. C. Hoad and J. Zobel, “Methods for identifying versioned and plagiarised documents,” Journal of the American Society for Information Science and Technology, 54(3), pp. 203-215 (2003)) developed and compared methods for identifying versioned and plagiarized documents. Unfortunately, however, the technique recommended by Hoad and Zobel is inefficient, having O(N^2) computational complexity, where N is the number of documents to be compared with a document of interest.
§1.2.3.1 Introduction of the Broder and Charikar Algorithms for Document Similarity
In both the Broder and Charikar algorithms, each HTML page is converted into a token sequence. The two algorithms differ only in how they convert the token sequence into a bit string representing the page.
To convert an HTML page into a token sequence, all HTML markup in the page is replaced by white space or, in case of formatting instructions, ignored. Then every maximal alphanumeric sequence is considered a term and is hashed using Rabin's fingerprinting scheme (M. Rabin, “Fingerprinting by random polynomials,” Report TR-15-81, Center for Research in Computing Technology, Harvard University (1981), incorporated herein by reference) to generate tokens, with two exceptions.
Both algorithms generate a bit string from the token sequence of a page and use it to determine the near-duplicates for the page.
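The conversion of an HTML page into a token sequence might be sketched as follows. This is a simplified illustration under several assumptions: a crude regular expression stands in for real HTML parsing, all markup is replaced by white space (formatting instructions are not treated separately), the two exceptions mentioned above are not modeled, and a 64-bit BLAKE2b digest is swapped in for Rabin's fingerprinting scheme. The names `hash64` and `tokenize` are hypothetical.

```python
import hashlib
import re

TAG_RE = re.compile(r"<[^>]*>")        # crude markup matcher (assumption)
TERM_RE = re.compile(r"[A-Za-z0-9]+")  # maximal alphanumeric sequences


def hash64(data: bytes) -> int:
    """64-bit hash used here as a stand-in for a Rabin fingerprint."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def tokenize(html: str) -> list[int]:
    """Replace markup with white space, then hash each term into a token."""
    text = TAG_RE.sub(" ", html)
    return [hash64(term.encode("utf-8")) for term in TERM_RE.findall(text)]


tokens = tokenize("<p>Hello <b>big</b> world</p>")
print(len(tokens))
```

Because markup is discarded before terms are extracted, two pages that differ only in their tags yield the same token sequence in this sketch.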
Let n be the length of the token sequence of a page. Using the Broder algorithm, every sub-sequence of k tokens (where the sub-sequences overlap) is fingerprinted using 64-bit Rabin fingerprints, which results in a sequence of (n−k+1) fingerprints, called “shingles”. Let S(d) be the set of shingles of the page “d”. The Broder algorithm makes the assumption that the percentage of unique shingles on which the two pages d and d′ agree is a good measure of their similarity. That is, the Broder algorithm assumes that

$$\frac{|S(d) \cap S(d')|}{|S(d) \cup S(d')|}$$

is a good measure for the similarity of d and d′.
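The shingle-set overlap just described, namely the number of shingles two pages share divided by the number of unique shingles in either page, can be computed directly for small inputs. In the following sketch, tuples of k tokens stand in for fingerprinted shingles, and the names `shingles` and `resemblance` are hypothetical.

```python
def shingles(tokens, k):
    """Return the set of all overlapping k-token sub-sequences."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def resemblance(tokens_a, tokens_b, k):
    """Shared shingles divided by unique shingles across both pages."""
    sa, sb = shingles(tokens_a, k), shingles(tokens_b, k)
    union = sa | sb
    # Two pages with no shingles at all are treated as identical (assumption).
    return len(sa & sb) / len(union) if union else 1.0


page = "a rose is a rose is a rose".split()
# n = 8 tokens and k = 4 give n - k + 1 = 5 sub-sequences, 3 of them unique.
print(len(shingles(page, 4)))        # 3
print(resemblance([1, 2, 3, 4], [1, 2, 3, 5], 2))  # 0.5
```

In the second call, each page has three 2-token shingles, two of which are shared, so the ratio is 2/4.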
The foregoing may be approximated by fingerprinting every shingle with m different fingerprinting functions fi for 1 ≤ i ≤ m. This leads to (n−k+1) values for each fi. For each i, the smallest of these values is called “the i-th minvalue” and is stored at the page. As a result, the Broder algorithm creates an m-dimensional vector of minvalues. Broder et al. showed that the expected percentage of entries in the minvalues vector that two pages d and d′ agree on is equal to the percentage of unique shingles on which d and d′ agree. Thus, to estimate the similarity of two pages, it suffices to determine the percentage of agreeing entries in the minvalues vectors. To save space and speed up the similarity computation, the m-dimensional vector of minvalues might be reduced to an m′-dimensional vector of supershingles by fingerprinting non-overlapping sequences of minvalues. Let m be divisible by m′ and let l=m/m′. The concatenation of minvalues j*l, . . . , (j+1)*l−1 might be fingerprinted for 0 ≤ j < m′ with yet another fingerprinting function. (Note that the notion of “megashingles” was also introduced in the Broder et al. paper in order to further speed up the algorithm. Megashingles are generated by fingerprinting every pair of supershingles, such that each megashingle is a fingerprinted pair of supershingles. Since, however, using megashingles does not improve precision or recall, they need not be used.) Two pages are near-duplicates under the Broder algorithm (referred to here as “B-similar”) if and only if their supershingle vectors agree in at least two supershingles. The number of identical entries in the supershingle vectors of two pages is their “B-similarity”. The parameters for the Broder algorithm are m, m′, and k.
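The minvalue and supershingle steps described above might be sketched as follows. This is an illustration under stated assumptions rather than the implementation of Broder et al.: a seeded 64-bit BLAKE2b hash stands in for each fingerprinting function, the names `h64`, `minvalues`, `supershingles`, `b_similarity` and `b_similar` are hypothetical, and the parameter values in the usage lines are small illustrative choices.

```python
import hashlib


def h64(data: bytes, seed: int) -> int:
    """Seeded 64-bit hash; a stand-in for one fingerprinting function."""
    salt = seed.to_bytes(8, "big")
    return int.from_bytes(
        hashlib.blake2b(data, digest_size=8, salt=salt).digest(), "big")


def minvalues(tokens, k, m):
    """For each of m hash functions, keep the smallest value over all shingles.

    Assumes the page has at least k tokens, so at least one shingle exists.
    """
    shingle_bytes = [repr(tuple(tokens[i:i + k])).encode()
                     for i in range(len(tokens) - k + 1)]
    return [min(h64(s, i) for s in shingle_bytes) for i in range(m)]


def supershingles(mins, m_prime):
    """Hash non-overlapping groups of m/m' minvalues into m' supershingles."""
    group = len(mins) // m_prime  # assumes m is divisible by m'
    extra_seed = len(mins)        # a seed not used by the m minvalue hashes
    return [h64(repr(mins[j * group:(j + 1) * group]).encode(), extra_seed)
            for j in range(m_prime)]


def b_similarity(ss_a, ss_b):
    """Number of positions at which two supershingle vectors agree."""
    return sum(x == y for x, y in zip(ss_a, ss_b))


def b_similar(ss_a, ss_b):
    """Near-duplicates if at least two supershingles agree."""
    return b_similarity(ss_a, ss_b) >= 2


page = "the quick brown fox jumps over the lazy dog near the river bank".split()
ss = supershingles(minvalues(page, k=3, m=12), m_prime=4)
print(b_similarity(ss, ss))
```

Since a supershingle agrees only when every minvalue in its group agrees, requiring two agreeing supershingles is a considerably stricter test than comparing individual minvalues.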
As can be appreciated from the foregoing, the Broder algorithm is order dependent (e.g., since the shingles are fingerprints of overlapping sub-sequences), but is independent of the frequency of the shingles.
The Charikar algorithm is now described. Let b be a constant. Each token is projected into b-dimensional space by randomly choosing b entries from {−1, 1}. The resulting b-dimensional vector may be referred to as a “token vector”. The same tokens, whether occurring on the same page or on different pages, will have the same b-dimensional representation (i.e., the same “token vector”). For each page, a representative b-dimensional vector (which may be referred to as an “initial page vector”) is created by adding the projections of all the tokens in the page's token sequence (i.e., adding all of the page's “token vectors”). The final vector for the page (which may be referred to as a “final page vector”) is created by setting every positive entry in the vector to 1 and every non-positive entry to 0. This generates a random b-dimensional projection (i.e., a final page vector) for each page. The final page vectors have the property that the cosine similarity of two pages is proportional to the number of bits in which the two corresponding projections agree. That is, similarity in the Charikar algorithm (referred to as “C-similarity”) of two Web pages is the number of bits their projections agree on. Two pages are near-duplicates in the Charikar algorithm (or are C-similar) if the number of agreeing bits in their projections is above a fixed threshold t.
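The projection steps of the Charikar algorithm described above might be sketched as follows. As an assumption made for this illustration, the random choice of entries for each token is replaced by bits of a BLAKE2b digest of the token, so that the same token always yields the same token vector; this limits b to at most 512 in the sketch. The names `token_vector`, `final_page_vector` and `c_similarity` are hypothetical.

```python
import hashlib


def token_vector(token: str, b: int) -> list[int]:
    """Map a token to b entries from {-1, 1}, derived from its hash bits."""
    digest = hashlib.blake2b(token.encode("utf-8"),
                             digest_size=(b + 7) // 8).digest()
    bits = int.from_bytes(digest, "big")
    return [1 if (bits >> i) & 1 else -1 for i in range(b)]


def final_page_vector(tokens, b=64):
    """Add all token vectors, then set positive entries to 1 and others to 0."""
    initial = [0] * b
    for token in tokens:
        for i, value in enumerate(token_vector(token, b)):
            initial[i] += value
    return [1 if entry > 0 else 0 for entry in initial]


def c_similarity(vec_a, vec_b):
    """Number of bits on which two final page vectors agree."""
    return sum(x == y for x, y in zip(vec_a, vec_b))


page = "to be or not to be".split()
vec = final_page_vector(page)
print(c_similarity(vec, final_page_vector(sorted(page))))
```

Because the initial page vector is a sum, reordering the tokens leaves the result unchanged, while repeating a token increases its weight in every dimension.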
As can be appreciated from the foregoing, given the definition of minvalues (from which supershingles, and perhaps even megashingles, are generated), Broder's technique uses representations based on a subset of the words (or tokens) of the original document being analyzed. On the other hand, Charikar's technique uses representations based on all words (not removed by preprocessing) (or tokens) of the original document being analyzed. That is, Charikar's technique might consider all words (or tokens) of documents accepted as inputs. Further, Broder's technique uses set intersection to determine whether or not documents are near-duplicates. On the other hand, Charikar's technique uses random projections to determine whether or not documents are near-duplicates.
As can be appreciated from the foregoing, in both algorithms pages with the same token sequence are assigned the same bit string. The Charikar algorithm ignores the order of the tokens (given the additive aspect of generating a page vector from token vectors). The shingles of the Broder algorithm are based on the order of the tokens. However, the Broder algorithm ignores the frequency of shingles. On the other hand, the Charikar algorithm accounts for the frequency of terms (again, given the additive aspect of generating a page vector from token vectors). For both algorithms there can be false positives (non near-duplicate pairs returned as near-duplicates) as well as false negatives (near-duplicate pairs not returned as near-duplicates).
Let T be the sum of the number of tokens in all documents and let D be the number of documents. The Broder algorithm takes time O(Tm+Dm′)=O(Tm). The Charikar algorithm takes time O(Tb) to determine the vector for each page. As described below, the C-similar pairs might be computed using a trick similar to supershingles. It takes time O(D) so that the total time for the Charikar algorithm is O(Tb).
Some embodiments consistent with the present invention might further (d) process the set of documents to determine a third set of near-duplicate documents using the second document similarity technique, and (e) determine a fourth set of near-duplicate documents by determining the union of the second set of near-duplicate documents and the third set of near-duplicate documents.
§1.2.3.2 Evaluation of the Broder and Charikar Algorithms
The present inventor evaluated the Broder and Charikar algorithms on 1.6 B distinct Web pages, according to three criteria—(1) precision on a random subset, (2) the distribution of the number of term differences per near-duplicate pair, and (3) the distribution of the number of near-duplicates per page. All parameters in the Broder algorithm were set as suggested in the literature. The parameters in the Charikar algorithm were chosen so that it used the same amount of space per document and returned about the same number of correct near-duplicate pairs (i.e., had about the same recall).
The present inventor found that the Charikar algorithm achieved a precision of 0.50, while the Broder algorithm achieved a precision of 0.38. Both algorithms were found to perform about the same for near-duplicate pairs on the same site (low precision) and for near-duplicate pairs on different sites (high precision). However, over 90% of the near-duplicate pairs found by the Broder algorithm belonged to the same site, but only 74% of the near-duplicate pairs found by the Charikar algorithm belonged to the same site.
Thus, the Charikar algorithm found more of the near-duplicate pairs for which precision is high. The number of term differences per near-duplicate pair was found to be very similar for the two algorithms, but the Broder algorithm returned fewer pairs with extremely large term differences. The distribution of the number of near-duplicates per page was found to follow a power-law for both algorithms. However, the Broder algorithm was found to have a higher “spread” around the power-law curve. The present inventor believes that a possible reason for that “noise” is that the bit string representing a page in the Broder algorithm is based on a randomly selected subset of terms in the page. Thus, there might be “lucky” and “unlucky” choices, leading to false near-duplicate pairs or missing actual near-duplicate pairs. The Charikar algorithm does not select a subset of terms but is based on all terms in the page.
The present inventor found that neither of the algorithms worked well for finding near-duplicate pairs on the same Website, though both achieved high precision for near-duplicate pairs on different Websites.
In view of the foregoing, it would be useful to provide improved techniques for finding near-duplicate documents. It would be useful if such techniques improved the precision of the Broder and Charikar algorithms. Finally, it would be useful if such techniques worked well for finding near-duplicate pairs on the same Website, as well as on different Websites.