xc2xa71.1 Field of the Invention
The present invention concerns information management and retrieval in general. More specifically, the present invention concerns detecting, and optionally removing, duplicate and near-duplicate information or content, such as in a repository of documents to be searched for example.
xc2xa71.2 Related Art
In the following, the term xe2x80x9cdocument(s)xe2x80x9d should be broadly interpreted and may include content such as Web pages, text files, multimedia files, object features, link structure, etc. Also, it should be noted that when near-duplicate documents are detected, exact duplicate documents will also be detected as a consequence (though such exact duplicates might not necessarily be distinguished from near-duplicates).
Detecting near-duplicate documents has many potential applications. For example, duplicate or near-duplicate documents may indicate plagiarism or copyright infringement. One important application of near-duplicate document detection is in the context of information storage and retrieval.
Efficient techniques to detect documents that are exact duplicates exist. Detecting whether or not documents are near-duplicates is more difficult, particularly in large collections of documents. For example, the Internet, collectively, includes literally billions of xe2x80x9cWeb sitexe2x80x9d documents.
Sources of duplicate and near-duplicate documents on the Internet are introduced in xc2xa71.2.1 below. Then, problems that these duplicate and near-duplicate documents raise, both for end-users and for entities assisting end-users, are introduced in xc2xa71.2.2 below. Finally, previous techniques for detecting duplicate and near-duplicate documents in the context of large document collections, as well as perceived shortcomings of such techniques, are introduced in xc2xa71.2.3 below.
On the Internet, the World Wide Web (referred to as xe2x80x9cthe Webxe2x80x9d) may include the same document duplicated in different forms or at different places. (Naturally, other networks, or even stand alone systems, may have duplicate documents.) Sources of such duplication are introduced here.
First, some documents are xe2x80x9cmirroredxe2x80x9d at different sites on the Web. Such mirroring is used to alleviate potential delays when many users attempt to request the same document at the same time, and/or to minimize network latency (e.g., by caching Web pages locally).
Second, some documents will have different versions with different formatting. For example, a given document may have plain text and HTML (hyper-text markup language) versions so that users can render or download the content in a form that they prefer. As more and more different devices (e.g., computers, mobile phones, personal digital assistants, etc.) are used to access the Internet, a given document may have more and more different versions with different formatting (text only, text plus other media, etc.).
Third, documents are often prepended or appended with information related to its location on the Web, the date, the date it was last modified, a version, a title, a hierarchical classification path (e.g., a Web page may be classified under more than one class within the hierarchy of a Web site), etc. An example of such near-duplicate documents is illustrated in xc2xa74.4 below, with reference to FIGS. 13 through 18.
Fourth, in some instances a new document is generated from an existing document using a consistent word replacement. For example, a Web site may be xe2x80x9cre-brandedxe2x80x9d for different audiences by using word replacement.
Finally, some Web pages aggregate or incorporate content available from another source on the Web.
Duplicate and near-duplicate documents raise potential problems for both people accessing information (e.g., from the Web) and entities helping people to access desired information (e.g., search engine companies). These potential problems are introduced below.
Although people continue to use computers to enter, manipulate and store information, in view of developments in data storage, internetworking (e.g., the Internet), and interlinking and cross referencing of information (e.g., using hyper-text links), people are using computers (or more generally, information access machines) to access information to an ever increasing extent.
Search engines have been employed to help users find desired information. Search engines typically search databased content or xe2x80x9cWeb sitesxe2x80x9d pursuant to a user query. In response to a user""s query, a rank-ordered list, which typically includes brief descriptions of the uncovered content, as well as hyper-texts links (i.e., text, having associated URLs) to the uncovered content, is returned. The rank-ordering of the list is typically based on a match between words appearing in the query and words appearing in the content.
From the perspective of users, duplicate and near-duplicate documents raise problems. More specifically, when users submit a query to a search engine, most do not want links to (and descriptions of) Web pages which have largely redundant information. For example, search engines typically respond to search queries by providing groups of ten results. If pages with duplicate content were returned, many of the results in one group may include the same content. Thus, there is a need for a technique to avoid providing search results associated with (e.g., having links to) Web pages having duplicate content.
From the perspective of entities hosting search engines, duplicate and near-duplicate documents also raise problemsxe2x80x94giving end-users what they want, being one of them. To appreciate some of the other potential problems raised by duplicate and near-duplicate documents, some search engine technology is introduced first.
Most search engines perform three main functions: (i) crawling the Web; (ii) indexing the content of the Web; and (iii) responding to a search query using the index to generate search results. Given the large amount of information available, these three main functions are automated to a large extent. While the crawl operation will associate words or phrases with a document (e.g., a Web page), the indexing operation will associate document(s) (e.g., Web page(s)) with words or phrases. The search operation then (i) uses that index to find documents (e.g., Web pages) containing various words of a search query, and (ii) ranks or orders the documents found in accordance with some heuristic(s).
Recall that the Web may include the same documents duplicated in different forms or at different places on the Web. For example, as introduced in xc2xa71.2.1 above, documents may be xe2x80x9cmirroredxe2x80x9d at different sites on the Web, documents may have a number of different formats so that users can render or download the content in a form that they prefer, documents may have a different versions with different information prepended or appended, some documents may have been generated from others using consistent word replacement, and some documents may aggregate or incorporate documents available from another source on the Web. It would be desirable to eliminate such duplicates or near-duplicates. Aside from eliminating duplicate or near-duplicate documents to meet user expectations and wishes, eliminating duplicate or near-duplicate documents is desirable to search engine hosting entities to (i) reduce storage requirements (e.g., for the index and data structures derived from the index), and (ii) reduce resources needed to process indexes, queries, etc.
In view of the foregoing, techniques to detect (and eliminate) near-duplicate documents are needed.
Some previous techniques for detecting duplicate and near-duplicate documents involve generating so-called xe2x80x9cfingerprintsxe2x80x9d for elements (e.g., paragraphs, sentences, words, or shingles (i.e., overlapping stretches of consecutive words)) of documents. See, e.g., the articles: A. Z. Broder, xe2x80x9cOn the Resemblance and Containment of Documents,xe2x80x9d Proceedings of Compression and Complexity of Sequences 1997, pp. 21-27, IEEE Computer Society (1988); and S. Brin et al., xe2x80x9cCopy Detection Mechanisms for Digital Documents,xe2x80x9d Proceedings of the ACM SIGMOD Annual Conference, San Jose 1995 (May 1995). Some or all of the generated fingerprints could be used in a duplicate/near-duplicate determination. More specifically, two documents would be considered to be near-duplicates if they share more than a predetermined number (at least two, and generally much higher) of fingerprints. That is, such methods determine when documents share multiple common fingerprints. Generally, if the predetermined number is too low, too many false positives would be generated.
For a large collection of documents (e.g., billions of documents to be indexed by a search engine), this determination becomes quite expensive, computationally and in terms of storage. See, e.g., the article, M. Fang et al., xe2x80x9cComputing Iceberg Queries Efficiently,xe2x80x9d Proc. 24thInt""l. Conf. On Very Large Databases, pp. 299-310 (1998). This problem is not easily overcome. For example, it is not especially useful to xe2x80x9cpreprocessxe2x80x9d the representations of such documents used in the Broder technique to eliminate from further consideration, fingerprints known to be unique. This is because even documents with non-unique fingerprints (i.e., documents remaining after such preprocessing) may, nonetheless, have no near-duplicate documents. Thus, a better duplicate/near-duplicate determination technique is needed.
The present invention may detect near-duplicate documents by (i) for each document, generating fingerprints, (ii) determining near-duplicate documents based on the fingerprints. In one embodiment, the fingerprints may be preprocessed to eliminate those that only occur in one document. In such an embodiment, only the remaining fingerprints would be used when determining near-duplicate documents.
The act of generating fingerprints for each document may be effected by (i) extracting parts (e.g., words) from the documents, (ii) hashing each of the extracted parts to determine which of a predetermined number of lists is to be populated with a given part, and (iii) for each of the lists, generating a fingerprint.
In response to the detected duplicate documents, the present invention may also function to eliminate duplicate documents.
The present invention may function to generate clusters of near-duplicate documents, in which a transitive property is assumed. Each document may have an identifier for identifying a cluster with which it is associated. In this alternative, in response to a search query, if two candidate result documents belong to the same cluster and if the two candidate result documents match the query equally well, only the one deemed more likely to be relevant (e.g., by virtue of a high Page rank, being more recent, etc.) is returned.
In the context of a search engine, the present invention may also be used during a crawling operation to speed up the crawling and to save bandwidth by not crawling near-duplicate Web pages or sites, as determined from documents uncovered in a previous crawl. Further, by reducing the number of Web pages or sites crawled, the present invention can be used to reduce storage requirements of downstream stored data structures. The present invention may also be used after the crawl such that if more than one document are near duplicates, then only one is indexed. The present invention can instead be used later, in response to a query, in which case a user is not annoyed with near-duplicate search results. The present invention may also be used to xe2x80x9cfixxe2x80x9d broken links. That is, if a document (e.g., a Web page) doesn""t exist (at a particular location or URL) anymore, a link to a near-duplicate page can be provided.