§ 1.1 Field of the Invention
The present invention concerns information management and retrieval in general. More specifically, the present invention concerns detecting, and optionally removing, duplicate and near-duplicate information or content, such as in a repository of documents to be searched for example.
§ 1.2 Related Art
In the following, the term “document(s)” should be broadly interpreted and may include content such as Web pages, text files, multimedia files, object features, link structure, etc. Also, it should be noted that when near-duplicate documents are detected, exact duplicate documents will also be detected as a consequence (though such exact duplicates might not necessarily be distinguished from near-duplicates).
Detecting near-duplicate documents has many potential applications. For example, duplicate or near-duplicate documents may indicate plagiarism or copyright infringement. One important application of near-duplicate document detection is in the context of information storage and retrieval.
Efficient techniques to detect documents that are exact duplicates exist. Detecting whether or not documents are near-duplicates is more difficult, particularly in large collections of documents. For example, the Internet, collectively, includes literally billions of “Web site” documents.
Sources of duplicate and near-duplicate documents on the Internet are introduced in § 1.2.1 below. Then, problems that these duplicate and near-duplicate documents raise, both for end-users and for entities assisting end-users, are introduced in § 1.2.2 below. Finally, previous techniques for detecting duplicate and near-duplicate documents in the context of large document collections, as well as perceived shortcomings of such techniques, are introduced in § 1.2.3 below.
§ 1.2.1 Sources of Duplicate and Near-Duplicate Documents on the Internet
On the Internet, the World Wide Web (referred to as “the Web”) may include the same document duplicated in different forms or at different places. (Naturally, other networks, or even stand alone systems, may have duplicate documents.) Sources of such duplication are introduced here.
First, some documents are “mirrored” at different sites on the Web. Such mirroring is used to alleviate potential delays when many users attempt to request the same document at the same time, and/or to minimize network latency (e.g., by caching Web pages locally).
Second, some documents will have different versions with different formatting. For example, a given document may have plain text and HTML (hyper-text markup language) versions so that users can render or download the content in a form that they prefer. As more and more different devices (e.g., computers, mobile phones, personal digital assistants, etc.) are used to access the Internet, a given document may have more and more different versions with different formatting (text only, text plus other media, etc.).
Third, documents are often prepended or appended with information related to its location on the Web, the date, the date it was last modified, a version, a title, a hierarchical classification path (e.g., a Web page may be classified under more than one class within the hierarchy of a Web site), etc. An example of such near-duplicate documents is illustrated in § 4.4 below, with reference to FIGS. 13 through 18.
Fourth, in some instances a new document is generated from an existing document using a consistent word replacement. For example, a Web site may be “re-branded” for different audiences by using word replacement.
Finally, some Web pages aggregate or incorporate content available from another source on the Web.
§ 1.2.2 Problems Raised by Duplicate and Near-Duplicate Documents
Duplicate and near-duplicate documents raise potential problems for both people accessing information (e.g., from the Web) and entities helping people to access desired information (e.g., search engine companies). These potential problems are introduced below.
Although people continue to use computers to enter, manipulate and store information, in view of developments in data storage, internetworking (e.g., the Internet), and interlinking and cross referencing of information (e.g., using hyper-text links), people are using computers (or more generally, information access machines) to access information to an ever increasing extent.
Search engines have been employed to help users find desired information. Search engines typically search databased content or “Web sites” pursuant to a user query. In response to a user's query, a rank-ordered list, which typically includes brief descriptions of the uncovered content, as well as hyper-texts links (i.e., text, having associated URLs) to the uncovered content, is returned. The rank-ordering of the list is typically based on a match between words appearing in the query and words appearing in the content.
From the perspective of users, duplicate and near-duplicate documents raise problems. More specifically, when users submit a query to a search engine, most do not want links to (and descriptions of) Web pages which have largely redundant information. For example, search engines typically respond to search queries by providing groups of ten results. If pages with duplicate content were returned, many of the results in one group may include the same content. Thus, there is a need for a technique to avoid providing search results associated with (e.g., having links to) Web pages having duplicate content.
From the perspective of entities hosting search engines, duplicate and near-duplicate documents also raise problems—giving end-users what they want, being one of them. To appreciate some of the other potential problems raised by duplicate and near-duplicate documents, some search engine technology is introduced first.
Most search engines perform three main functions: (i) crawling the Web; (ii) indexing the content of the Web; and (iii) responding to a search query using the index to generate search results. Given the large amount of information available, these three main functions are automated to a large extent. While the crawl operation will associate words or phrases with a document (e.g., a Web page), the indexing operation will associate document(s) (e.g., Web page(s)) with words or phrases. The search operation then (i) uses that index to find documents (e.g., Web pages) containing various words of a search query, and (ii) ranks or orders the documents found in accordance with some heuristic(s).
Recall that the Web may include the same documents duplicated in different forms or at different places on the Web. For example, as introduced in § 1.2.1 above, documents may be “mirrored” at different sites on the Web, documents may have a number of different formats so that users can render or download the content in a form that they prefer, documents may have a different versions with different information prepended or appended, some documents may have been generated from others using consistent word replacement, and some documents may aggregate or incorporate documents available from another source on the Web. It would be desirable to eliminate such duplicates or near-duplicates. Aside from eliminating duplicate or near-duplicate documents to meet user expectations and wishes, eliminating duplicate or near-duplicate documents is desirable to search engine hosting entities to (i) reduce storage requirements (e.g., for the index and data structures derived from the index), and (ii) reduce resources needed to process indexes, queries, etc.
In view of the foregoing, techniques to detect (and eliminate) near-duplicate documents are needed.
§ 1.2.3 Known Techniques for Detecting Duplicate and Near-Duplicate Documents
Some previous techniques for detecting duplicate and near-duplicate documents involve generating so-called “fingerprints” for elements (e.g., paragraphs, sentences, words, or shingles (i.e., overlapping stretches of consecutive words)) of documents. See, e.g., the articles: A. Z. Broder, “On the Resemblance and Containment of Documents,” Proceedings of Compression and Complexity of Sequences 1997, pp. 21-27, IEEE Computer Society (1988); and S. Brin et al., “Copy Detection Mechanisms for Digital Documents,” Proceedings of the ACM SIGMOD Annual Conference, San Jose 1995 (May 1995). Some or all of the generated fingerprints could be used in a duplicate/near-duplicate determination. More specifically, two documents would be considered to be near-duplicates if they share more than a predetermined number (at least two, and generally much higher) of fingerprints. That is, such methods determine when documents share multiple common fingerprints. Generally, if the predetermined number is too low, too many false positives would be generated.
For a large collection of documents (e.g., billions of documents to be indexed by a search engine), this determination becomes quite expensive, computationally and in terms of storage. See, e.g., the article, M. Fang et al., “Computing Iceberg Queries Efficiently,” Proc. 24th Int'l. Conf. On Very Large Databases, pp. 299-310 (1998). This problem is not easily overcome. For example, it is not especially useful to “preprocess” the representations of such documents used in the Broder technique to eliminate from further consideration, fingerprints known to be unique. This is because even documents with non-unique fingerprints (i.e., documents remaining after such preprocessing) may, nonetheless, have no near-duplicate documents. Thus, a better duplicate/near-duplicate determination technique is needed.