§1.1 Field of the Invention
The present invention concerns information retrieval in general. More specifically, the present invention concerns detecting and/or removing duplicate information or duplicate content in response to, and based on, an information search query.
§1.2 Related Art
§1.2.1 The Migration from Data Entry, Manipulation and Storage, to Information Access
The ways in which people use computing machines have evolved over the last 50 or so years. The proliferation of networks, along with the increased availability of inexpensive data storage means, has afforded computer users unprecedented access to a wealth of content. Such content may be presented to a user (or “rendered”) in the form of text, images, audio, video, etc.
Although people continue to use computers to enter, manipulate and store information, in view of the foregoing developments, people are using computers (or more generally, information access machines) to access information to an ever increasing extent. Unfortunately, however, the very vastness of available information, which has attracted many users, can also overwhelm them. Consequently, desired information can become difficult to find.
§1.2.2 Known Techniques for Finding Desired Information
Various techniques have been employed to help users locate desired information. In the context of the Internet for example, some services have organized content based on a hierarchy of categories. A user may then navigate through a series of hierarchical menus to find content that may be of interest to them. An example of such a service is the YAHOO™ web site on the Internet.
Again in the context of the Internet for example, some services provide “search engines” which search content or “web sites” pursuant to a user query. In response to a user's query, a rank ordered list, which typically includes brief descriptions of the content, as well as hyper-text links (i.e., text, having associated URLs) to the content, is returned. The rank ordering of the list is typically based on a degree of match between words appearing in the query and words appearing in the content.
§1.2.2.1 Automated Indexing and its Perceived Shortcomings
Most search engines perform three main functions: (i) crawling the World Wide Web; (ii) indexing the content of the World Wide Web; and (iii) responding to a search query using the index to generate search results. The crawl operation collects web pages. The indexing operation associates document(s) (e.g., web page(s)) with words or phrases, and also creates an inverted index which associates words or phrases with documents. The search operation then (i) uses that inverted index to find documents (e.g., web pages) containing various words of a search query, and (ii) ranks or orders the documents found in accordance with some heuristic(s). Given the large amount of information available, these three main functions are automated to a large extent.
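The indexing and search operations described above can be illustrated with a minimal sketch. It is not taken from any particular search engine; the function names (`tokenize`, `build_inverted_index`, `search`), the whitespace tokenizer, and the ranking heuristic (number of distinct query words matched) are all assumptions chosen for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_inverted_index(documents):
    """Map each word to the set of ids of the documents that contain it.

    `documents` is a dict of {doc_id: text}, standing in for crawled pages.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents matching any query word, best match first.

    Ranking heuristic (an assumption): count of distinct query words the
    document contains, with ties broken by document id.
    """
    scores = defaultdict(int)
    for word in set(tokenize(query)):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For example, with three hypothetical pages `{"d1": "company x reports earnings", "d2": "company y reports losses", "d3": "weather today"}`, the query `"company x"` ranks `d1` ahead of `d2` and omits `d3`.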
Although it is believed that automating the indexing operation is the only way to make searching a large amount of diverse material feasible, automating indexing operations introduces some challenges. More specifically, one of the problems of automated indexing is that the World Wide Web may include the same information duplicated in different forms or at different places on the World Wide Web. For example, some content is “mirrored” at different sites on the World Wide Web. Such mirroring is used to alleviate potential delays when many users attempt to request the same information at the same time, and/or to minimize network latency (e.g., by caching web pages locally). Some content will have plain text and HTML (hyper-text markup language) versions so that users can render or download the content in a form that they prefer. Finally, some web pages aggregate or incorporate content available from another source on the World Wide Web.
When users submit a query to a search engine, most users do not want links to (and descriptions of) web pages that have duplicate information. For example, search engines typically respond to search queries by providing groups of ten results. If pages with duplicate content were returned, many of the results in one group may include the same content. Thus, there is a need for a technique to avoid returning search results that point to web pages having duplicate content.
Some duplicate avoidance techniques are effected during the automated indexing operation. Similar documents can be flagged by (i) defining a similarity measure between two documents, and (ii) defining the two documents as “duplicates” if the similarity measure exceeds a predetermined threshold.
Unfortunately, however, duplicate information may often be found in documents that are not exactly the same or even very similar. For example: (i) identical content may be presented with different formatting (e.g., plain text versus HTML); (ii) different headers and/or footers may be prepended and/or appended, respectively, to identical content; (iii) hit counters may be appended to identical content; (iv) last modified dates may be appended to identical content; and (v) one web site may include a copy of content found elsewhere (e.g., as a part of a compilation or aggregation of content, or simply as an insertion). Cases (ii)-(iv) are illustrated by the Venn diagrams of FIGS. 1 and 2. FIG. 1 illustrates the case where a second document merely adds a small amount of information (e.g., a counter, a footer, etc.) to a first document, whereas FIG. 2 illustrates the case where a second document slightly changes some information (e.g., a last modified date) of a first document. The present invention may be used to detect such “duplicates” with slight changes.
Furthermore, the present invention may be used to detect duplicate content within documents that have a lot of different information, such as documents with different formatting codes or documents that aggregate or incorporate other content. Many prior techniques are not well-suited for such cases. For example, assume that documents A and B each contain basic financial information about companies. Assume further that document A has information on 50 companies, while document B has information on 100 companies, at least some of which are the same as those in document A. (For example, document B could be a later, expanded version of document A.) The Venn diagrams of FIGS. 3 and 4 illustrate such examples.
Many known document similarity techniques would not consider documents A and B to be very similar even though they may contain a lot of identical content. A user searching for information about the 50 companies included in document A, however, would likely become frustrated if a search engine provides links not only to document A, but also to other documents (e.g., document B) that contain the same information about the 50 companies. The articles, A. Broder et al., “Syntactic Clustering of the Web,” Proc. 6th International WWW Conference (1997), A. Broder et al., “Filtering Near-Duplicate Documents,” FUN '98 and A. Broder et al., “On the Resemblance and Containment of Documents,” SEQUENCES '98, pp. 21-29 (hereafter referred to as “the Broder articles”) describe a method (hereafter referred to as “the Broder method”) for detecting duplicate documents. The Broder method may be used to find documents that are “roughly the same” and “roughly contained” in each other. More specifically, for each pair of documents, the Broder method generates a number that indicates the extent to which the documents appear to be related. A threshold is then used to determine whether or not the two documents are related enough (or similar enough) to be declared “duplicates”. The Broder method, however, does not consider the specific information that a user is looking for in its analysis.
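A simplified sketch of whole-document comparison in the spirit of the Broder method follows. It computes resemblance and containment over exact sets of word shingles (contiguous word sequences); the Broder articles additionally reduce these sets to small sketches for efficiency, which is omitted here. The function names, the shingle width `w=4`, and the threshold `0.9` are assumptions for illustration only.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    words = text.lower().split()
    if len(words) < w:
        # A text shorter than one shingle is treated as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Shared shingles divided by all shingles: 'roughly the same'."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b, w=4):
    """Fraction of a's shingles found in b: a is 'roughly contained' in b."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 0.0


def is_duplicate(a, b, threshold=0.9, w=4):
    """Declare two whole documents duplicates if resemblance meets the threshold."""
    return resemblance(a, b, w) >= threshold
```

Note that neither measure takes a query as input, which is the limitation identified above: the resulting number is the same regardless of what the user is looking for.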
In view of the foregoing, there is a need for an improved duplicate detection technique. Such a technique should be automated so that processing a large amount of content from a large number of sources is feasible.
The present invention provides an improved duplicate detection technique that uses query-relevant information to limit the portion(s) of documents to be compared for similarity. In other words, before comparing two documents for similarity, the content of these documents may be condensed based on the query. In one embodiment, query-relevant part(s) (also referred to as “snippets” in one embodiment) are extracted from the documents and only the extracted query-relevant part(s), rather than the entire documents, are compared for purposes of determining similarity.
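One possible reading of this query-specific comparison is sketched below. The text above does not specify how snippets are extracted or which similarity measure is applied, so the choices here are assumptions: a snippet is every sentence that mentions a query word, similarity is the Jaccard overlap of the snippets' word sets, and the threshold is `0.8`. All function and variable names are hypothetical.

```python
import re


def words(text):
    """Lowercased alphanumeric tokens of a text."""
    return re.findall(r"\w+", text.lower())


def extract_snippet(text, query):
    """Keep only the sentences that mention at least one query word."""
    terms = set(words(query))
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if terms & set(words(s))]


def jaccard(a_words, b_words):
    """Overlap of two word collections: |A and B| / |A or B|."""
    sa, sb = set(a_words), set(b_words)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def query_specific_duplicates(doc_a, doc_b, query, threshold=0.8):
    """Compare only the query-relevant parts of the two documents."""
    snip_a = extract_snippet(doc_a, query)
    snip_b = extract_snippet(doc_b, query)
    if not snip_a or not snip_b:
        return False  # at least one document has nothing relevant to compare
    sim = jaccard(words(" ".join(snip_a)), words(" ".join(snip_b)))
    return sim >= threshold


# Hypothetical documents sharing one identical passage about "Zeta".
doc_a = "Acme revenue rose ten percent. Zeta revenue fell two percent."
doc_b = ("Zeta revenue fell two percent. Orion profit doubled. "
         "Borealis hired staff. Cygnus opened offices.")

print(query_specific_duplicates(doc_a, doc_b, "Zeta"))  # True
print(jaccard(words(doc_a), words(doc_b)) >= 0.8)       # False
```

In this example the two documents fall well below the threshold when compared in full, yet are treated as duplicates for a query about "Zeta" because their query-relevant sentences are identical.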
As can be appreciated from the foregoing summary, an improved duplicate detection technique under the present invention is preferably performed after indexing, during the processing of a particular search query. However, in systems in which at least some indexing is performed after receiving (or processing) a query, the present invention may be performed before such indexing.
By limiting the portion(s) of the documents being compared, a large range of duplicate document types, including those that would be missed by conventional similarity determination techniques, will be detected. Further, since only portion(s) of the documents are compared, the similarity threshold can be set relatively higher, thereby decreasing the number of documents that would be falsely identified as duplicates if a lower threshold were used.
In the example set forth above, further assume that documents A and B include identical information about company X (see the Venn diagrams in FIGS. 5 and 6), and that a user submits a query about company X. In accordance with the present invention, documents A and B would be considered duplicates with respect to a query about company X. Referring to FIG. 5, even prior art methods that can determine containment would probably conclude that document B is not “contained” in document A, notwithstanding the fact that both are similar (or even the same) with respect to company X. Referring to FIG. 2, assume that both the first and second documents contain information about company X, albeit different information. The query-specific method of the present invention may find that the two documents are not similar (with respect to company X). On the other hand, most, if not all, known techniques would find these documents similar since such techniques do not consider query-relevant information in their analysis.
Note that aside from documents that match each other exactly, whether or not documents are duplicates is somewhat subjective and application specific. Although the term “duplicates” should be broadly interpreted, it should be understood that one goal of the present invention may be, in the context of a search engine for example, to avoid annoying users with different versions of information that add little or no value to the user once one of the versions is interpreted by the user.