Information retrieval systems are widely used to search for information on a given subject. Web search systems are one example of an information retrieval system. Users submit a query to the web search system and obtain a list of results comprising links to documents that are relevant to the entered query.
However, the web contains many duplicate and near-duplicate documents. Given that user satisfaction is negatively affected by redundant information in search results, a significant amount of research has been devoted to developing duplicate detection algorithms. Most such algorithms, however, rely solely on document content to detect duplication or redundancy, ignoring the fact that a primary goal of duplicate detection is to identify documents that contain redundant information with respect to a given user query.
Previous techniques for identifying duplicates are based on identifying similarities between document contents. Since discovering all possible duplicate documents in a document set of size N requires O(N²) pairwise comparisons, efficiency and accuracy are the two main concerns of existing algorithms. The simplest approach for detecting exact duplicates is based on a fingerprint, which is a succinct digest of the characters in a document. When the fingerprints of two documents are identical, the documents are further compared, and identical documents are identified as duplicates. This technique does not identify near duplicates: web pages that are not identical but are still very similar in content. Previous algorithms for identifying near duplicates are based on generating n-gram vectors from documents and computing a similarity score between these vectors using a chosen similarity metric. If the similarity between two documents is above a threshold, the two documents are considered to be near duplicates of each other.
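By way of illustration only, the content-based approach described above can be sketched as follows. The sketch makes several assumptions that are not taken from any particular known system: SHA-256 is used as the fingerprint, documents are represented as sets of word trigrams rather than weighted n-gram vectors, Jaccard overlap is used as the similarity metric, and the threshold of 0.5 is arbitrary. The function names are hypothetical.

```python
import hashlib
from itertools import combinations


def fingerprint(text: str) -> str:
    """Succinct digest of the characters in a document (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def word_ngrams(text: str, n: int = 3) -> set:
    """Set of word n-grams; a simplified stand-in for an n-gram vector."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity, one possible choice of similarity metric."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_duplicates(docs: dict, n: int = 3, threshold: float = 0.5):
    """Return (exact, near) lists of document-id pairs.

    Every pair is compared, so the loop performs O(N^2) comparisons.
    """
    prints = {doc_id: fingerprint(text) for doc_id, text in docs.items()}
    grams = {doc_id: word_ngrams(text, n) for doc_id, text in docs.items()}
    exact, near = [], []
    for a, b in combinations(docs, 2):
        # Identical fingerprints trigger a full comparison to confirm.
        if prints[a] == prints[b] and docs[a] == docs[b]:
            exact.append((a, b))
        elif jaccard(grams[a], grams[b]) >= threshold:
            near.append((a, b))
    return exact, near
```

Note that the sketch never consults a user query: two documents are compared on their contents alone, which is the property of known techniques discussed in this section.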
All these techniques for duplicate detection are based on using the contents of the documents. Methods that depend solely on similarities in document contents do not identify documents that contain similar information with respect to a user need. That is, in most cases, duplicate detection is aimed at identifying documents that are of the same utility to an end user. However, when only document contents are used for duplicate detection, utility is ignored. Two documents can be of the same utility (containing duplicate information) even if their contents are different. For example, two newspaper articles describing exactly the same event but in different words are often duplicates of each other, and hence users who have read one of them may not be interested in reading the other. Furthermore, two documents can be of different utility to an end user even if their contents are very similar. For example, two documents containing a biography of Britney Spears, identically written except that one contains the birthday of Britney Spears while the other does not, are not duplicates of each other when the goal of the user is to find out Britney Spears' age.
The embodiments described below are not limited to implementations which solve any or all of the disadvantages of known information retrieval systems.