Information retrieval systems, generally called search engines, are used to search large collections of documents. In some information retrieval systems, documents are added under the supervision of editors or others to ensure that multiple different versions of a document are not introduced into the document collection. However, for search engines operating on the Internet, there is no such supervisory control. Accordingly, it is typical that a particular document, or a portion thereof, appears in a number of different versions or forms in various online repositories. This generally results in multiple versions of a document being included in the search results for a given query. Because the inclusion of different versions of the same document does not provide additional useful information, this increase in the number of search results does not benefit users. Also, search results that include different versions of the same document may crowd out diverse content that should be included. These problems seriously affect the quality of the search results provided by a search engine.
Another problem arises in systems in which multiple versions of documents are present. A document in a document collection will often have a number of citations to it from other documents. This is particularly the case for academic documents, legal documents, and the like. The number of citations to a document (its citation count) is often reflective of the importance, significance, or quality of the document. Where different versions of a document are present in a repository, each with its own citation count, the citations to the document are divided among the versions. A user therefore does not have an accurate assessment of the actual significance, importance, or quality of the document based on the individual citation counts.
For these reasons, it would be desirable to identify documents in a document collection that are different versions of the same document. It would also be desirable to manage these documents in an efficient manner such that the search engine can furnish the most appropriate and reliable search results.