Information retrieval systems, generally called search engines, are used to search large collections of documents. In some information retrieval systems, documents are added under the supervision of editors or others, who ensure that when a document exists in multiple versions, only one version is introduced into the document collection. For search engines operating on the Internet, however, there is no such supervisory control. Accordingly, a particular document, or a portion thereof, typically appears in a number of different versions or forms in various online repositories. This generally results in multiple versions of a document being included in the search results for any given query. Because different versions of the same document provide no additional useful information, the resulting increase in the number of search results does not benefit users. Search results that include different versions of the same document may also crowd out more diverse content that should be included. Furthermore, where multiple versions of a document are present in the search results, the user may not know which version is the most authoritative, the most complete, or the best to access, and thus may waste time accessing the different versions in order to compare them. These problems seriously degrade the quality of the search results provided by a search engine.
For these reasons, it would be desirable to identify a primary version from among the different versions of the same document in a document collection. It would also be desirable to use the primary version to represent the document, so that the search engine can furnish the most appropriate and reliable search result.