This invention relates to text processing and, more particularly, to determining the originality of content.
A search engine allows users to search for relevant documents contained in a corpus of documents. Typically, the search engine generates a list of documents in response to a search query. The order in which the documents in the list are presented is typically dependent on the relevance, or rank, of each document.
A particular document in the corpus can be ranked based on the extent to which other documents in the corpus reference the particular document. The explicit references (e.g., hyperlinks) of all documents in the corpus can be counted and recorded to determine the rank of a document. Counting the explicit references to a document, however, does not capture whether the content of the document is unique with respect to the other documents. Some documents may contain identical or nearly identical content. Search results that include a document can also contain all of its copies or near copies. Even though each copy is a separate document, each copy provides little or no further information to information-seeking users. The proliferation of search results that contain nearly identical content can obscure other search results that contain unique content.
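By way of illustration only, the reference counting described above can be sketched as follows. The function names, the dictionary representation of the corpus, and the document identifiers are hypothetical and do not come from any particular search engine. The sketch also assumes that each referencing document is counted at most once and that a document's references to itself are ignored.

```python
def count_inbound_references(links):
    """Count, for each document, how many other documents reference it.

    `links` (hypothetical representation) maps a document id to the ids
    of the documents it explicitly references, e.g. via hyperlinks.
    """
    counts = {doc: 0 for doc in links}
    for source, targets in links.items():
        # Assumption: a source counts once per target; self-references are ignored.
        for target in set(targets):
            if target != source:
                counts[target] = counts.get(target, 0) + 1
    return counts


def rank_by_references(links):
    """Order document ids by descending inbound reference count, then by id."""
    counts = count_inbound_references(links)
    return sorted(counts, key=lambda doc: (-counts[doc], doc))


# "c_copy" stands for a verbatim copy of "c"; nothing in `links` records that fact.
links = {"a": ["b", "c"], "b": ["c"], "c": [], "c_copy": []}
print(count_inbound_references(links))
print(rank_by_references(links))
```

In this sketch the count for a document depends only on the references made to it. The content of the documents is never examined, so a copy such as "c_copy" is treated as an unrelated document and nothing marks it as a duplicate of "c".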