The World Wide Web (“web”) contains a vast amount of information that is ever-changing. Existing web-based information retrieval systems use web crawlers to identify information on the web. A web crawler is a program that exploits the link-based structure of the web to browse the web in a methodical, automated manner.
A web crawler may start with a list of addresses (e.g., Uniform Resource Locators (URLs)) to visit. For each address on the list, the web crawler may visit the document associated with the address. The web crawler may identify outgoing links within the visited document and add the addresses associated with these links to the list of addresses.
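The crawl loop described above can be illustrated with a minimal sketch. It is not taken from any particular crawler: the names `crawl`, `LinkExtractor`, `fetch`, and `max_documents`, the breadth-first visiting order, and the in-memory `PAGES` table of example addresses are all assumptions made for illustration. A real crawler would retrieve documents over the network and would also need to handle politeness limits, errors, and re-crawling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_addresses, fetch, max_documents=100):
    """Visit documents breadth-first, starting from the seed addresses.

    `fetch` is a hypothetical callable that returns the document text for
    an address, or None if the document cannot be retrieved.
    """
    to_visit = deque(seed_addresses)  # the list of addresses to visit
    seen = set(seed_addresses)        # addresses already placed on the list
    visited = []                      # addresses visited, in order

    while to_visit and len(visited) < max_documents:
        address = to_visit.popleft()
        document = fetch(address)
        if document is None:
            continue
        visited.append(address)

        # Identify outgoing links and add their addresses to the list.
        extractor = LinkExtractor()
        extractor.feed(document)
        for link in extractor.links:
            target = urljoin(address, link)  # resolve relative links
            if target not in seen:
                seen.add(target)
                to_visit.append(target)
    return visited


# A tiny in-memory stand-in for the web (hypothetical addresses).
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

order = crawl(["http://example.com/"], PAGES.get)
```

The `seen` set keeps an address from being added to the list more than once, so the loop terminates even when documents link back to one another.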
An indexer creates an index of the documents crawled by the web crawler. A problem that indexers face is how to handle duplicate content on the web. For example, the same document may appear duplicated or substantially duplicated in different forms or at different places on the web. Also, spammers often copy document content and pass it off as their own.
It is undesirable for the indexer to index all of the duplicate documents. For example, indexing duplicate documents wastes space in the index. Also, indexing duplicate documents, and thus making them available for serving as search results, leads to an undesirable experience for the user. A user does not want to be presented with multiple documents containing the same, or substantially the same, content.
Thus, given a set of duplicate documents, an indexer may select one of these documents to index. Determining which of the duplicate documents to index is not an easy task because it would be undesirable for the indexer to select a document belonging to a spammer.
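As an illustration of how sets of duplicate documents might be formed in the first place, the sketch below groups documents whose content is identical after simple normalization. The fingerprinting approach, the function names `content_fingerprint` and `group_duplicates`, and the example addresses are assumptions and are not described in the text above. The sketch detects only exact duplicates after lowercasing and whitespace collapsing; substantially duplicated documents would require a near-duplicate technique. It also does not address which document in a set should be indexed.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(documents):
    """Return sorted lists of addresses that share a content fingerprint."""
    groups = defaultdict(list)
    for address, text in documents.items():
        groups[content_fingerprint(text)].append(address)
    # Keep only the sets that contain more than one document.
    return [sorted(addresses) for addresses in groups.values() if len(addresses) > 1]


# Hypothetical crawled documents, keyed by address.
documents = {
    "http://example.com/story": "Breaking news:  the web is large.",
    "http://mirror.example.org/story": "breaking news: the web is large.",
    "http://example.net/other": "An unrelated document.",
}
duplicate_sets = group_duplicates(documents)
```

Here the first two documents differ only in capitalization and spacing, so they fall into the same set, while the third document is left on its own.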