Crawler-based search engines may employ various algorithms to identify documents on the World Wide Web (“web”) relevant to search terms contained in a user's query. Typically, a crawler-based search engine will include a crawler, an indexer and a search engine. The crawler is a software tool that searches the web for content (e.g., documents). The crawler may be provided with a seed list of addresses (e.g., Uniform Resource Locators (URLs) or some other form of Uniform Resource Identifier (URI)) to visit based on one or more search criteria. The crawler may visit a document corresponding to an address in the seed list and/or reference a robots.txt file (e.g., on a web site) that identifies documents the crawler should not access. As the crawler spiders a document, the crawler may, among other things, extract outgoing links to other documents (e.g., hyperlinks) that are associated with the visited document. These outgoing links or addresses may be added to the seed list. The process of visiting documents may be repeated until the crawler decides to stop. The crawler may periodically return to these addresses so that, if changes have been made to these documents, the index may be updated.
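By way of illustration only, the crawl loop described above may be sketched as follows. The sketch makes several assumptions that are not part of the description: the web is replaced by a hypothetical in-memory mapping (FAKE_WEB), the robots.txt rules are reduced to a tuple of disallowed address prefixes (DISALLOWED_PREFIXES), and the stopping decision is a simple document budget (max_documents). No network access is performed.

```python
from collections import deque

# Hypothetical in-memory "web": address -> outgoing links found in that document.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/private/x"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
    "http://example.com/private/x": ["http://example.com/secret"],
}

# Assumed stand-in for rules a crawler would read from robots.txt.
DISALLOWED_PREFIXES = ("http://example.com/private/",)


def crawl(seed_list, max_documents=100):
    """Visit addresses breadth-first, starting from the seed list.

    Returns the addresses visited, in visit order.
    """
    frontier = deque(seed_list)   # addresses still to visit
    seen = set(seed_list)         # addresses already queued at some point
    visited = []

    # Stop when the frontier is empty or the (assumed) budget is spent.
    while frontier and len(visited) < max_documents:
        address = frontier.popleft()

        # Skip documents the robots rules ask the crawler not to access.
        if address.startswith(DISALLOWED_PREFIXES):
            continue

        visited.append(address)

        # Extract outgoing links and add unseen ones to the list of addresses.
        for link in FAKE_WEB.get(address, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)

    return visited
```

In this sketch a disallowed document is never visited, so links that appear only in that document are never discovered. A crawler that periodically re-visits addresses would need additional scheduling state, which is omitted here.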
The indexer may create an index of the documents crawled by the crawler. For example, the indexer may catalog and maintain a copy of every document that the crawler discovers and/or a location of or a pointer to the document (e.g., a URL). The search engine may sort through the information in the index and present the user with the most relevant results in a particular order (e.g., a descending order of relevance).
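As one possible illustration of the indexer and search engine roles described above, the following sketch builds a simple inverted index and ranks documents in descending order of a relevance score. The function names (build_index, search) and the scoring rule (the sum of query-term occurrence counts) are assumptions chosen for brevity; the description does not prescribe any particular index structure or relevance measure.

```python
from collections import Counter, defaultdict


def build_index(documents):
    """Build an inverted index from a mapping of address -> document text.

    Returns a mapping of term -> {address: occurrence count}.
    """
    index = defaultdict(dict)
    for address, text in documents.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][address] = count
    return index


def search(index, query):
    """Return addresses matching any query term, most relevant first.

    Relevance here is an assumed toy score: the total number of times the
    query terms occur in the document. Ties are broken by address.
    """
    scores = Counter()
    for term in query.lower().split():
        for address, count in index.get(term, {}).items():
            scores[address] += count
    return sorted(scores, key=lambda address: (-scores[address], address))
```

For example, given documents whose text is "web crawler crawler" and "web index", a query of "crawler web" would list the first document ahead of the second, since the first contains more occurrences of the query terms.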
A problem that the indexer confronts is how to handle duplicate content on the web. For example, the same document may appear duplicated or substantially duplicated in different forms or at different places (e.g., different URLs) on the web. Indexing duplicate documents is undesirable because it may produce search results that contain multiple documents having the same, or substantially the same, content, which a user generally does not want to be presented with. Further, indexing duplicate documents wastes resources (e.g., memory, processing, etc.).
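The description above does not state how duplicate or substantially duplicate documents are recognized. Purely as a hedged illustration, the sketch below shows two commonly used techniques: grouping exact duplicates by a hash of whitespace- and case-normalized text, and estimating substantial duplication with the Jaccard similarity of word shingles. All names (fingerprint, group_duplicates, shingles, jaccard) and the shingle size of three words are assumptions, not features of the described system.

```python
import hashlib
from collections import defaultdict


def fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(documents):
    """Group addresses whose normalized content is identical.

    documents maps address -> text. Returns a sorted list of sorted address
    lists, one per group containing more than one address.
    """
    groups = defaultdict(list)
    for address, text in documents.items():
        groups[fingerprint(text)].append(address)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


def shingles(text, k=3):
    """Return the set of k-word shingles (word tuples) in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(text_a, text_b, k=3):
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    return len(a & b) / len(a | b)
```

Under this sketch, two documents at different URLs that differ only in spacing or capitalization fall into the same group, while documents that share most but not all of their shingles yield a similarity between 0 and 1 that could be compared against a chosen threshold.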
Given a set of duplicate documents, an indexer may select one of these documents to index. However, determining which of the duplicate documents to index can be difficult. Additionally, given the volume of documents that an indexer may be processing, differences in freshness among the documents may exist. For example, while the crawler may re-visit documents to determine if changes have been made, the crawler may not re-visit all of the documents at the same time and/or with the same frequency.