Crawler-based search engines may employ various algorithms to identify documents on the World Wide Web (“web”) relevant to search terms contained in a user's query. Typically, a crawler-based search engine will include a crawler, an indexer and a search engine. The crawler is a software tool that searches the web for content (e.g., documents) to deliver to the indexer. The crawler may be provided with a seed list of addresses (e.g., Uniform Resource Locators (URLs) or some other form of Uniform Resource Identifier (URI)). The crawler may visit a document corresponding to an address in the seed list and/or reference a robots.txt file (e.g., on a web site) that identifies documents that the crawler is not to access. As the crawler accesses a document, the crawler may, among other things, extract outgoing links (e.g., hyperlinks) from the visited document to other documents. The addresses of these outgoing links may be added to the seed list. The process of visiting documents may be repeated until the crawler decides to stop. The crawler may periodically return to these addresses so that, if changes have been made to the corresponding documents, the indexer may be updated.
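By way of illustration only, the crawl process described above may be sketched as follows. The sketch assumes a hypothetical in-memory mapping `WEB` in place of network access and a hypothetical set `DISALLOWED` in place of a parsed robots.txt file; the names `crawl` and `max_documents` and the breadth-first visiting order are likewise assumptions and are not taken from any particular crawler.

```python
from collections import deque

# Hypothetical in-memory "web": address -> outgoing links of that document.
# A real crawler would fetch each address over the network instead.
WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/", "http://a.example/private"],
    "http://a.example/private": [],
    "http://b.example/": ["http://a.example/1"],
}

# Hypothetical stand-in for addresses excluded by a robots.txt file.
DISALLOWED = {"http://a.example/private"}


def crawl(seeds, max_documents=100):
    """Visit documents breadth-first, starting from the seed list."""
    frontier = deque(seeds)   # addresses waiting to be visited
    seen = set(seeds)         # addresses already queued, to avoid repeats
    visited = []              # addresses actually visited, in order
    # Assumed stopping rule: stop when the frontier is empty or a cap is hit.
    while frontier and len(visited) < max_documents:
        address = frontier.popleft()
        if address in DISALLOWED or address not in WEB:
            continue          # excluded or unknown address: skip it
        visited.append(address)
        # Extract outgoing links and add new ones to the list of addresses.
        for link in WEB[address]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

In this sketch, `crawl(["http://a.example/"])` visits the three permitted documents and skips the excluded address. A real crawler may use a different visiting order and a different stopping condition.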
The indexer may create an index of the documents accessed by the crawler. For example, the indexer may catalog and maintain a copy of every document that the crawler discovers and/or a location of or a pointer to the document (e.g., a URL). The indexing process may be performed on a single device or on multiple devices. The search engine may sort through the information in the index and present the user with the most relevant results in a particular order (e.g., a descending order of relevance).
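By way of illustration only, one simple form of index is an inverted index that maps each term to the addresses of the documents containing it. The following sketch is a minimal example under stated assumptions: the function names `build_index` and `search` are hypothetical, and the relevance score (a count of query-term occurrences) is a toy measure chosen for brevity, not a ranking method described in this document.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to the set of addresses whose text contains it."""
    index = defaultdict(set)
    for address, text in documents.items():
        for term in text.lower().split():
            index[term].add(address)
    return index


def search(index, documents, query):
    """Return matching addresses in descending order of a toy relevance score."""
    terms = query.lower().split()
    candidates = set()
    for term in terms:
        candidates |= index.get(term, set())

    def score(address):
        # Assumed relevance measure: total occurrences of the query terms.
        words = documents[address].lower().split()
        return sum(words.count(term) for term in terms)

    # Highest score first; ties are broken by address for determinism.
    return sorted(candidates, key=lambda a: (-score(a), a))
```

For example, with `documents = {"u1": "web crawler web index", "u2": "web search"}`, a query for `"web"` returns `"u1"` before `"u2"` because the term occurs twice in the first document.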
A problem that the indexer confronts is how to handle duplicate content on the web. For example, the same document may appear duplicated or substantially duplicated in different forms or at different places (e.g., different URLs) on the web. Indexing duplicate documents is undesirable because it may lead to search results in which the user is presented with multiple documents that contain the same, or substantially the same, content. Further, indexing duplicate documents wastes resources (e.g., memory, processing, etc.).
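By way of illustration only, one common way to test whether two documents are substantially duplicated is to compare their sets of overlapping word sequences ("shingles") using Jaccard similarity. This technique is not described in this document and is offered only as a sketch; the function names, the shingle length `k=3` and the threshold `0.8` are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a case- and space-normalized text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}  # short text: treat the whole text as one shingle
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def is_duplicate(text_a, text_b, threshold=0.8):
    """Treat two texts as duplicates if their shingle similarity is high enough."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold
```

Under this sketch, two texts that differ only in letter case or spacing have a similarity of 1.0, while texts that differ in a few words have a similarity below 1.0 and are treated as duplicates only if the chosen threshold permits.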
Given a set of duplicate documents, various clustering processes may be employed to determine the most relevant documents to be indexed. By clustering documents together, the indexer can select a single document from the cluster to serve as the canonical document for indexing. In this regard, the clustering processes employed may affect the quality of the documents selectable by the indexer and presented as a search result to a user.
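By way of illustration only, clustering and canonical selection may be sketched as follows. This document does not specify a particular clustering process or selection criterion, so both are assumptions here: documents are clustered by a hash of their normalized content (which groups exact duplicates only), and the canonical document is chosen by a hypothetical rule that prefers the shortest address. The function names are also hypothetical.

```python
import hashlib
from collections import defaultdict


def fingerprint(text):
    """Hash of the case- and whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cluster_documents(documents):
    """Group addresses whose normalized content is identical."""
    clusters = defaultdict(list)
    for address, text in documents.items():
        clusters[fingerprint(text)].append(address)
    return list(clusters.values())


def select_canonical(cluster):
    """Pick one address per cluster.

    Assumed rule: prefer the shortest address, breaking ties alphabetically.
    """
    return min(cluster, key=lambda address: (len(address), address))
```

For example, if `http://a.example/page` and `http://a.example/page?ref=1` carry the same normalized content, they fall into one cluster and the shorter address is selected for indexing. A different selection rule would yield a different canonical document, which is why the choice of clustering process may affect the quality of the search results.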