The World Wide Web (“web”) contains a vast amount of information that is ever-changing. Existing web-based information retrieval systems use web crawlers to identify information on the web. A web crawler is a program that exploits the link-based structure of the web to browse the web in a methodical, automated manner.
A web crawler may start with a list of addresses (e.g., Uniform Resource Locators (URLs)) of links to visit. For each address on the list, the web crawler may visit the document associated with the address. The web crawler may identify outgoing links within the visited document and add addresses associated with these links to the list of addresses.
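The crawl loop described above can be sketched as follows. This is a minimal illustration rather than any particular crawler's implementation: the names `crawl`, `get_outgoing_links`, `TOY_WEB`, and `toy_links` are hypothetical, the breadth-first visiting order and the `seen` set that avoids revisiting an address are assumptions not stated in the text, and a small in-memory link graph stands in for fetching real documents.

```python
from collections import deque

def crawl(seed_addresses, get_outgoing_links, max_documents=100):
    """Visit addresses in list order, appending newly discovered links."""
    # Assumption: duplicate seeds are dropped and each address is visited once.
    to_visit = deque(dict.fromkeys(seed_addresses))
    seen = set(to_visit)
    visited = []
    while to_visit and len(visited) < max_documents:
        address = to_visit.popleft()
        visited.append(address)  # "visit" the document at this address
        for link in get_outgoing_links(address):
            if link not in seen:  # add only addresses not already listed
                seen.add(link)
                to_visit.append(link)
    return visited

# Hypothetical toy web: each address maps to the outgoing links of its document.
TOY_WEB = {
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/a", "http://example.com/c"],
    "http://example.com/c": [],
}

def toy_links(address):
    """Stand-in for fetching a document and extracting its links."""
    return TOY_WEB.get(address, [])

order = crawl(["http://example.com/a"], toy_links)
```

Starting from the single seed address, the sketch reaches all three toy documents by following links alone.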
An indexer creates an index of the documents crawled by the web crawler. A problem that indexers face is how to handle duplicate content on the web. For example, the same document may appear duplicated or substantially duplicated in different forms or at different places on the web.
Another problem that indexers face is high-frequency content changes for the same document. For example, a document may include some content (e.g., a random advertisement or a related links section) that changes frequently, and some content that does not change over time. The document may also be a duplicate of one or more other documents. If the document is crawled at two different points in time, the document may include different advertisements. The indexer may be provided with two versions of the document corresponding to the different crawl times. Due to the changing advertisement, the indexer may not appropriately identify one of the versions of the document as a duplicate of the one or more other documents.
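The effect of a changing advertisement can be illustrated with a small sketch. It assumes, purely for illustration, that duplicates are identified by comparing an exact hash of the full document content; the text does not say how an indexer detects duplicates, and the names `fingerprint`, `STABLE`, `crawl_1`, `crawl_2`, and `duplicate_elsewhere` are hypothetical.

```python
import hashlib

def fingerprint(text):
    """Exact-content fingerprint (an assumed, simplistic duplicate test)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Content that does not change over time.
STABLE = "Article body that does not change over time."

# Another document on the web that duplicates the first crawled version.
duplicate_elsewhere = STABLE + "\n[ad: shoes]"

# The same document crawled at two different times with different ads.
crawl_1 = STABLE + "\n[ad: shoes]"
crawl_2 = STABLE + "\n[ad: travel]"

same_at_first_crawl = fingerprint(crawl_1) == fingerprint(duplicate_elsewhere)
same_at_second_crawl = fingerprint(crawl_2) == fingerprint(duplicate_elsewhere)
```

Under this assumed test, the first crawled version matches the duplicate and the second does not, even though the unchanging portion of the content is identical in both versions.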
Still another problem that indexers face is what may be referred to as “crawl skew.” For example, a document (e.g., a blog page) and its duplicate may include content that continuously grows over time and may cause crawl skew. The document and its duplicate may be crawled at two different points in time (e.g., the document may be crawled after the duplicate and may include new content not included in the duplicate). The indexer may be provided with the document and its duplicate. However, due to the new content, the indexer may not appropriately identify the document and the duplicate as duplicates.
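Crawl skew can be pictured with a toy model of a growing page. The sketch below is illustrative only: the posts, their publication times, and the names `POSTS` and `content_at` are hypothetical, and the document and its duplicate are assumed to serve exactly the same content at any given moment.

```python
# Hypothetical blog posts as (publication time, title) pairs.
POSTS = [(0, "First post"), (2, "Second post"), (5, "Third post")]

def content_at(crawl_time):
    """Content served by the document and its duplicate at crawl_time."""
    return [title for published, title in POSTS if published <= crawl_time]

duplicate_snapshot = content_at(1)  # the duplicate is crawled first
document_snapshot = content_at(3)   # the document is crawled later

# The two snapshots differ only because of when they were crawled.
skewed_match = document_snapshot == duplicate_snapshot
new_content = document_snapshot[len(duplicate_snapshot):]
```

Here the later snapshot contains one post that the earlier snapshot lacks, so a direct comparison of the two snapshots fails although the underlying pages are duplicates.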
It is undesirable for the indexer to index all of the duplicate documents. For example, indexing duplicate documents wastes space in the index. Also, indexing duplicate documents, and thus making the duplicate documents available for serving as search results, leads to an undesirable experience for the user. A user does not want to be presented with multiple documents containing the same, or substantially the same, content.