1. Field of the Invention
Implementations described herein relate generally to information retrieval and, more particularly, to the detection and processing of near-duplicate documents when crawling a network.
2. Description of Related Art
The World Wide Web (“web”) contains a vast amount of information that is ever-changing. Existing web-based information retrieval systems use web crawlers to identify information on the web. A web crawler is a program that exploits the link-based structure of the web to browse the web in a methodical, automated manner.
A web crawler may start with a list of addresses (e.g., URLs) to visit. For each address on the list, the web crawler may visit the document associated with the address. The web crawler may identify outgoing links within the visited document and add the addresses associated with these links to the list of addresses to visit.
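The crawl procedure described above can be illustrated with a minimal sketch. It is offered only as an example under stated assumptions, not as the implementation of any particular crawler: the names `LinkExtractor`, `crawl`, `fetch`, `max_documents`, and `PAGES` are hypothetical, the addresses are placeholders, and an in-memory dictionary stands in for retrieval of documents over a network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(seed_addresses, fetch, max_documents=100):
    """Visit documents in first-in, first-out order, starting from the seeds.

    fetch is a caller-supplied (hypothetical) function that returns the HTML
    of the document at an address, or None if it cannot be retrieved.
    Returns the addresses in the order in which they were visited.
    """
    to_visit = deque(seed_addresses)   # the list of addresses to visit
    seen = set(seed_addresses)         # addresses already placed on the list
    visited = []
    while to_visit and len(visited) < max_documents:
        address = to_visit.popleft()
        html = fetch(address)
        if html is None:
            continue
        visited.append(address)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.hrefs:
            link = urljoin(address, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited


# Hypothetical three-document site held in memory in place of the web.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

order = crawl(["http://example.com/"], PAGES.get)
print(order)
```

In this sketch, the `seen` set keeps an address from being added to the list more than once, which prevents the crawler from looping when documents link back to one another.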
A problem that web crawlers face is how to handle near-duplicate content on the web. For example, the same document may appear duplicated or substantially duplicated in different forms or at different places on the web. There are many sources of near-duplicate documents on the web. One source of near-duplicates includes documents that are “mirrored” at different sites on the web. Mirroring may be used to alleviate potential delays when many users attempt to access the same document at the same time and/or to minimize network latency (e.g., by caching web documents locally).
Another source of near-duplicates includes documents that have different versions with different formatting. For example, a document may have plain text and hypertext markup language (HTML) versions so that users can render or download the content in a form that they prefer. As additional types of devices are used to access the web (e.g., computers, mobile phones, personal digital assistants, etc.), a given document may have even more versions with different formatting (text only, text plus other media, etc.).
Yet another source of near-duplicates includes documents that are prepended or appended with information related to their location on the web, the date, the date they were last modified, a version, a title, a hierarchical classification path (e.g., a document may be classified under more than one class within the hierarchy of a web site), etc. A further source of near-duplicates includes documents that are generated from existing documents using a consistent word replacement. For example, a web site may be “re-branded” for different audiences by using word replacement. Another source of near-duplicates includes documents that aggregate or incorporate content available from other sources on the web. Yet other sources of near-duplicates may exist.
Because a near-duplicate document typically includes near-duplicate links, it may be beneficial for a web crawler to ignore the links in a near-duplicate document so as not to waste computer, storage, and/or network resources.
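The benefit described above can be illustrated with a minimal sketch in which the links of a document are ignored when the document is a near-duplicate of one already processed. The description above does not specify how near-duplicates are detected; the comparison of word shingles by Jaccard similarity used here is an assumption made solely for illustration, as are the names `shingles`, `jaccard`, and `links_to_follow`, the `threshold` and `k` parameters and their values, and the sample documents and addresses.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Return the Jaccard similarity of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def links_to_follow(documents, threshold=0.6, k=3):
    """Collect outgoing links, ignoring links in near-duplicate documents.

    documents is an iterable of (address, text, outgoing_links) tuples.
    A document is treated as a near-duplicate (an assumption of this sketch)
    when its shingle set has Jaccard similarity >= threshold with the shingle
    set of any previously kept document.
    """
    kept = []     # shingle sets of documents that were not near-duplicates
    follow = []
    for address, text, links in documents:
        current = shingles(text, k)
        if any(jaccard(current, previous) >= threshold for previous in kept):
            continue  # near-duplicate: its links are ignored
        kept.append(current)
        follow.extend(links)
    return follow


# Hypothetical documents: the second is a mirror of the first with a
# version note appended; the third is unrelated.
DOCUMENTS = [
    ("http://example.com/a",
     "the quick brown fox jumps over the lazy dog near the river bank",
     ["http://example.com/b"]),
    ("http://mirror.example.org/a",
     "the quick brown fox jumps over the lazy dog near the river bank updated 2006",
     ["http://mirror.example.org/b"]),
    ("http://example.com/c",
     "an unrelated page about crawling the web with a simple program",
     ["http://example.com/d"]),
]

followed = links_to_follow(DOCUMENTS)
print(followed)
```

In this sketch, the links of the mirrored document are never added to the list of addresses to visit, so the crawler spends no further resources on the mirrored site.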