Online resources or content items (e.g., web pages, video files, audio files, documents) available on the Internet can be associated with a document identifier, e.g., a Uniform Resource Locator (URL), that can be used to identify and locate the content item. Because the Internet is organic and heterogeneous, many distinct URLs often point to the same content item. Thus, even though the URLs themselves are different, the data fetched from the distinct URLs can be identical. As a result, a web crawler that follows URLs found in link tags on web pages (e.g., to index, store, and make the content accessible via a search query) may download identical content that is specified by two different URLs. Moreover, some kinds of resources (e.g., video files) require significantly more network resources to download than other online content (e.g., text-based HTML web pages), because these resources are intrinsically high-bandwidth content. Because bandwidth is an expensive and limited resource, it is desirable to avoid downloading the same content more than once.
Therefore, there is a need for a system that automatically identifies and manages document identifiers that reference the same content and thereby reduces the waste of resources both on the search engine side and the web server side.