The World Wide Web is a large, distributed, decentralized collection of documents. Documents (often referred to as “web resources” or “web pages”) can be downloaded from computers called “web servers”; there are tens of millions of web servers serving billions of web pages. Each web page is identified by a uniform resource locator (URL). A URL is of the form http://host:port/path where the host component identifies the web server that serves the document associated with the URL, and the path component provides a name for that document relative to the host. The port component identifies the networking “port” (an Internet abstraction used to multiplex different logical communication channels over the same physical networking device) used by the web server running on the specified host; if the port is omitted, it defaults to 80.
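As an illustration of the URL form described above, the following minimal sketch splits an http URL into its host, port, and path components using Python's standard urllib.parse module. The function name url_components is an assumption introduced for this example only; the fallback to port 80 reflects the default stated above.

```python
from urllib.parse import urlsplit

def url_components(url):
    """Return (host, port, path) for an http URL."""
    parts = urlsplit(url)
    # urlsplit reports port as None when the URL omits it; fall back to 80.
    port = parts.port if parts.port is not None else 80
    return parts.hostname, port, parts.path

print(url_components("http://www.marketwatch.com/news/story.asp"))
print(url_components("http://www.marketwatch.com:8080/news/story.asp"))
```

The first call reports port 80 because the URL omits the port; the second reports the explicitly given port 8080.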
Web crawlers traverse web sites and download all pages referenced by the URLs of the web site. However, many web sites use different URLs to reference the same web page or document, for various reasons. It is quite common for the same document to be identified by several and possibly many URLs. For example, the following sixteen URLs, although all different, all refer to substantially the same web page:
1. http://www.marketwatch.com/news/yhoo/story.asp?source=blq/yhoo&siteid=yhoo&dist=yhoo&guid=%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D
2. http://www.marketwatch.com/news/story.asp?source=blq/yhoo&siteid=yhoo&dist=yhoo&guid=%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D
3. http://www.marketwatch.com/news/yhoo/story.asp?siteid=yhoo&dist=yhoo&guid=%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D
4. http://www.marketwatch.com/news/yhoo/story.asp?source=blq/yhoo&dist=yhoo&guid=%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D
5. http://www.marketwatch.com/news/yhoo/story.asp?source=blq/yhoo&siteid=yhoo&guid=%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D
6. http://www.marketwatch.com/news/yhoo/story.asp?source=blq/yhoo&guid=%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D
7. http://www.marketwatch.com/news/yhoo/story.asp?siteid=yhoo&guid=%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D
8. http://www.marketwatch.com/news/yhoo/story.asp?dist=yhoo&guid=%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D
9. http://www.marketwatch.com/news/story.asp?source=blq/yhoo&siteid=yhoo&guid=%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D
10. http://www.marketwatch.com/news/story.asp?source=blq/yhoo&dist=yhoo&guid=%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D
11. http://www.marketwatch.com/news/story.asp?siteid=yhoo&dist=yhoo&guid=%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D
12. http://www.marketwatch.com/news/yhoo/story.asp?guid=%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D
13. http://www.marketwatch.com/news/story.asp?source=blq/yhoo&guid=%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D
14. http://www.marketwatch.com/news/story.asp?siteid=yhoo&guid=%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D
15. http://www.marketwatch.com/news/story.asp?dist=yhoo&guid=%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D
16. http://www.marketwatch.com/news/story.asp?guid=%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D
A web crawler that treats each distinct URL as a distinct document therefore downloads substantially the same web page many times over. Such superfluous downloads are undesirable because they waste the bandwidth and computational resources of both the web server (operated by the web content provider) and the web crawler (operated by the search engine).
Web crawlers can download only a finite number of documents or web pages in a given amount of time. Therefore, it would be advantageous if a web crawler could identify URL equivalence patterns in multiple different URLs that reference substantially identical pages and download only one document, as opposed to downloading all the substantially identical documents addressed by the multiple different URLs.
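The benefit described above can be sketched under a strong assumption: that an equivalence rule for the site is already known. The rule below is hypothetical and hand-written from the sixteen example URLs (the source, siteid, and dist query parameters and the yhoo path segment are assumed not to affect the page served); it is not a method for discovering such patterns, and the names IGNORED_PARAMS, OPTIONAL_SEGMENTS, canonical_key, and urls_to_download are introduced for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical equivalence rule, inferred by hand from the example URLs.
IGNORED_PARAMS = {"source", "siteid", "dist"}   # assumed irrelevant to content
OPTIONAL_SEGMENTS = {"yhoo"}                    # assumed optional path segment

def canonical_key(url):
    """Map a URL to a key shared by all URLs the assumed rule treats as equivalent."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s not in OPTIONAL_SEGMENTS]
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    # Sort the remaining parameters so their order does not affect the key.
    return urlunsplit((parts.scheme, parts.netloc, "/".join(segments),
                       urlencode(sorted(query)), ""))

def urls_to_download(urls):
    """Keep the first URL seen for each canonical key; skip the rest."""
    first_seen = {}
    for url in urls:
        first_seen.setdefault(canonical_key(url), url)
    return list(first_seen.values())

GUID = "%7B5D426EE8%2DBB62%2D457C%2DA82E%2D05EE3F6F16C8%7D"
BASE = "http://www.marketwatch.com/news"
SAMPLE = [
    BASE + "/yhoo/story.asp?source=blq/yhoo&siteid=yhoo&dist=yhoo&guid=" + GUID,
    BASE + "/story.asp?siteid=yhoo&guid=" + GUID,
    BASE + "/story.asp?guid=" + GUID,
]

# Only one of the three equivalent URLs is selected for download.
print(urls_to_download(SAMPLE))
```

Under this assumed rule, a crawler would fetch one representative URL per key and spend the remaining download budget on pages it has not yet seen. A URL carrying a different guid value produces a different key and is still downloaded.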
In view of the foregoing, there is a need for systems and methods that overcome such deficiencies.