1. Field of the Invention
Implementations described herein relate generally to information retrieval and, more particularly, to detecting hostnames/subtrees that are mirrors of one another on the web.
2. Description of Related Art
The World Wide Web (“web”) contains a vast amount of information. A specific item of content on the web is often accessible at multiple different addresses (e.g., uniform resource locators (URLs)). In some instances, a website has more than one hostname pointing to the same content. For example, the hostnames www.google.com and google.com may both point to the same content. In other instances, multiple names within a host may refer to the same content. For example, www.amazon.com/electronics/apple_ipod.html may refer to the same piece of content as www.amazon.com/products/company/apple/apple_ipod.html. In still other instances, all of the content under one hostname or subtree may be the same as the content under another. For example, all of the content under www.whitehouse.gov/barney may be the same as all of the content under www.barney.gov.
When multiple hostnames refer to the same content (i.e., the multiple hostnames are “mirrors” of one another), problems can arise for search engines that “crawl” and index content associated with those hostnames. If, for example, a search engine does not recognize that two hostnames refer to the same content, the search engine will crawl and index pages from both hostnames. This wastes crawl bandwidth and index space, and places twice the crawl load on the website that has the two hostnames. Multiple hostnames that refer to the same content can also create problems in ranking search results. Under existing ranking techniques, a given web page is ranked more highly among search results if a large number of other pages point to it. Therefore, if two hostnames that refer to the same content are treated separately for the purpose of ranking, the links pointing to that content are split between them, and each hostname may receive only about half the ranking it would receive if the hostnames were ranked together.
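By way of illustration only, the ranking dilution described above may be sketched in Python as follows. The sketch assumes that a raw count of incoming links stands in for a ranking signal, which is a simplification rather than any particular ranking technique. The link data, the mirror table, and the function name inlink_counts are hypothetical and are chosen only to show the effect.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical link targets found on other pages (illustrative data only).
LINK_TARGETS = [
    "http://www.google.com/",
    "http://google.com/",
    "http://www.google.com/",
    "http://google.com/",
]

# Hypothetical mirror table: maps a hostname to one representative hostname.
MIRRORS = {"google.com": "www.google.com"}


def inlink_counts(targets, mirrors=None):
    """Count incoming links per hostname, optionally merging known mirrors."""
    mirrors = mirrors or {}
    counts = Counter()
    for url in targets:
        host = urlsplit(url).hostname
        # Fall back to the hostname itself when no mirror entry exists.
        counts[mirrors.get(host, host)] += 1
    return counts


# Treated separately, each hostname is credited with only part of the links.
separate = inlink_counts(LINK_TARGETS)
# Treated together, the representative hostname is credited with all of them.
merged = inlink_counts(LINK_TARGETS, MIRRORS)
```

In this assumed data set, the four links are split evenly between the two hostnames when the hostnames are counted separately, and are all credited to a single representative hostname when the hostnames are recognized as mirrors.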