The present invention relates generally to a method and apparatus for finding mirrored hosts and, specifically, to a method and apparatus for finding mirrored hosts by analyzing connectivity and naming structures of the host and of the web pages of the host.
In recent years, the World Wide Web ("the web") has grown hugely in popularity and use. Currently, almost any type of information can be found on the web if one knows where to look. Knowing where to look has increasingly become problematic because the number of web sites that make up the web has grown at an astounding rate since the early 1990s. In recent years, increasingly sophisticated software, such as search engines and web browsers, has been developed that allows users of the web to locate information on the web. Other software, such as proxy servers, improves the speed and security of web usage.
A web crawler is a software program that fetches a set of pages from the web by following hyperlinks between the pages. Search engines, such as Compaq Computer Corporation's Alta Vista search engine, employ crawlers to build the web page indexes used by the search engine. Web browsers are applications that fetch and display web pages to a user. Proxy servers (proxies) fetch web pages from web server systems on behalf of web browsers. For efficiency reasons, proxies and browsers sometimes cache web pages (that is, store their content locally). Thus, if a cached page is requested a second time, it can be retrieved from a local cache.
A page on the web is accessed by its web address, also called a Uniform Resource Locator (URL). The URL of a web page has three parts: 1) an access type (such as "http"), 2) a host name, which identifies the host on which the page is stored, and 3) a path, which specifies a location within the host. A web site is made up of one or more web pages. As shown in FIG. 7, the format of a URL of a web page looks like: EQU <access type>://<host>/<path>
where &lt;host&gt; is the name of the web server that stores the web site and &lt;path&gt; is the path of the page within the web server.
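The three-part structure above can be sketched with Python's standard `urllib.parse` module; the example URL below is hypothetical.

```python
# Split a URL into the three parts described above: access type, host, path.
from urllib.parse import urlparse

def split_url(url):
    """Return the (access type, host, path) triple of a URL."""
    parts = urlparse(url)
    return parts.scheme, parts.netloc, parts.path

access_type, host, path = split_url("http://www.example.com/docs/index.html")
# access_type == "http", host == "www.example.com", path == "/docs/index.html"
```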
It has become increasingly common to duplicate all or part of certain popular web sites. For example, download hosts for popular software are often "mirrored" so that users can obtain the same downloadable software from any one of the mirrored hosts. Mirroring is the systematic replication of content across hosts: distinct hosts provide access to copies of the same data. Because mirrored hosts allow users to obtain the same information from any of the mirrored hosts, mirroring helps avoid bottlenecks at popular hosts. Hosts are mirrored for a variety of other reasons as well. Mirrored hosts may have identical page structures, or only certain of their pages and page structures may be identical. In this document, two separate tests are used when determining mirrors. Under a first test, two hosts A and B are "mirrors" if and only if for every document on host A there is a highly similar document on host B with the same path, and vice versa. A second test categorizes pairs of hosts according to a plurality of mirroring categories, where the categories represent degrees of mirroring. Two hosts need not be exactly matched in structure and/or content to be mirrored hosts.
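The first (strict) test can be sketched as follows. This is an illustrative assumption, not the invention's actual procedure: each host is modeled as a dict mapping path to content, and "highly similar" is approximated with `difflib.SequenceMatcher` against an arbitrary 0.9 threshold.

```python
# Hypothetical sketch of the strict mirror test: hosts A and B are mirrors
# iff every path on A has a highly similar document at the same path on B,
# and vice versa. The similarity measure and 0.9 threshold are assumptions.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.9):
    return SequenceMatcher(None, a, b).ratio() >= threshold

def are_mirrors(host_a, host_b):
    if set(host_a) != set(host_b):   # both hosts must serve the same paths
        return False
    return all(similar(host_a[p], host_b[p]) for p in host_a)

a = {"/index.html": "welcome page", "/dl/tool.zip": "binary data"}
b = {"/index.html": "welcome page!", "/dl/tool.zip": "binary data"}
print(are_mirrors(a, b))   # the two hosts match path-for-path
```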
Crawlers, search engines, and proxy servers all fetch large numbers of pages from the web. If these programs could detect mirroring among hosts, they could refrain from fetching content from all but one of the mirrored hosts, thus reducing the number of pages fetched and improving their overall performance. Given a large list of URLs encountered on the web (such as a list collected by a crawler, or the list of URLs viewed by a central proxy of a large Internet Service Provider), it is desirable to be able to determine which hosts are mirrored. Some specific examples are provided below.
Often search engines index only one copy of a mirrored page. In the process, they may fetch replicas and discard them. If mirroring information were available, a search engine could avoid fetching replicas from known mirrored hosts. The search engine could also distribute fetches of the remaining pages between the mirrors for load balancing, or choose the best mirror in terms of response time.
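The load-balancing idea above can be sketched with a simple round-robin policy over a known mirror group; the host names are hypothetical, and a real engine might instead rank mirrors by measured response time.

```python
# Illustrative sketch: once mirroring is known, a search engine can rotate
# fetches across the mirrored hosts instead of sending every request to one.
from itertools import cycle

mirrors = cycle(["dl1.example.com", "dl2.example.com", "dl3.example.com"])

def choose_host():
    """Pick the next mirror in round-robin order."""
    return next(mirrors)

fetch_plan = [choose_host() for _ in range(4)]
# Consecutive fetches alternate through the mirrors, then wrap around.
```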
Proxy servers and web browsers maintain cached copies of downloaded pages to avoid re-fetching. The effectiveness of such caches can be increased if mirroring information is available. When a URL needs to be fetched, the cache is first checked. If a requested page has not yet been fetched, but it is determined that a page from a mirrored host with the same path has been fetched and is available in the cache, the cached mirror page can be used instead of fetching the requested page.
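The cache lookup just described can be sketched as follows, assuming (hypothetically) that the cache is keyed by (host, path) pairs and that the sets of known mirrored hosts are supplied as input.

```python
# Sketch of a mirror-aware cache lookup: if the exact URL is not cached,
# check whether any known mirror of the host has the same path cached.
def lookup(cache, mirror_sets, host, path):
    """Return cached content for (host, path), consulting mirrors on a miss."""
    if (host, path) in cache:
        return cache[(host, path)]
    for group in mirror_sets:            # each group is a set of mirrored hosts
        if host in group:
            for other in group:
                if (other, path) in cache:
                    return cache[(other, path)]
    return None                          # true miss: the page must be fetched

cache = {("m1.example.com", "/pkg/app.tar"): "<archive bytes>"}
mirror_sets = [{"m1.example.com", "m2.example.com"}]
print(lookup(cache, mirror_sets, "m2.example.com", "/pkg/app.tar"))
```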
Thus, the ability to identify mirrored hosts would improve the speed and efficiency of operation of software accessing the world wide web.
Certain conventional web crawling software is able to identify some mirrored web sites by using Domain Name Server (DNS) lookup. When a crawler fetches a URL, it must first convert the hostname of the URL to a corresponding Internet Protocol (IP) address to establish a network connection. Such lookups are done using a service known as DNS. A DNS lookup returns one or more IP addresses for each hostname. Crawlers usually treat hosts that have an IP address in common as mirrors to avoid redundant fetching. This method does not always identify all mirrored hosts and may mis-identify some hosts as mirrored that are not mirrored. For example, a "virtual host" is a host that hosts more than one web site but has a single IP address. The web sites hosted by a virtual host web server, while all having the same IP address, are not necessarily mirrors. Similarly, not all mirrored hosts share a common IP address. In addition, some hosts may have more than one IP address. Thus, IP matching alone is not always sufficient to prove that two hosts are mirrors of each other.
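The IP-matching heuristic and its failure modes can be sketched as follows. A real crawler would resolve names through DNS (e.g. with `socket.gethostbyname_ex`); here the resolver is modeled as a dict of hypothetical hosts so the weakness is visible without network access.

```python
# Sketch of the conventional IP-matching heuristic: treat two hosts as
# mirrors if their resolved IP address sets overlap. All hosts and
# addresses below are hypothetical.
def share_ip(resolver, host_a, host_b):
    """IP-based mirror heuristic: do the two hosts share any IP address?"""
    return bool(set(resolver[host_a]) & set(resolver[host_b]))

resolver = {
    "blog.example.com":  ["10.0.0.7"],   # two unrelated sites served from
    "store.example.com": ["10.0.0.7"],   # one IP address (a virtual host)
    "m1.example.org":    ["10.1.1.1"],   # two true mirrors that happen to
    "m2.example.org":    ["10.2.2.2"],   # live on different IP addresses
}
print(share_ip(resolver, "blog.example.com", "store.example.com"))  # True, yet not mirrors
print(share_ip(resolver, "m1.example.org", "m2.example.org"))       # False, yet mirrors
```

Both answers are wrong in opposite directions, which is why IP matching alone cannot prove or disprove mirroring.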