The present invention relates generally to a method and apparatus for finding mirrored hosts and, specifically, to a method and apparatus for finding mirrored hosts by analyzing connectivity and naming structures of the host and of the web pages of the host.
In recent years, the World Wide Web ("the web") has grown hugely in popularity and use. Currently, almost any type of information can be found on the web if one knows where to look. Knowing where to look has become increasingly problematic because the number of web sites that make up the web has grown at an astounding rate since the early 1990s. In recent years, increasingly sophisticated software, such as search engines and web browsers, has been developed that allows users of the web to locate information on the web. Other software, such as proxy servers, improves the speed and security of web usage.
A web crawler is a software program that fetches a set of pages from the web by following hyperlinks between the pages. Search engines, such as Compaq Computer Corporation's AltaVista search engine, employ crawlers to build the web page indexes used by the search engine. Web browsers are applications that fetch and display web pages to a user. Proxy servers (proxies) fetch web pages from web server systems on behalf of web browsers. For efficiency reasons, proxies and browsers sometimes cache web pages (that is, store their content locally). Thus, if a cached page is requested a second time, it can be retrieved from a local cache.
A page in the web is accessed by its web address, also called a Uniform Resource Locator (URL). The URL of a web page has three parts: 1) an access type (such as "http"), 2) a host name, which identifies the host on which the page is stored, and 3) a path, which specifies a location within the host. A web site is made up of one or more web pages. As shown in FIG. 7, the format of a URL of a web page looks like:
<access type>://<host>/<path>
where <host> is the name of the web server that stores the web site and <path> is the path of the page within the web server.
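By way of illustration only, the three-part URL format above can be sketched with Python's standard `urllib.parse` module. The helper name `split_url` and the example URL are assumptions made for this sketch and do not appear in the described embodiment.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return (access_type, host, path) for a URL (hypothetical helper)."""
    parts = urlsplit(url)
    # parts.hostname lowercases the host name and drops any port number,
    # so hosts can be compared directly as strings.
    return parts.scheme, parts.hostname, parts.path


if __name__ == "__main__":
    print(split_url("http://www.example.com/docs/index.html"))
```

Note that `urlsplit` keeps the leading slash as part of the path, so two pages on different hosts have "the same path" when these path strings are equal.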
It has become increasingly common to duplicate all or part of certain popular web sites. For example, download hosts for certain popular software are often "mirrored" so that users can obtain the same downloadable software from any one of the mirrored hosts. Mirroring is the systematic replication of content across hosts. Mirroring happens when distinct hosts provide access to copies of the same data. Because mirrored hosts allow users to obtain the same information from any of the mirrored hosts, mirroring helps avoid bottlenecks at popular hosts. Hosts are mirrored for a variety of other reasons. Mirrored hosts may have identical page structures or they may contain only certain pages and page structures that are identical. In this document, two separate tests are used when determining mirrors. In a first test, two hosts, A and B, are "mirrors" if and only if for every document on host A there is a highly similar document on B with the same path, and vice versa. A second test categorizes pairs of hosts according to a plurality of mirroring categories, where the categories represent degrees of mirroring. The two hosts do not have to be exactly matched in structure and/or content to be mirrored hosts.
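The first test can be sketched as follows, purely as an illustration. The text does not define "highly similar", so this sketch assumes a character-level similarity ratio from Python's `difflib` and an arbitrary threshold of 0.9. The function names and the dictionary representation of a host (path mapped to document text) are likewise assumptions.

```python
from difflib import SequenceMatcher


def highly_similar(doc_a, doc_b, threshold=0.9):
    """Assumed similarity measure: difflib ratio at or above a threshold."""
    return SequenceMatcher(None, doc_a, doc_b).ratio() >= threshold


def are_mirrors(pages_a, pages_b, threshold=0.9):
    """First test: every path on A exists on B with similar content, and vice versa.

    pages_a and pages_b map a path string to the document text at that path.
    """
    # "And vice versa" means both hosts must expose exactly the same paths.
    if pages_a.keys() != pages_b.keys():
        return False
    return all(
        highly_similar(pages_a[path], pages_b[path], threshold)
        for path in pages_a
    )
```

A practical system would likely use a cheaper similarity measure over sampled pages, since comparing every document on every pair of hosts does not scale.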
Crawlers, search engines, and proxy servers all fetch large numbers of pages on the web. If these programs could detect mirroring in hosts, they could refrain from fetching content from all but one of the mirrored hosts, thus reducing the number of pages fetched and improving their overall performance. Given a large list of URLs encountered on the web (such as a list collected by a crawler or the list of URLs viewed by a central proxy of a large Internet Service Provider), it is desirable to be able to determine which hosts are mirrored. Some specific examples are provided below.
Often search engines index only one copy of a mirrored page. In the process, they may fetch replicas and discard them. If mirroring information were available, a search engine could avoid fetching replicas from known mirrored hosts. The search engine could also distribute fetches of the remaining pages between the mirrors for load balancing, or choose the best mirror in terms of response time.
Proxy servers and web browsers maintain cached copies of downloaded pages to avoid re-fetching. The effectiveness of such caches can be increased if mirroring information is available. When a URL needs to be fetched, the cache is first checked. If a requested page has not yet been fetched, but it is determined that a page from a mirrored host with the same path has been fetched and is available in the cache, the cached mirror page can be used instead of fetching the requested page.
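The cache lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `MirrorAwareCache`, its methods, and the in-memory dictionary are hypothetical, and the groups of mirrored hosts are assumed to have been determined beforehand.

```python
from urllib.parse import urlsplit


class MirrorAwareCache:
    """Hypothetical page cache that also consults known mirrors of a host."""

    def __init__(self, mirror_groups):
        # Map each host to the full set of hosts known to mirror one another.
        self._group = {}
        for group in mirror_groups:
            members = frozenset(group)
            for host in members:
                self._group[host] = members
        self._pages = {}  # (host, path) -> cached content

    def put(self, url, content):
        parts = urlsplit(url)
        self._pages[(parts.hostname, parts.path)] = content

    def get(self, url):
        parts = urlsplit(url)
        host, path = parts.hostname, parts.path
        # Check first for the exact page requested.
        if (host, path) in self._pages:
            return self._pages[(host, path)]
        # Otherwise look for the same path on any known mirror of the host.
        for mirror in self._group.get(host, ()):
            if (mirror, path) in self._pages:
                return self._pages[(mirror, path)]
        return None  # cache miss: the caller must fetch the page
```

For example, after a page from one host is cached, a request for the same path on a mirrored host is served from the cache, while a request to an unrelated host with the same path is a miss.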
Thus, the ability to identify mirrored hosts would improve the speed and efficiency of operation of software accessing the World Wide Web.
Certain conventional web crawling software is able to identify some mirrored web sites by using Domain Name System (DNS) lookups. When a crawler fetches a URL, it first needs to convert the hostname of the URL to a corresponding Internet Protocol (IP) address to establish a network connection. Such lookups are done using a service known as DNS. A DNS lookup returns one or more IP addresses for each hostname. Crawlers usually treat hosts that have an IP address in common as mirrors to avoid redundant fetching. This method does not always identify all mirrored hosts and may misidentify some hosts as mirrored that are not mirrored. For example, a "virtual host" is a host that hosts more than one web site but has a single IP address. The web sites hosted by a virtual host web server, while all having the same IP address, are not necessarily mirrors. Similarly, not all mirrored hosts share a common IP address. In addition, some hosts may have more than one IP address. Thus, IP matching alone is not always sufficient to prove that two hosts are mirrors of each other.
The described embodiment of the present invention addresses not only the problem of finding identical mirror hosts, but also the problem of finding hosts that are not completely identical, but contain a significant amount of shared content. This information is useful in understanding the composition of the web and the collaborations ongoing between principals on the web.
The described embodiment of the invention detects mirrored host pairs using information about a large set of pages, including one or more of: URLs, IP addresses, and connectivity information. The identities of the detected mirrored hosts are then saved so that browsers, crawlers, proxy servers, or the like can correctly identify mirrored web sites. The described embodiments of the present invention use one or a combination of techniques to identify mirrors. A first group of techniques involves determining mirrors based on URLs and information about connectivity (i.e., hyperlinks) between pages. A second group of techniques looks at connectivity information at a higher granularity, considering all links from all pages on a host as one group and ignoring the target of each link beyond the host level.
In accordance with the purpose of the invention, as embodied and broadly described herein, the invention relates to a method of determining mirrored web sites, comprising: receiving information about a plurality of web sites stored on a plurality of hosts; determining a list of host pairs that are potentially mirrored hosts; and analyzing the list of pairs of potential mirrored hosts to determine which of the host pairs are mirrored hosts.
In further accordance with the purpose of the invention, as embodied and broadly described herein, the invention relates to a method of determining mirrored web hosts, comprising: receiving information about the IP addresses of a plurality of web sites stored on a plurality of hosts; determining clusters of hosts, where all web sites in a cluster have the same IP addresses; and determining that the hosts in clusters of hosts having less than or equal to a threshold number of hosts therein are mirrored web hosts.
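The IP-based method above can be sketched as follows, as one possible reading rather than a definitive implementation. The function names, the default threshold of 2, and the exclusion of single-host clusters are assumptions of this sketch. The sketch also assumes the rationale for the threshold is that a large cluster more likely reflects virtual hosting than mirroring.

```python
from collections import defaultdict


def ip_clusters(host_ips):
    """Group hosts whose sets of IP addresses are identical.

    host_ips maps a host name to an iterable of its IP addresses.
    """
    clusters = defaultdict(set)
    for host, ips in host_ips.items():
        clusters[frozenset(ips)].add(host)
    return clusters


def mirrored_by_ip(host_ips, max_cluster_size=2):
    """Return clusters small enough to be treated as mirrored hosts."""
    result = []
    for hosts in ip_clusters(host_ips).values():
        # Assumption: a single-host cluster has nothing to mirror, and a
        # cluster above the threshold is treated as likely virtual hosting.
        if 2 <= len(hosts) <= max_cluster_size:
            result.append(sorted(hosts))
    return sorted(result)
```

With a threshold of 2, two hosts sharing one address are reported as mirrors, while three hosts sharing another address are not.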
In further accordance with the purpose of the invention, as embodied and broadly described herein, the invention relates to a method of determining mirrored web hosts, comprising: receiving information about the addresses of a plurality of web sites stored on a plurality of hosts and about page level connectivity information of the plurality of web sites and a list of potentially mirrored host pairs; and filtering the list of potentially mirrored host pairs in accordance with the page level connectivity information.
In further accordance with the purpose of the invention, as embodied and broadly described herein, the invention relates to a method of determining mirrored web hosts, comprising: receiving information about the addresses of a plurality of web sites stored on a plurality of hosts and about connectivity information of the plurality of web sites; for each host, determining a set of terms for the host, indicating those hosts that are targets of incoming links from some page on the host; for each term, determining the frequency, which equals the number of such incoming links; for each host, selecting the terms with the highest frequency; for each host, weighting the terms; and using term vector matching to determine the likelihood of a pair of hosts being mirrors in accordance with the weighted terms of the pair of hosts.
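The host-level term vector method above can be sketched as follows. The text here does not specify the weighting scheme or the matching function, so this sketch assumes log-scaled frequencies as weights and cosine similarity for the match. The function names, the `top_k` parameter, and the choice to ignore links that stay on the same host are also assumptions.

```python
import math
from collections import Counter


def host_term_vector(links, host, top_k=3):
    """Build a weighted term vector for one host.

    links is an iterable of (source_host, target_host) pairs, one per
    hyperlink. The terms for a host are the hosts its pages link to.
    """
    # Frequency of a term = number of links from this host to that target.
    # Assumption: links that stay on the same host are ignored.
    freq = Counter(t for s, t in links if s == host and t != host)
    # Keep only the highest-frequency terms, then weight them (assumed
    # weighting: 1 + natural log of the frequency).
    return {term: 1.0 + math.log(n) for term, n in freq.most_common(top_k)}


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under these assumptions, two hosts whose pages link to the same external hosts with the same frequencies score close to 1, and hosts with no link targets in common score 0.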
Advantages of the invention will be set forth in part in the description which follows and in part will be obvious from the description or may be learned by practice of the invention. The objects and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims and equivalents.