Through the use of the Internet and the World Wide Web (“the web”), individuals have access to billions of items of information. For example, the web provides access to items such as web pages, documents, images, e-mail messages, instant messaging messages, music, videos, etc., generally and collectively referred to herein as “searchable resources” or simply “resources.” However, a significant drawback with using the web is that, because there is so little organization to the web, at times it can be extremely difficult for users to locate the particular resources that contain the information that is of interest to them. To address this problem, a mechanism known as a “search engine” has been developed to index a large number of resources and to provide an interface that can be used to search the indexed information by entering certain words or phrases to be queried. A search engine is a computer program designed to find resources stored in a computer system, such as the web or such as a user's desktop computer. The search engine's tasks typically include finding and analyzing resources (“crawling”), building a search index that supports efficient retrieval of crawled resources (“indexing”), and processing queries for information by using the search index to retrieve relevant resources (“query processing”).
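The three tasks named above can be illustrated with a minimal sketch. The corpus contents and function names below are hypothetical: crawling is simulated with an in-memory corpus, indexing builds an inverted index (a mapping from each term to the set of resources containing it), and query processing intersects the posting sets for the query terms.

```python
from collections import defaultdict

def crawl():
    """Stand-in for crawling: returns a mapping of resource id -> text.
    A real crawler would fetch and analyze resources over the network."""
    return {
        "doc1": "distributed search engine architecture",
        "doc2": "single site data center",
        "doc3": "search engine query processing",
    }

def build_index(corpus):
    """Indexing: map each term to the set of resources that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def process_query(index, query):
    """Query processing: return resources containing all query terms."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

corpus = crawl()
index = build_index(corpus)
print(sorted(process_query(index, "search engine")))  # -> ['doc1', 'doc3']
```

The inverted index is what makes retrieval efficient: answering a query requires only set intersections over the posting sets, rather than a scan of every resource.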
With the explosive growth of resources available on the web, search engines require substantial computing resources in order to perform the tasks of crawling, indexing, and query processing. Often, largely for economic reasons, a search engine is deployed at a single site, that is, at a single geographic location such as a single data center with hundreds or even thousands of server computers for performing search engine tasks. A data center is a physical facility used to house server computer systems and associated components, such as telecommunications and storage systems. For a single-site search engine, the site of the search engine is often selected based on the costs of land, labor, and services, in particular electricity, associated with establishing a data center at the selected site.
However, single-site search engines suffer drawbacks that result from their singular locality. For example, since resources on the web are dispersed throughout the world, the task of crawling the web for resources from a single site is significantly affected by the geographic distances between the site and the computers containing the resources. This is because, in general, network connection time increases and data transfer rates decrease as the geographic distance between the connection endpoints increases. Thus, more servers are needed by a single-site search engine to perform the same crawling as the geographic distances between the site and the resources increase. Similarly, since queries for information may be sent to a search engine from computers all over the world, more servers at the single-site search engine are needed to handle the same query volume as the geographic distances between the queriers and the site increase.
One solution to improve the performance of a single-site search engine is to distribute the search engine across multiple, geographically dispersed sites so that search sites are closer to the resources they crawl and to the queriers for which they process queries. However, it is difficult to distribute the search engine in a manner that maximizes the number of queries answered locally (that is, without the site receiving a query having to communicate with another site to answer it) while at the same time not sacrificing the quality of results that a single-site search engine would return.
An example of a multi-site search engine architecture is a hub and spoke topology in which at most two connections are needed if a query cannot be answered locally (i.e., one connection from the site receiving the query to the hub site and a second connection from the hub site to a spoke site that can answer the query). A hub and spoke topology suffers from a significant drawback, however: the hub site must be provisioned to handle more traffic than the spoke sites. As a result, a hub and spoke topology can be more costly than the single-site search engine it was designed to replace.
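The two-connection routing described above can be sketched as follows. The site names, per-site index contents, and function names are invented for illustration: a spoke that cannot answer a query locally forwards it to the hub (connection one), and the hub forwards it to a spoke that can answer it (connection two).

```python
# Hypothetical sketch of query routing in a hub and spoke topology.
# Each site holds a (toy) set of terms it can answer locally.
SITES = {
    "hub":      {"terms": {"global"}, "role": "hub"},
    "spoke_eu": {"terms": {"alpha", "beta"}, "role": "spoke"},
    "spoke_us": {"terms": {"gamma"}, "role": "spoke"},
}

def can_answer(site, term):
    return term in SITES[site]["terms"]

def route_query(receiving_site, term):
    """Return the list of inter-site connections used to resolve the query."""
    if can_answer(receiving_site, term):
        return []                            # answered locally: zero connections
    hops = [(receiving_site, "hub")]         # connection 1: receiving site -> hub
    if can_answer("hub", term):
        return hops
    for site, info in SITES.items():
        if info["role"] == "spoke" and can_answer(site, term):
            hops.append(("hub", site))       # connection 2: hub -> answering spoke
            break
    return hops                              # at most two connections in all cases

print(route_query("spoke_eu", "alpha"))   # -> [] (answered locally)
print(route_query("spoke_eu", "gamma"))   # -> [('spoke_eu', 'hub'), ('hub', 'spoke_us')]
```

The sketch also makes the provisioning drawback visible: every non-local query traverses the hub, so the hub's traffic grows with the aggregate non-local query volume of all spokes combined.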