1. Technical Field
This invention generally relates to computer systems, and more specifically relates to the servicing of client requests by data centers.
2. Background Art
The widespread proliferation of computers in our modern society has prompted the development of computer networks that allow computers to communicate with each other. With the introduction of the personal computer (PC), computing became accessible to large numbers of people. Networks for personal computers were developed that allow individual users to communicate with each other. In this manner, a large number of people within a company could communicate with one another through the computers on the network.
One significant computer network that has recently become very popular is the Internet. The Internet grew out of this proliferation of computers and networks, and has evolved into a sophisticated worldwide network of computer system resources commonly known as the “world-wide-web”, or WWW. A user at an individual PC (i.e., workstation) who wishes to access the Internet typically does so using a software application known as a web browser. A web browser makes a connection via the Internet to other computers known as web servers, and receives information from the web servers that is displayed on the user's workstation.
The volume of business conducted via the Internet continues to grow at an exponential rate. Many on-line merchants do such a large volume of business that reliability of their computer systems is critical. Many such systems include data replication between different data centers. A data center includes one or more server computer systems that are responsible for servicing requests from clients. Data replication between different data centers allows a server in a different data center to take over in the event that the primary server fails.
Referring to FIG. 1, a prior art computer system 100 includes a geographical region Geo1 112 that is assigned to a particular client 110. Client 110 typically corresponds to a human user, but could be any suitable client. Client 110 has the ability to query the Dallas data center 130 or the Denver data center 140. In this particular example, the Dallas data center 130 is the primary data center for geographical region Geo1 112 that was assigned to the client 110, and the Denver data center 140 is the backup data center for Geo1 112. When the client 110 needs to access the data center corresponding to Geo1, it sends a domain name request specifying the domain name to the Domain Name Server (DNS) 120, which returns the Internet Protocol (IP) address that corresponds to the domain name. In this specific example, the domain name is geo1.business.com. In response to the domain name request, the DNS returns the IP address of the Dallas data center 130. The client 110 then communicates directly with the Dallas data center 130. Note that the Dallas data center 130 and the Denver data center 140 are bi-directionally replicated, meaning that a change to either is propagated to the other, to keep the two data centers in sync with each other.
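By way of illustration only, the lookup described for FIG. 1 can be sketched as follows. This is a minimal sketch, not an implementation of any actual DNS software: the class DomainNameServer, the function client_lookup, and both IP addresses (taken from the address ranges reserved for documentation) are assumptions introduced here and do not appear in the prior art system described above.

```python
# Illustrative sketch only. All names and addresses below are hypothetical.
DALLAS_IP = "192.0.2.10"     # assumed address for Dallas data center 130
DENVER_IP = "198.51.100.20"  # assumed address for Denver data center 140


class DomainNameServer:
    """Toy stand-in for DNS 120: maps a domain name to a single IP address."""

    def __init__(self):
        self._records = {}

    def set_record(self, name, ip):
        # Create or overwrite the entry for a domain name.
        self._records[name] = ip

    def resolve(self, name):
        # Return the IP address currently registered for the domain name.
        return self._records[name]


def client_lookup(dns, name):
    """Client 110 sends a domain name request and receives an IP address."""
    return dns.resolve(name)


dns = DomainNameServer()
# Dallas is the primary data center for Geo1, so its address is registered.
dns.set_record("geo1.business.com", DALLAS_IP)

address = client_lookup(dns, "geo1.business.com")
print(address)
```

In this sketch the Denver address is defined but never returned, which mirrors the role of the backup data center while the primary is healthy.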
Now we consider what happens when a data center fails, meaning that the data in the data center is unavailable for some reason. Referring to FIG. 2, a prior art method 200 begins when the Dallas data center fails (step 210). In response to the failure, the system administrator of the Dallas data center 130 updates the IP address for the geo1.business.com domain name entry in the DNS 120 to point to the IP address for the Denver data center 140, which is the backup (step 220). In theory, and in looking at method 200 in FIG. 2, such an approach is very easy to implement. However, this prior art approach in method 200 suffers from severe shortcomings. For one thing, a DNS entry is often cached on a client. Thus, a change to a DNS entry will not be seen by the client until the client refreshes its cache. It is not uncommon to have a DNS cache entry refresh time specified in hours. Furthermore, DNS 120 is typically coupled to many other DNS servers, so it takes time for a change in one DNS to be propagated to a different DNS, and then to all the clients that DNS serves. As a result, even if the system administrator catches a failure in the Dallas data center 130 and immediately changes the DNS entry for the corresponding domain name, it can often take hours for this change to propagate through all DNS servers and into all clients that have cached DNS entries. While the DNS entries are being updated, many client requests may fail because the Dallas data center 130 has failed, and the new address for the corresponding domain name geo1.business.com has not been fully propagated to all DNS servers and clients. For a business like Amazon.com that relies so heavily upon its computer systems, the unavailability of a data center can cost the company tens of thousands of dollars per minute, which easily translates to millions of dollars in lost sales annually due to failed computer systems.
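The caching shortcoming of method 200 can be illustrated with a small simulation. This is a hedged sketch under stated assumptions: the classes Dns and CachingClient, the two-hour refresh time, the timestamps, and both IP addresses are hypothetical values chosen for illustration, and time is passed in explicitly rather than read from a clock so that the behavior is repeatable.

```python
# Illustrative sketch only. All names, times, and addresses are hypothetical.
DALLAS_IP = "192.0.2.10"     # assumed address for the primary data center
DENVER_IP = "198.51.100.20"  # assumed address for the backup data center
NAME = "geo1.business.com"


class Dns:
    """Toy DNS holding one IP address per domain name."""

    def __init__(self, records):
        self.records = dict(records)

    def resolve(self, name):
        return self.records[name]


class CachingClient:
    """Client that reuses a cached DNS answer until its refresh time passes."""

    def __init__(self, dns, refresh_seconds):
        self.dns = dns
        self.refresh_seconds = refresh_seconds
        self.cache = {}  # name -> (ip, expiry time in seconds)

    def lookup(self, name, now):
        entry = self.cache.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # cached answer, possibly stale
        ip = self.dns.resolve(name)
        self.cache[name] = (ip, now + self.refresh_seconds)
        return ip


dns = Dns({NAME: DALLAS_IP})
client = CachingClient(dns, refresh_seconds=2 * 3600)  # assumed 2-hour refresh

before_failure = client.lookup(NAME, now=0)    # cached at t = 0

# Step 210: Dallas fails. Step 220: the administrator repoints the DNS entry.
dns.records[NAME] = DENVER_IP

after_update = client.lookup(NAME, now=120)    # still inside the refresh window
after_refresh = client.lookup(NAME, now=7200)  # cache entry has expired

print(before_failure, after_update, after_refresh)
```

Between the DNS update and the cache expiry, the client in this sketch keeps sending requests to the failed Dallas address, which corresponds to the window of failed requests described above. Propagation delay between multiple DNS servers is not modeled here.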
Without a way to easily and quickly switch from one data center to another, computer systems will continue to cost businesses that rely upon these computer systems millions in lost revenue due to the unavailability of a failed data center.