A data center is a facility used to house computer systems and associated components, such as telecommunications and storage systems. A data center generally includes redundant or backup power supplies, redundant data communications connections, environmental controls (e.g., air conditioning, fire suppression), and various security devices. A data center is typically used to support one or more web sites that experience a significant amount of web traffic so that end-user requests may be serviced in a relatively short amount of time.
A data center may host multiple services, some of which are available to clients outside the data center and others of which are available to “clients” within the data center. If multiple data centers are used, then some services provided by one data center may be replicated in other data centers. However, some services may need to run (or be active) in only a single data center, primarily because such services need to write to a single database. For example, a service that handles credit card transactions writes to a database in one data center. The written data is eventually replicated to other data centers. If there were duplicate services in multiple data centers, then the duplicate services could be out of sync. To prevent issues (such as double charging or a service not being able to find a payment in the current data center), all requests for certain services are sent to a “single-master service.” A single-master service is a service that is active in only one data center at a time.
All clients of the service should know where the single-master service is located, regardless of whether the client is in the same data center as the single-master service or in another data center that does not host the single-master service.
For reliability, a copy of a single-master service is hosted in one or more other data centers. Such a copy is referred to as a “slave service” and the one or more instances of the slave service are referred to as “slave instances.” The slave service and the slave instances are considered dormant or inactive until they are triggered to be active. Thus, if a slave instance receives a client request (e.g., from a client in the same data center as the slave instance or in a different data center), then the slave instance is configured to not process the client request and may return an error or decline message.
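The dormant behavior of a slave instance can be illustrated with a minimal sketch. The names used here (`Role`, `ServiceInstance`, `handle`, the data center labels, and the shape of the response) are hypothetical and chosen only for illustration; they are not taken from any particular implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    MASTER = "master"
    SLAVE = "slave"


@dataclass
class ServiceInstance:
    """Hypothetical instance of a single-master service in one data center."""
    data_center: str
    role: Role

    def handle(self, request: str) -> dict:
        # A dormant slave instance does not process the request; it only
        # returns a decline message, regardless of where the client is.
        if self.role is not Role.MASTER:
            return {"status": "declined",
                    "reason": f"instance in {self.data_center} is inactive"}
        return {"status": "ok",
                "result": f"processed {request!r} in {self.data_center}"}


master = ServiceInstance("dc1", Role.MASTER)
slave = ServiceInstance("dc2", Role.SLAVE)

master_reply = master.handle("charge card")
slave_reply = slave.handle("charge card")
```

In this sketch only the master instance performs work; a request that reaches the slave instance is declined and would have to be retried against the data center that currently hosts the master.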
However, switching mastership of a single-master service (so that the master service becomes the new slave service and the old slave service becomes the new master service) may require a significant amount of time and may have a considerable effect on the availability of data (e.g., a web site) provided by the data centers. In one approach, switching mastership is a manual process that involves changing configurations that are to be read by all clients of the single-master service, changing server-side configurations, running commands to notify all clients of the change, and restarting both the current master service and the current slave service. For example, each service may read a configuration file at start up that indicates whether the service is a single-master service and with which databases the service can communicate. Any change to master status requires one or more services to be shut down, one or more configuration files to be modified, and the one or more services to be restarted. Such a process might take a significant amount of time (e.g., one hour), especially if multiple services are involved. Thus, if a single-master service is a payment service, then payments could not be received during that entire time. Additionally, when one or more single-master services are offline, other services, in all data centers, are also impacted, leading to a degraded user experience at best and an inability to serve users at all in the worst case.
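The cost of the manual approach can be seen in a minimal sketch of a service that reads its master status only at start up. The JSON file layout, the `is_master` and `databases` keys, and the `Service` and `manual_switch` names are assumptions made for illustration; the approach described above does not specify a file format.

```python
import json
import tempfile
from pathlib import Path


class Service:
    """Hypothetical service that reads its configuration only at start up."""

    def __init__(self, config_path: Path):
        self.config_path = Path(config_path)
        self.running = False
        self.is_master = False
        self.databases = []

    def start(self) -> None:
        config = json.loads(self.config_path.read_text())
        self.is_master = config["is_master"]
        self.databases = config["databases"]
        self.running = True

    def stop(self) -> None:
        self.running = False


def manual_switch(service: Service, is_master: bool) -> None:
    # Shut down, modify the configuration file, restart. The service is
    # unavailable for the whole duration of these three steps.
    service.stop()
    config = json.loads(service.config_path.read_text())
    config["is_master"] = is_master
    service.config_path.write_text(json.dumps(config))
    service.start()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "payment_service.json"
    path.write_text(json.dumps({"is_master": True, "databases": ["payments-db"]}))

    svc = Service(path)
    svc.start()

    # Editing the file alone has no effect on the running service.
    path.write_text(json.dumps({"is_master": False, "databases": ["payments-db"]}))
    status_before_restart = svc.is_master

    # Only a full stop/modify/restart cycle changes the master status.
    manual_switch(svc, is_master=False)
    status_after_restart = svc.is_master
```

Because the configuration is consulted only in `start`, every change of master status in this sketch implies downtime, and the same cycle would have to be repeated for each affected service and client.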
Another example of a single-master service is a sticky-routing service that ensures that each user/member of a web site is directed to the same data center regardless of which device the user/member uses to access the web site. If the sticky-routing service goes down for even a few minutes, then data integrity issues might arise in addition to problems with user registration and log in.
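One possible way to picture a sticky-routing service is as a lookup of stored member-to-data-center assignments. The sketch below is an assumption made for illustration only: the `StickyRouter` name, the round-robin choice for first-time members, and the in-memory dictionary (standing in for the single database that such a service would write to) are all hypothetical.

```python
from itertools import cycle


class StickyRouter:
    """Hypothetical sticky-routing service backed by one assignment store."""

    def __init__(self, data_centers):
        self._next_data_center = cycle(data_centers)
        # Stand-in for the single database that the service writes to.
        self._assignments = {}

    def route(self, member_id: str, device: str) -> str:
        # The device is deliberately ignored: routing depends only on the
        # member, so every device is directed to the same data center.
        if member_id not in self._assignments:
            self._assignments[member_id] = next(self._next_data_center)
        return self._assignments[member_id]


router = StickyRouter(["dc1", "dc2"])
from_laptop = router.route("member-42", "laptop")
from_phone = router.route("member-42", "phone")
other_member = router.route("member-7", "tablet")
```

If two copies of such a router were active at once with separate assignment stores, the same member could be assigned to different data centers, which is one way the data integrity issues mentioned above could arise.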
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.