High-availability (HA) architectures are computer systems designed to ensure, as far as possible, continuous data and application availability even when application components fail. These systems are typically used for applications in which every moment of downtime carries a high cost. Examples include trading software used by Wall Street investment firms and transportation/logistics tracking used by package delivery companies. Since occasional failures are unavoidable, it is extremely important to reduce the time it takes to recover from a failure in these systems.
The most common failure in HA systems is an individual machine failure, in which one of the machines or components in the system stops working. To protect against such failures, redundant machines or components are commonly used. FIG. 1 illustrates a typical redundant architecture for a database application. A pool of application servers processes requests from clients. If one of the application servers fails, another application server is available to take its place. The application servers, in turn, retrieve and modify data from a database. To ensure that the HA system continues to operate even if a database fails, multiple database server components are organized into an operating-system-level cluster 100. In this case, two database servers 102, 104 are configured as a cluster. The standby database 104 is kept in a running state and, if the primary database 102 fails, automatically steps in for it. The standby database 104 is alerted to a failure in the primary database 102 when the standby database 104 fails to receive a heartbeat signal. The standby database 104 is kept up-to-date by periodic database-level or disk-level replication of the primary database 102.
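The heartbeat-based detection described above can be sketched as follows. This is a minimal illustration only, not the mechanism used by any particular cluster product: the class name `StandbyMonitor`, the `timeout_s` parameter, and the role strings are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class StandbyMonitor:
    """Hypothetical model of a standby that watches the primary's heartbeat."""

    timeout_s: float               # assumed: silence tolerated before failover
    last_heartbeat_s: float = 0.0  # time the most recent heartbeat arrived
    role: str = "standby"

    def record_heartbeat(self, now_s: float) -> None:
        """Note that a heartbeat from the primary arrived at time now_s."""
        self.last_heartbeat_s = now_s

    def check(self, now_s: float) -> str:
        """Promote the standby if the primary has been silent for too long."""
        silent_for = now_s - self.last_heartbeat_s
        if self.role == "standby" and silent_for > self.timeout_s:
            self.role = "primary"  # the standby steps in for the failed primary
        return self.role


monitor = StandbyMonitor(timeout_s=3.0)
monitor.record_heartbeat(10.0)
print(monitor.check(12.0))  # 2.0 s of silence, within the timeout
print(monitor.check(13.5))  # 3.5 s of silence, exceeds the timeout
```

In this sketch the promotion is one-way: once the standby has taken over, a late heartbeat does not demote it. A real cluster would also need to deal with the case in which the primary is still running but its heartbeat is lost.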
The main drawback of these types of architectures, however, is that the time to recover is lengthy. The standby database 104 needs to process the transaction and recovery logs left behind by the primary database 102 before it can start servicing requests. This results in an unacceptably long failover time (typically several minutes).
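A rough model shows why log processing dominates the failover time. The sketch below assumes that failover time is the heartbeat detection delay plus the time needed to replay the logs left behind by the primary. The function name, the parameters, and all of the numbers are hypothetical and chosen only for illustration; they are not measurements of any system.

```python
def estimated_failover_s(detect_timeout_s: float,
                         pending_log_records: int,
                         replay_rate_per_s: float) -> float:
    """Estimate failover time as detection delay plus log replay time.

    All inputs are illustrative assumptions, not measured values.
    """
    if replay_rate_per_s <= 0:
        raise ValueError("replay_rate_per_s must be positive")
    replay_s = pending_log_records / replay_rate_per_s
    return detect_timeout_s + replay_s


# Hypothetical figures: 5 s to notice the missing heartbeat, 600,000 log
# records to process, replayed at 2,500 records per second.
print(estimated_failover_s(5.0, 600_000, 2_500.0))  # 245.0 seconds
```

Under these assumed figures the detection delay is a small fraction of the total, and nearly all of the four-minute failover is spent processing logs before the standby database 104 can service requests.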
What is needed is a solution that reduces failover time to an acceptable level.