Networked computer systems enable users to share resources and services. One computer can request and use resources or services provided by another computer. The computer requesting and using the resources or services provided by another computer is typically known as a client, and the computer providing resources or services to another computer is known as a server.
A group of independent network servers may be used to form a cluster. Servers in a cluster are organized so that they operate and appear to clients as if they were a single unit. A cluster and its network may be designed to improve network capacity by, among other things, enabling the servers within a cluster to shift work in order to balance the load. By enabling one server to take over for another, a cluster may be used to enhance stability and minimize downtime caused by an application or system failure.
Today, networked computer systems, including clusters, are used in many different aspects of our daily lives. They are used, for example, in business, government, education, entertainment, and communication. As networked computer systems and clusters become more prevalent and our reliance on them increases, it has become increasingly important to achieve the goal of continuous availability of these “high-availability” systems.
High-availability systems need to detect and recover from a failure in a way that is transparent to their users. For example, if a server in a high-availability system fails, the system should detect and recover from the failure with little or no impact on clients.
Various methods have been devised to achieve high availability in networked computer systems, including clusters. For example, one method known as triple modular redundancy, or “TMR,” is used to increase fault tolerance at the hardware level. Specifically, with TMR, three instances of the same hardware module execute concurrently, and their results are compared. The majority result is used, and a module whose result disagrees with the majority is identified as having failed. However, TMR does not detect and recover from a failure of software modules. Another method for achieving high availability is software replication, in which a software module that provides a service to a client is replicated on at least two different nodes in the system. While software replication overcomes some disadvantages of TMR, software replication suffers from its own problems, including the need for complex software protocols to ensure that all of the replicas have the same state.
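The majority comparison used in TMR can be illustrated with a minimal software sketch. This is only an illustration: real TMR voters are implemented in hardware, and the function name `majority_vote` and its return shape are assumptions made for this example, not part of any described system.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def majority_vote(results: Sequence[Hashable]) -> Tuple[Hashable, List[int]]:
    """Compare the outputs of three redundant modules.

    Returns the majority value and the indices of any modules whose
    output disagrees with it. (Hypothetical helper for illustration.)
    """
    if len(results) != 3:
        raise ValueError("TMR compares exactly three module outputs")

    # Find the most common output and how many modules produced it.
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        # All three outputs differ, so no majority exists and the
        # failure cannot be masked by voting alone.
        raise RuntimeError("no majority among module outputs")

    # Any module that disagrees with the majority is flagged as failed.
    faulty = [i for i, r in enumerate(results) if r != value]
    return value, faulty
```

For example, `majority_vote([42, 42, 7])` yields the value `42` and flags module index `2`. The sketch also shows the limitation of voting: if two modules fail in the same way, the wrong value wins the vote.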
Replication of hardware or software modules to achieve high availability raises a number of new problems, including the management of the replicated modules. Managing replicas becomes increasingly difficult and complex, especially if replication is done at the level of individual software and hardware modules. Further, replication places a significant burden on system resources.
When replication is used to achieve high availability, one needs to manage redundant components and be able to reassign work from failing components to healthy ones. However, telling a primary component to restart or a secondary component to take over is not sufficient to ensure continuity of services. To achieve a seamless fail-over, the successor needs to resume operations where the failing component stopped functioning. As a result, secondary components need to know the last stable state of the primary component.
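The requirement that a successor resume from the last stable state of the primary can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: the names `CheckpointStore`, `Component`, and `fail_over` are hypothetical, and the in-memory store stands in for whatever shared or replicated storage a real system would use.

```python
import copy
from typing import Any, Dict, Optional


class CheckpointStore:
    """Hypothetical stand-in for storage that survives a component failure."""

    def __init__(self) -> None:
        self._state: Optional[Dict[str, Any]] = None

    def save(self, state: Dict[str, Any]) -> None:
        # Store a copy so later changes in the component do not leak in.
        self._state = copy.deepcopy(state)

    def load(self) -> Optional[Dict[str, Any]]:
        return copy.deepcopy(self._state)


class Component:
    """A service component that checkpoints its state after each request."""

    def __init__(self, store: CheckpointStore,
                 state: Optional[Dict[str, Any]] = None) -> None:
        self.store = store
        # Start fresh if no prior state is supplied.
        self.state = state or {"processed": 0, "last": None}

    def handle(self, request: str) -> None:
        self.state["processed"] += 1
        self.state["last"] = request
        # Record the last stable state so a successor can resume from it.
        self.store.save(self.state)


def fail_over(store: CheckpointStore) -> Component:
    """Create a secondary that resumes from the primary's last checkpoint."""
    return Component(store, store.load())


# Usage: the primary handles two requests and then is presumed to fail.
store = CheckpointStore()
primary = Component(store)
primary.handle("a")
primary.handle("b")
secondary = fail_over(store)  # resumes with processed == 2, last == "b"
```

Checkpointing after every request keeps the successor fully up to date but adds overhead to each operation, which is the kind of performance cost a practical design has to weigh.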
What is needed is a way to quickly recover from failure of one or more nodes, applications, and/or communication links in a distributed computing environment, such as a cluster. Preferably, an application that was running on the failed node can be restarted in the state that the application had before the node failed. These capabilities should have little or no effect on performance of applications.