The present invention relates generally to fault-tolerant computer systems. More specifically, the present invention includes a method and apparatus that allows complex applications to rapidly recover in the event of hardware or software failures.
Reliability is an important aspect of all computer systems. For some applications, reliable computer operation is absolutely crucial. Telephone switching systems and paging systems are good examples of systems where reliable computer operation is paramount. These systems typically operate on a continuous, or near-continuous, basis. Failures, even for short time periods, may result in a number of undesirable consequences, including lost or reduced service and customer inconvenience, with great losses in revenue.
Fault-tolerant computer systems are computer systems that are designed to provide highly reliable operation. One way of achieving fault-tolerance is through the use of redundancy. Typically, this means that a backup computer system takes over whenever a primary computer system fails. Once a backup computer system has assumed the identity of a failed primary computer system, applications may be restarted and service restored.
The use of redundancy is an effective method for achieving fault-tolerant computer operation. Unfortunately, most redundant computer systems experience considerable delay during the failover process. This delay is attributable to the time required to perform the failover and the time required to restart the applications that have been terminated due to a system or software failure. In cases where complex applications are involved, this delay may amount to minutes or even hours. In many cases, delays of this length are not acceptable.
Process-pairing is an effective method for quickly restoring service that was interrupted by a system failure. For a typical process-pair implementation, a process is replicated between two computer systems. One of the processes, the primary process (running on one of the computer systems), provides service, while the other, the backup process (running on the other computer system), is in a standby mode. At periodic times, the states of the primary and backup processes are synchronized, or checkpointed. This allows the backup process to quickly restore the service that was provided by the primary process in the event of a failure of the primary process or of the computer system where it was running.
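The checkpointing idea described above can be illustrated with a minimal sketch. It assumes both processes live in one Python program and that a checkpoint is a simple copy of a dictionary; the class name PairedProcess, its methods, and the key calls_routed are hypothetical and are not taken from the described embodiment.

```python
import copy


class PairedProcess:
    """One half of a process pair (hypothetical illustration)."""

    def __init__(self, role):
        self.role = role        # "primary" or "backup"
        self.state = {}         # application state held by this process
        self.checkpoint = {}    # last state received from the peer

    def handle_request(self, key, value):
        # Only the primary process provides service.
        if self.role != "primary":
            raise RuntimeError("backup process does not provide service")
        self.state[key] = value

    def send_checkpoint(self, peer):
        # Copy the current state to the peer. A real system would send this
        # over a link between the two computer systems (assumption).
        peer.checkpoint = copy.deepcopy(self.state)

    def take_over(self):
        # The backup adopts the last checkpointed state and becomes primary.
        self.state = copy.deepcopy(self.checkpoint)
        self.role = "primary"


primary = PairedProcess("primary")
backup = PairedProcess("backup")

primary.handle_request("calls_routed", 41)
primary.send_checkpoint(backup)
primary.handle_request("calls_routed", 42)  # change made after the checkpoint

backup.take_over()  # simulate a failure of the primary at this point
print(backup.role, backup.state)
```

In this sketch the backup resumes from the value 41, because the later update was never checkpointed. The interval between checkpoints therefore bounds how much work can be lost at failover.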
Process-pairing greatly reduces delays associated with restarting terminated processes. Unfortunately, many complex applications are designed as groups of separate processes. As a result, configuring complex applications to provide process-pair protection may be a difficult task. This difficulty results partially from the need to provide backup processes for each of the processes included in an application. The interdependence of the various processes included in complex applications also contributes to the overall difficulty of providing process-pair protection.
Based on the preceding discussion, it may be appreciated that there is a need for systems that provide process-pair operation for complex applications. Preferably, such systems would minimize the amount of specialized design and implementation required for process-pair operation. This is especially important for legacy applications where large-scale modifications may be difficult or impractical.
The present invention provides a method and apparatus for providing process-pair protection to complex applications. A representative environment for the present invention includes two computer systems connected within a computer network or computer cluster, each one executing an instance of a protected application. One application instance is the primary application, and the other is the backup application. The primary application provides service, while the backup application does not. The backup application, however, is initialized and ready to take over in case of a failure of the primary application or of the computer system where it is running.
Each application instance is managed by an instance of a process called the Process-Pairs Manager (PPM). For convenience, these instances are referred to as the primary PPM and the backup PPM. Each PPM includes an Application State Model (ASM), an Interapplication Communication module (IAC), an Application Administration module (AAD) and a Main module.
Each PPM uses its IAC to communicate with the other PPM. This allows each PPM to monitor the state of the application managed by the other PPM. Each PPM also uses its IAC to monitor the health of the computer system (primary or backup) that hosts the other PPM and its protected application instance. By monitoring application state and system health, each PPM determines when the remote application instance is no longer operable. When the primary application instance stops providing service, the PPM managing the backup application instance detects this and begins failover processing. Failover is the operation through which the PPM managing the backup application instance takes steps to drive its managed application instance to primary state.
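The monitoring and failover decision described above might be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the IAC detects an inoperable peer, so a timeout on periodic state reports is assumed here, and the names RemoteMonitor, record_report, remote_is_serving and should_begin_failover are hypothetical.

```python
class RemoteMonitor:
    """Tracks the most recent state report from the other PPM (hypothetical)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_heard = None     # time of the most recent report
        self.remote_state = None   # state carried by that report

    def record_report(self, now, remote_state):
        self.last_heard = now
        self.remote_state = remote_state

    def remote_is_serving(self, now):
        # No report has ever arrived: this sketch treats that as not serving.
        if self.last_heard is None:
            return False
        # Silence longer than the timeout is treated as a system failure.
        if now - self.last_heard > self.timeout_seconds:
            return False
        # Only an instance in state primary provides service.
        return self.remote_state == "primary"


def should_begin_failover(local_state, monitor, now):
    # Only the PPM managing the backup instance begins failover processing.
    return local_state == "backup" and not monitor.remote_is_serving(now)


monitor = RemoteMonitor(timeout_seconds=3.0)
monitor.record_report(now=100.0, remote_state="primary")
print(should_begin_failover("backup", monitor, now=101.0))  # report is fresh
print(should_begin_failover("backup", monitor, now=105.0))  # report is stale
```

Timestamps are passed in explicitly rather than read from a clock so that the decision logic can be exercised deterministically.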
Each PPM uses its AAD to manage the details of the application for which the PPM is responsible (i.e., the application for which the PPM provides process-pair protection). The internal details of a managed application (such as its startup and shutdown programs, maximum time interval values for state transitions, as well as resources associated with the application) are described in a configuration file. The AAD that manages a particular application reads the configuration file at PPM startup time to obtain this information.
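As an illustration of the kind of information such a configuration file carries, the sketch below parses a small example. The text does not specify the file format, so JSON is assumed here purely for convenience; every field name, path, resource name and time value shown is hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ApplicationConfig:
    """Details of a managed application (field names are hypothetical)."""
    startup_program: str
    shutdown_program: str
    max_transition_seconds: dict = field(default_factory=dict)
    resources: list = field(default_factory=list)


def load_config(text):
    # Parse the configuration text and keep only the fields the sketch uses.
    raw = json.loads(text)
    return ApplicationConfig(
        startup_program=raw["startup_program"],
        shutdown_program=raw["shutdown_program"],
        max_transition_seconds=dict(raw.get("max_transition_seconds", {})),
        resources=list(raw.get("resources", [])),
    )


# Hypothetical configuration content; none of these values come from the text.
EXAMPLE_CONFIG = """
{
  "startup_program": "/opt/app/bin/start",
  "shutdown_program": "/opt/app/bin/stop",
  "max_transition_seconds": {"init_to_configured": 60},
  "resources": ["shared_disk", "service_address"]
}
"""

config = load_config(EXAMPLE_CONFIG)
print(config.startup_program, config.resources)
```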
Each PPM uses its ASM to define a set of states. For the described embodiment, two main states, enabled and disabled, are defined. The main states are themselves decomposed into finer granularity states. The main state enabled includes the init (application initialization state), configured, primary, backup and maintenance states. The main state disabled includes a down, a degraded and a failed state. The ASM also defines a set of conditions that trigger transitions between states. Given a state, if a certain set of conditions becomes valid, a transition to another specific state occurs. Each transition may have one or more actions associated with it. Actions are steps or procedures that are invoked by the ASM in response to a transition between states.
The ASM operates as a finite state machine. This means that the ASM begins operation by assuming a well-defined initial state. The initial state is determined by information provided by the PPM state file and can be either state down or state init. The ASM monitors various conditions, such as operator commands, application state and system health (the last two being monitored via the IAC). When a change in such conditions triggers a transition that is defined for the current state, the ASM changes its current state to the next defined state. As part of this transition, the ASM invokes any action associated with the transition from current state to the next state. These actions affect the application instance protected by the PPM by managing resources and commanding the application to change state. After each state transition the PPM checkpoints its new internal state.
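The finite state machine behavior described above can be sketched as a transition table keyed by the current state and the triggering condition. The state names come from the description; the condition names, the class ApplicationStateModel, and the use of an in-memory list in place of a persistent checkpoint are assumptions made for illustration only.

```python
class ApplicationStateModel:
    """Table-driven state machine (hypothetical illustration)."""

    def __init__(self, initial_state, transitions):
        # The description allows only down or init as the initial state.
        if initial_state not in ("down", "init"):
            raise ValueError("initial state must be 'down' or 'init'")
        self.state = initial_state
        # Maps (current_state, condition) to (next_state, action or None).
        self.transitions = transitions
        self.checkpoints = []  # stands in for the persistent PPM checkpoint

    def on_condition(self, condition):
        key = (self.state, condition)
        if key not in self.transitions:
            return False  # no transition defined for the current state
        next_state, action = self.transitions[key]
        if action is not None:
            action()  # invoke the action associated with this transition
        self.state = next_state
        self.checkpoints.append(next_state)  # checkpoint after each transition
        return True


log = []
transitions = {
    ("init", "all_processes_registered"): ("configured", None),
    ("configured", "configured_as_backup"):
        ("backup", lambda: log.append("command: become backup")),
    ("backup", "remote_primary_lost"):
        ("primary", lambda: log.append("command: become primary")),
}

asm = ApplicationStateModel("init", transitions)
asm.on_condition("remote_primary_lost")       # ignored: not defined for init
asm.on_condition("all_processes_registered")
asm.on_condition("configured_as_backup")
asm.on_condition("remote_primary_lost")
print(asm.state, asm.checkpoints, log)
```

Keeping the transitions in a table makes it easy to see that a condition has an effect only when a transition is defined for the current state.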
At PPM startup, the AAD reads the application configuration file to determine how to start up the application that is to be given process-pair protection (i.e., the PPM determines which processes need to be started, etc.), and to acquire specific information that guides the management of the application. Assuming that the initial state is init, the PPM then starts the processes required by the application being given process-pair protection. Once the processes have been started, the PPM checkpoints its internal data structures.
Each started process registers itself with the PPM through a registration message. During process registration the PPM connects to the other PPM that is running concurrently on the other computer system. When all processes have registered with the PPM the ASM transitions to state configured. Until this point the two PPMs running on the two systems behave exactly the same.
When state configured is reached, each of the two PPMs determines the next state of its managed application instance. The application configuration file contains information that determines which PPM will drive its protected application instance to primary state, and which will drive its protected application instance to backup state. After this determination, the ASMs of both PPMs change states. The ASM of the PPM that is supposed to be primary transitions to state primary. This causes the PPM to send a message to each application process commanding it to become primary. The ASM of the PPM that is supposed to be backup transitions to the backup state. This causes the PPM to send a message to each application process commanding it to become backup.
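The startup sequence described above (process registration, the transition to state configured, and the role determined from the configuration file) might be sketched as follows. The class StartupCoordinator, the process names, and the message strings are hypothetical, and messages are collected in a list rather than actually sent.

```python
class StartupCoordinator:
    """Registration and role assignment for one PPM (hypothetical)."""

    def __init__(self, expected_processes, configured_role):
        if configured_role not in ("primary", "backup"):
            raise ValueError("configured role must be 'primary' or 'backup'")
        self.expected = set(expected_processes)
        self.registered = set()
        self.configured_role = configured_role  # taken from the config file
        self.state = "init"
        self.sent = []  # (process name, command) pairs, in place of messages

    def register(self, process_name):
        if process_name not in self.expected:
            raise ValueError("unexpected process: " + process_name)
        self.registered.add(process_name)
        # When all processes have registered, move to state configured.
        if self.state == "init" and self.registered == self.expected:
            self.state = "configured"

    def assume_role(self):
        if self.state != "configured":
            raise RuntimeError("not all processes have registered")
        self.state = self.configured_role
        # Command every application process to take on the same role.
        for name in sorted(self.expected):
            self.sent.append((name, "become " + self.configured_role))


ppm = StartupCoordinator(["billing", "router"], configured_role="backup")
ppm.register("router")
print(ppm.state)  # still init: one process has not registered
ppm.register("billing")
ppm.assume_role()
print(ppm.state, ppm.sent)
```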
After startup, the primary and the backup application instances (each running on a distinct computer system) operate as a pair. The primary application processes, as they provide service, periodically checkpoint their state to the computer system where the backup application is running. Conditions such as an operator command, a failure of the primary application, or a failure of the computer system where the primary application runs, cause a failover to occur. This allows the backup application to replace the primary application as the service provider. Failover is accomplished rapidly. The backup application, which is already initialized, becomes primary by acquiring the necessary state information that was checkpointed by the primary application and continuing processing from the point where the failed primary application was interrupted. In this way, the present invention provides a method and apparatus that provides process-pair protection to complex applications. This allows a complex application to function in a fault-tolerant fashion, which minimizes the delays associated with system failure and recovery.
The maintenance state has the purpose of allowing operators to perform tests on a new version of the application. A newly installed version of the application, running as a backup application instance, is driven to state maintenance by an operator command. This state change does not interfere with the operation of the primary application. After test completion, the application is driven to state backup by another operator command. During state maintenance the application cannot become primary. A failure of the primary application, or of the computer system where it runs, when the other application instance is in state maintenance, causes service interruption because failover cannot occur.
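The effect of state maintenance on failover can be shown with a short sketch. The function names and the operator command strings enter_maintenance and exit_maintenance are hypothetical; only the state names and the rule that an instance in maintenance cannot become primary come from the description.

```python
def apply_operator_command(local_state, command):
    # Operator commands move a backup instance into and out of maintenance.
    if command == "enter_maintenance" and local_state == "backup":
        return "maintenance"
    if command == "exit_maintenance" and local_state == "maintenance":
        return "backup"
    return local_state  # command does not apply in the current state


def react_to_primary_failure(local_state):
    # Only an instance in state backup can be driven to primary.
    if local_state == "backup":
        return "primary"
    return local_state  # in maintenance, failover cannot occur


state = apply_operator_command("backup", "enter_maintenance")
print(react_to_primary_failure(state))  # remains in maintenance
state = apply_operator_command(state, "exit_maintenance")
print(react_to_primary_failure(state))  # failover is possible again
```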
Advantages of the invention will be set forth, in part, in the description that follows and, in part, will be understood by those skilled in the art from the description herein. The advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims and equivalents.