1. Technical Field of the Present Invention
The present invention relates generally to systems having redundant elements for the purpose of replacing a failed element with a functional element. More specifically, the present invention relates to a class of solutions known as failover systems. These systems are targeted at ensuring the continued operation of a system when an element in the system fails. In failover systems, each element will normally have a redundant element to allow for such replacement to take place.
2. Description of the Related Art
There will now be provided a discussion of various topics to provide a proper foundation for understanding the present invention.
In modern computer data processing, separating an application program into cooperating portions and running each portion on a different processing device within a computer network improves the execution efficiency of the application program. The cooperating portions of the application program are each run as a detached process on a specific processing device. The cooperating portions may be active in a serial fashion (i.e., one at a time) or they can all be active at the same time as cooperating portions of an overall data processing operation. In addition, multiple independent programs can be running on a multi-processing unit, each consuming a share of the available system resources. Any such independent program or independent sub-program is referred to as a process.
For reliable execution, each processing device running a process must function properly for the entire duration of that process. If a process fails because its processing device fails, or is otherwise unable to complete, it is imperative that a failure notification be issued so that a system manager can implement appropriate corrective actions. Moreover, it is desirable that a degree of automation and redundancy be available to allow for automatic recovery in case of failure.
Failover systems enable failure detection and perform corrective actions, if possible. Referring to FIG. 1, an example of an active-inactive failover system is illustrated. An active-active failover system operates similarly, except that both nodes perform tasks and each monitors the other for operational functionality. Failover system 100 comprises two processing devices, active node 110 and inactive node 120. Active node 110 and inactive node 120 are connected through a communication link 130. The communication link can be hardwired or can be a wireless link. Processes are executed on active node 110, while inactive node 120 is essentially dormant as far as execution of processes is concerned. However, inactive node 120 monitors the processes on active node 110. If inactive node 120 detects a problem in active node 110, a failover mechanism will be initiated, as follows:
    1. Active node 110 is instructed to shut down all its activities;
    2. Inactive node 120 becomes the new active node and restarts or resumes all activities;
    3. If possible, the former active node (node 110) becomes an inactive node of the system; otherwise, a failure notification is issued.
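The three steps above can be sketched in code. The following is a minimal illustration only, not an implementation taken from any particular failover product; the `Node` class, its `responsive` and `recoverable` flags, and the `notify` callback are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical model of one node in a two-node failover system."""
    name: str
    active: bool
    responsive: bool = True    # assumed result of the standby's health check
    recoverable: bool = True   # assumed: can the node rejoin as a standby?
    processes: List[str] = field(default_factory=list)


def failover(active: Node, standby: Node, notify: Callable[[str], None]) -> Node:
    """Apply the three failover steps and return the new active node."""
    # Step 1: the failing active node shuts down all its activities.
    moved = list(active.processes)
    active.processes.clear()
    active.active = False

    # Step 2: the standby becomes the new active node and restarts them.
    standby.active = True
    standby.processes.extend(moved)

    # Step 3: the former active node stays in the system as the inactive
    # node if possible; otherwise a failure notification is issued.
    if not active.recoverable:
        notify(f"{active.name} failed and cannot serve as standby")
    return standby


def monitor_once(active: Node, standby: Node,
                 notify: Callable[[str], None]) -> Node:
    """One monitoring pass by the inactive node; returns the active node."""
    if active.responsive:
        return active
    return failover(active, standby, notify)
```

Note that every process on the active node is moved in step 1, regardless of which one caused the problem.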
Typically, failover systems are used for devices such as network systems, central processing units and storage systems. For example, a failover system for a network system consists of two nodes: one node functions as the active provider of Internet related services (web services, file transfer services, etc.) to the public client network, and the other node (the inactive node) monitors those services and operates as a standby system. When any service on the active node becomes unresponsive, the inactive node becomes the active node and replaces the failing, previously active node. Such a failover system can be implemented using virtual Internet protocol (IP) addresses. A node can be accessed through its virtual IP address or by its regular host address. In an active-active implementation, both nodes perform their tasks and each monitors the other. Upon detection of any kind of failure of a node, the other node shuts down the unresponsive node and re-initiates, on itself, the activities of the presumably failed node.
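The role of virtual IP addresses in such a takeover can be illustrated with a small sketch. It assumes a simple table that maps each virtual IP address to the regular host address of the node currently serving it; the `VirtualIpTable` class and its methods are hypothetical, and the addresses are taken from the range reserved for documentation. A real system would reassign the address at the network level rather than in a dictionary.

```python
from typing import Dict, List


class VirtualIpTable:
    """Hypothetical mapping from virtual IP address to owning host address."""

    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}

    def assign(self, virtual_ip: str, host_address: str) -> None:
        """Bind a virtual IP address to the node that serves it."""
        self._owner[virtual_ip] = host_address

    def resolve(self, virtual_ip: str) -> str:
        """Return the host address currently behind a virtual IP address."""
        return self._owner[virtual_ip]

    def fail_over(self, failed_host: str, standby_host: str) -> List[str]:
        """Move every virtual IP owned by the failed host to the standby."""
        moved = [vip for vip, host in self._owner.items()
                 if host == failed_host]
        for vip in moved:
            self._owner[vip] = standby_host
        return moved


# Clients keep using the virtual address; only its owner changes.
table = VirtualIpTable()
table.assign("192.0.2.100", "192.0.2.1")   # active node's host address
table.fail_over("192.0.2.1", "192.0.2.2")  # standby takes over
```

Clients that address the service by its virtual IP address need not be aware that a different node now answers.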
A general disadvantage of these systems is the necessity to shut down the active node and transfer all activities to the inactive node. A complete shutdown of the active node, however, is not always necessary. It would therefore be advantageous if a failover system, upon detection of a failure such as an unresponsive process or processes, could terminate only the failed parts and initiate them on the inactive node, rather than terminating all applications running on the processing device or the entire process.
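The difference between a complete shutdown and the selective behavior described above can be made concrete with a short sketch. It is an illustration under stated assumptions only: the function name `partial_failover` and the representation of a node as a mapping from process name to a responsiveness flag are hypothetical, and the sketch does not describe how any particular system detects an unresponsive process.

```python
from typing import Dict, List, Set, Tuple


def partial_failover(
    active: Dict[str, bool], standby: Set[str]
) -> Tuple[Dict[str, bool], Set[str], List[str]]:
    """Move only the unresponsive processes to the standby node.

    `active` maps each process name to True if it is responsive.
    Returns the processes left on the active node, the processes now
    on the standby node, and the names that were moved.
    """
    # Only the unresponsive processes are terminated on the active node.
    moved = sorted(name for name, ok in active.items() if not ok)
    # Responsive processes keep running where they are.
    remaining = {name: ok for name, ok in active.items() if ok}
    # The terminated processes are initiated on the standby node.
    return remaining, standby | set(moved), moved
```

With three processes of which only one is unresponsive, the other two remain on the original node, whereas a complete shutdown would have moved all three.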