1. Field of the Invention
The present invention relates to the field of fault-tolerant computer systems. More specifically, the present invention relates to the problem of reconnecting partner software processes that share an interface when any of these software processes may independently fail over to a redundant backup copy of the software process. The present invention solves this problem in a manner that is independent of the capabilities of the hardware and operating system used and ensures that the partner software processes do not have to be aware of the redundancy scheme used by their partners.
2. Description of Prior Art
Fault-tolerant computer systems use a variety of techniques to provide highly-available systems for use in safety-critical or mission-critical environments. Many different approaches have been taken by different organizations to achieve fault-tolerance.
One approach to fault-tolerance is to use specialized hardware and operating systems to mirror all inputs to a number of redundant processing units. Outputs from the system are taken from just one processing unit, called the primary, until it is determined to have failed and another processing unit is selected as the primary. Another approach is to take a majority vote for the correct output and to disable any processing unit that disagrees with this output, on the assumption that it has failed. For further details of this approach to fault-tolerance, see the following U.S. Pat. Nos. 5,271,013, Gleeson; 5,363,503, Gleeson; 5,560,033, Doherty et al.; and 5,802,265, Bressoud et al.
An alternative approach is to provide fault-tolerance in the software process layer, which avoids the need for specialized hardware or operating system support. This approach is also more easily deployed on a cluster of heterogeneous processing units with different hardware characteristics, since it does not rely on specific attributes of the hardware. Software fault-tolerance, as this approach is commonly called, typically uses a combination of redundant backup software processes and replication of internal state between the primary and backup copies of each software process to speed recovery from any software or hardware faults. However, many practical fault-tolerant systems combine both hardware and software fault tolerance techniques. For further details of the general techniques used to achieve software fault-tolerance, see the following U.S. Pat. Nos. 5,129,080, Smith; and 5,748,882, Huang. See also the following publications: Hardware and Software Architectures for Fault Tolerance, Chapter 3, ed. Banatre et al., Springer-Verlag 1994; Fault Tolerance in Distributed Systems, Chapter 5, Jalote, Prentice Hall 1994; and Fault-Tolerant Computer System Design, Chapter 7, Pradhan, Prentice Hall 1996.
A common problem in software fault tolerance is the need to reconnect partner software processes quickly and efficiently after one or more of the partners has failed over to a redundant backup copy. In many systems, this is achieved by using special hardware or operating system facilities to allow such reconnection. However, such mechanisms are difficult to implement in heterogeneous distributed systems.
The present invention, known as a “join”, is a means of connecting two partner processes and automatically reconnecting them after one or both of the partner processes fails over to a redundant backup copy. This is achieved by use of a join manager component that allows the partner processes to register the joins between them and manages the reconnection after a fail over.
The present invention has the following advantages over prior art:
The present invention is independent of the system hardware architecture or operating system, and can be used in heterogeneous distributed systems.
The partner processes associated with a join do not need to know whether the other processes associated with a join employ a redundancy scheme or what that redundancy scheme may be, yet can be reconnected successfully even if more than one partner to a join fails simultaneously.
The join manager actively controls reconnection, so no polling mechanism is required in the partner processes to achieve the reconnection. This saves on processor and communications resources by avoiding repeated unsuccessful attempts to poll a failed partner to see if it has recovered.
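By way of illustration only, the registration and manager-driven reconnection described above might be sketched as follows in Python. This is a minimal sketch under stated assumptions and is not the implementation of the invention: the names JoinManager, Endpoint, register and fail_over, the use of a plain address string, and the callback signature are all hypothetical and do not appear in the text above.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical callback type: invoked with (join_id, partner_address).
ReconnectCallback = Callable[[str, str], None]


@dataclass
class Endpoint:
    address: str                     # where the current primary copy runs
    on_reconnect: ReconnectCallback  # how the manager reaches that copy


class JoinManager:
    """Illustrative only: records joins and pushes reconnection notices."""

    def __init__(self) -> None:
        # join_id -> {process_name: Endpoint}
        self._joins: Dict[str, Dict[str, Endpoint]] = {}

    def register(self, join_id: str, process_name: str, address: str,
                 on_reconnect: ReconnectCallback) -> None:
        """A partner process registers its side of a join."""
        members = self._joins.setdefault(join_id, {})
        members[process_name] = Endpoint(address, on_reconnect)

    def fail_over(self, process_name: str, new_address: str,
                  on_reconnect: ReconnectCallback) -> None:
        """Record that a backup copy has taken over and reconnect its joins.

        Partners are told only the new address; they learn nothing about
        the redundancy scheme that produced it.
        """
        for join_id, members in self._joins.items():
            if process_name not in members:
                continue
            # Replace the failed copy's endpoint with the backup's endpoint.
            members[process_name] = Endpoint(new_address, on_reconnect)
            for name, partner in members.items():
                if name == process_name:
                    continue
                # Push the new address to the partner (no polling needed).
                partner.on_reconnect(join_id, new_address)
                # Tell the new copy where the partner currently lives.
                on_reconnect(join_id, partner.address)
```

In this sketch the partner processes never poll; they are called back only when the manager has a new address to report. Because the manager always holds the latest endpoint for every member of a join, two fail overs arriving one after the other still leave both new copies holding each other's current address.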