The invention relates generally to fault-tolerant transaction processing systems formed from multiple processor units to maintain information collections (e.g., a database), and to modify those collections from time to time. More particularly, the invention relates to a method for detecting the loss of a processor unit participating in a transaction that is in the process of changing the state of the information collection maintained by the system.
Concern about protecting and maintaining the integrity of information collections in the face of updates and changes to that information has resulted in the development of a programmatic construct called a transaction. A useful definition of a transaction is that it is an explicitly delimited operation, or set of related operations, that changes or otherwise modifies the content of the information collection or database from one consistent state to another. Changes are treated as a single unit in that either all changes of a transaction are formed and made permanent (i.e., the transaction is "committed") or none of the changes are made permanent (i.e., the transaction is "aborted"). If a failure occurs during the execution of a transaction, the transaction can be aborted and whatever partial changes were made to the collection can be undone to leave it in a consistent state.
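The all-or-nothing behavior described above can be illustrated with a minimal sketch. It assumes the information collection is an in-memory Python dictionary and that partial changes are undone from a simple undo log; the names `Transaction`, `write`, and `transfer` are hypothetical and are not taken from any particular transaction manager.

```python
_MISSING = object()  # marks a key that did not exist before the transaction


class Transaction:
    """Hypothetical all-or-nothing update of a dict-based collection."""

    def __init__(self, collection):
        self.collection = collection
        self.undo_log = []      # (key, previous value) pairs, oldest first
        self.state = "active"

    def write(self, key, value):
        if self.state != "active":
            raise RuntimeError("transaction is no longer active")
        # Remember the old value so the change can be undone on abort.
        self.undo_log.append((key, self.collection.get(key, _MISSING)))
        self.collection[key] = value

    def commit(self):
        # The changes are already in place; discarding the undo log
        # is what makes them permanent in this sketch.
        self.undo_log.clear()
        self.state = "committed"

    def abort(self):
        # Undo partial changes in reverse order to restore the prior state.
        for key, old in reversed(self.undo_log):
            if old is _MISSING:
                self.collection.pop(key, None)
            else:
                self.collection[key] = old
        self.undo_log.clear()
        self.state = "aborted"


def transfer(accounts, src, dst, amount):
    """Move `amount` from src to dst as one transaction (illustrative only)."""
    txn = Transaction(accounts)
    try:
        txn.write(src, accounts[src] - amount)  # partial change applied
        txn.write(dst, accounts[dst] + amount)  # KeyError if dst is unknown
        txn.commit()
    except KeyError:
        txn.abort()                             # src is restored
    return txn.state
```

In this sketch a failure between the two writes leaves the debit of the source account in place only until `abort` runs, after which the collection is back in its earlier consistent state.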
Typically, transactions are performed under the supervision of a transaction manager facility (TMF). In geographically distributed systems, such as multiple processor unit systems or "clusters" (i.e., a group of independent processor units managed as a single system), each processor unit will have its own TMF component to coordinate transaction operations conducted on that processor unit. The processor unit at which (or on which) a transaction begins is sometimes called the "beginner" processor unit, and the TMF component of that processor unit will operate to coordinate those transactional resources remote from its resident processor unit (i.e., resources managed by other processor units). Those TMF components running on processor units managing resources enlisted in a transaction are "participants" in the transaction. It is the TMF component of the beginner processor unit that initiates the steps taken.
Fault tolerance is another important feature of transaction processing. Being able to detect and tolerate faults allows the integrity of the collection being managed by the system to be protected. Although a number of different methods and facilities exist, one particularly effective fault-tolerant technique is what is sometimes called the "process-pair" technique. According to this technique, each process running on each processor unit of a multiple processor system will have a backup process on another processor unit of the system. If a process, or the processor unit upon which the process is running, fails, that failure will bring into operation the backup process to take over the operation of the lost (primary) process. If that failure occurs during a transaction in which the lost process was a participant, the backup will decide whether or not to notify the beginner processor unit to abort the transaction and begin over again. In this way the state of the collection managed by the system remains consistent.
The process-pair paradigm uses what is sometimes called a "Heartbeat" or "I'm Alive" approach to detecting failure of a processor unit. Briefly, according to this approach, each processor unit is required to periodically broadcast an "I'm Alive" message to all other processor units of the system. If the heartbeat message of a particular processor unit has not been received by its siblings within a predetermined period of time, the silent processor unit is assumed to have failed, and all primary processes resident on or associated with the now assumed failed processor unit will be taken over by their backup processes on the other processor units of the system. Each backup process, when taking over, will investigate whether or not it was involved in a transaction, and if so, decide whether or not to abort the transaction. An example of the process-pair concept using "I'm Alive" detection of processor failures can be found in U.S. Pat. No. 4,817,091.
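The timeout test at the heart of the "I'm Alive" approach can be sketched as follows. This is a minimal illustration only, not the mechanism of the cited patent: the class `HeartbeatMonitor`, its method names, and the processor unit labels are hypothetical, and timestamps are passed in explicitly (rather than read from a clock) so that the behavior is easy to follow.

```python
class HeartbeatMonitor:
    """Hypothetical per-processor record of when each sibling was last heard."""

    def __init__(self, peers, timeout):
        self.timeout = timeout                      # the predetermined period
        self.last_seen = {peer: 0.0 for peer in peers}

    def record(self, peer, now):
        # Called when an "I'm Alive" message from `peer` arrives at time `now`.
        self.last_seen[peer] = now

    def failed_peers(self, now):
        # A peer that has been silent for longer than the timeout is
        # assumed to have failed.
        return sorted(
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(["cpu1", "cpu2"], timeout=2.0)
monitor.record("cpu1", now=1.0)
monitor.record("cpu2", now=1.0)
monitor.record("cpu1", now=3.0)          # cpu2 stays silent after t = 1.0
suspected = monitor.failed_peers(now=3.5)
```

Here `suspected` contains only `"cpu2"`, whose last message is 2.5 time units old; in a process-pair system, the backups of the primary processes on that unit would then take over.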
But there are times when a process may not have a backup process, even though it is resident in a multiple processor system employing process-pair fault tolerance. If that process is a participant in a transaction, and the processor unit upon which that process runs fails, the TMF component on the beginner processor unit may be aware of the failure and the loss of the processor unit, but not of the loss of the participant process. If a modification to be made by the participant process was never made, yet the other participants were able to complete their modifications, the result can severely damage the integrity of the managed collection, i.e., the collection is now inconsistent.
Accordingly, it can be seen that there exists a need for a fault-tolerant method of notifying a transaction manager of the loss of a participant process as a result of the associated processor unit failing, separate and apart from employment of a process-pair fault detection technique.