A distributed recovery block is a method of integrating hardware and software fault tolerance in a single structure without having to resort to N-version programming. In N-version programming, the goal is to design and code the software module n times and vote on the n results produced by these modules. The recovery block structure represents a dynamic redundancy approach to software fault tolerance. In dynamic redundancy, a single program or module is executed and the result is subjected to an acceptance test. Alternate versions are invoked only if the acceptance test fails. The selection of the routine is made during program execution. In its simplest form, as shown in FIG. 1, a standard recovery block structure 100 consists of: a primary routine 110, which executes a critical software function; an acceptance test 120, which tests the output of the primary routine after each execution; and at least one alternate routine 115, which performs the same function as the primary routine and is invoked by the acceptance test 120 upon detection of a failure.
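The recovery block structure described above can be illustrated with a minimal sketch. This is not taken from the source: the names `recovery_block`, `RecoveryBlockFailure`, and the integer square root routines are hypothetical, and the sketch assumes that an exception raised by a routine is handled the same way as a failed acceptance test.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class RecoveryBlockFailure(Exception):
    """Raised when no routine produces a result that passes the acceptance test."""


def recovery_block(
    routines: Sequence[Callable[[T], R]],
    acceptance_test: Callable[[T, R], bool],
    value: T,
) -> R:
    """Run the routines in order (primary first) and return the first accepted result."""
    for routine in routines:
        try:
            result = routine(value)
        except Exception:
            # Assumption: a crash in a routine counts as a failed acceptance test.
            continue
        if acceptance_test(value, result):
            return result
    raise RecoveryBlockFailure("all routines failed the acceptance test")


# Hypothetical routines: a deliberately faulty primary and a correct alternate.
def primary_isqrt(n: int) -> int:
    return n // 2  # wrong for most inputs, so the acceptance test rejects it


def alternate_isqrt(n: int) -> int:
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r


def isqrt_acceptance(n: int, r: int) -> bool:
    # Accept r only if it is the integer square root of n.
    return r >= 0 and r * r <= n < (r + 1) * (r + 1)
```

Calling `recovery_block([primary_isqrt, alternate_isqrt], isqrt_acceptance, 17)` rejects the primary result and falls back to the alternate routine, which illustrates that the choice of routine is made at run time.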
In a distributed recovery block 101 the primary and alternate routines are both replicated and are resident on two or more nodes interconnected by a network. This technique enables standby sparing fault tolerance where one node 105a (the active node) is designated primary and another node 105b (the standby node) is a backup. Under fault-free circumstances, the primary node 105a runs the primary routine 110 whereas the backup node 105b runs the alternate routine 115 concurrently.
In case of a failure, the primary node 105a attempts to inform the backup through the monitor 108 via the heartbeat thread 107. When the backup receives notification, it assumes the role of the primary node. Since the backup node has been processing the alternate routine 115 concurrently, a result is available immediately for output. Consequently, recovery time for this type of failure should be much shorter than if both blocks were running on the same node. If the primary node 105a stops processing entirely, no update message will be passed to the backup. The backup detects the crash by means of a local timer, whose expiry constitutes a time acceptance test.
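The local timer used by the backup to detect a crashed primary might be sketched as follows. The class name `HeartbeatWatchdog` and the timeout value are hypothetical, and the current time is passed in explicitly so that the behavior is deterministic; a real node would more likely read a monotonic clock such as `time.monotonic()`.

```python
class HeartbeatWatchdog:
    """Backup-side local timer; expiry is treated as a failed time acceptance test."""

    def __init__(self, timeout_s: float, now: float) -> None:
        self.timeout_s = timeout_s  # hypothetical tolerance for a missing update
        self.last_seen = now        # time of the most recent update message

    def on_heartbeat(self, now: float) -> None:
        # Any update message from the primary restarts the timer.
        self.last_seen = now

    def expired(self, now: float) -> bool:
        # True once the primary has been silent for longer than the timeout.
        return (now - self.last_seen) > self.timeout_s
```

In this sketch the backup would poll `expired()` on each cycle and assume the primary role once it returns `True`.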
The failed primary node transitions to a backup node, and a recovery block reconfiguration strategy ensures that the two nodes do not execute the same routine.
A distributed recovery block with real-time process control is referred to as an extended distributed recovery block (EDRB) 102. The EDRB includes a supervisor node 103, connected to the network, that verifies failure indications and arbitrates inconsistencies. The EDRB also uses regular, periodic heartbeat status messages.
In EDRB, nodes responsible for control of the process and related systems are called operational nodes and are critical. The operational nodes perform real-time control and store unrecoverable state information. A set of dual redundant operational nodes is called a node pair. A set of multiple redundant operational nodes is called a node set.
Regular, periodic status messages are exchanged between node pairs and each node pair in a node set. The messages are referred to as heartbeats. A node is capable of recovering from failures in its companion in standalone fashion if the malfunction has been declared as part of the heartbeat message. If a node detects the absence of its companion's heartbeat, it requests confirmation of the failure from a second kind of node called the supervisor. Although the supervisor is important to EDRB operation, the supervisor node 103 is typically not crucial, because its failure only impacts the ability of the system to recover from failures that require its confirmation or arbitration. The EDRB system can continue to operate without a supervisor 103 if no other failures occur.
FIG. 1 shows the software structure in a node pair. Operational nodes employ active redundancy. One member of the node pair is always active; the other, if it is functional, is always standby. The active node 105a executes a primary version of a control process in parallel with an alternate version executed on the standby node. Both nodes check the correctness of the control outputs with the acceptance test 120.
Within an operational node, the EDRB is implemented as a set of processes communicating between node pairs and the supervisor 103 to control fault detection and recovery. The two processes responsible for node-level fault decision making are the node manager 106 and the monitor 108. The node manager 106 determines the role of the local node (active or standby) and subsequently triggers the use of either the primary routine 110 or the alternate routine 115. If the primary routine acceptance test is passed, the node manager 106 permits a control signal to be passed to the device drivers 130 under its control. If the acceptance test is not passed, the active node manager 106a requests the standby node manager to promote itself to active and immediately send out its result to minimize recovery time.
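One control cycle of a node pair, as governed by the node managers, might look like the sketch below. Everything here is illustrative: the class `NodeManager`, the helper `run_cycle`, and the example routines are hypothetical names, a shared list stands in for the device drivers 130, and direct method calls stand in for the messages that the real node managers would exchange over the network. The sketch also assumes that the demoted node becomes the standby and that the routine is selected from the role on every cycle.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Sequence


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


@dataclass(eq=False)
class NodeManager:
    role: Role
    primary: Callable[[float], float]
    alternate: Callable[[float], float]
    acceptance_test: Callable[[float], bool]
    drivers: List[float]  # shared list standing in for the device drivers
    companion: Optional["NodeManager"] = None
    pending: Optional[float] = None  # accepted result of the current cycle, if any

    def compute(self, x: float) -> None:
        # The role decides which routine runs: primary if active, alternate if standby.
        routine = self.primary if self.role is Role.ACTIVE else self.alternate
        result = routine(x)
        self.pending = result if self.acceptance_test(result) else None

    def commit(self) -> None:
        if self.role is not Role.ACTIVE:
            return
        if self.pending is not None:
            self.drivers.append(self.pending)  # acceptance test passed: emit output
        elif self.companion is not None:
            self.role = Role.STANDBY           # acceptance test failed: step down
            self.companion.promote()

    def promote(self) -> None:
        self.role = Role.ACTIVE
        if self.pending is not None:
            self.drivers.append(self.pending)  # result is already available


def run_cycle(pair: Sequence[NodeManager], x: float) -> None:
    for node in pair:
        node.compute(x)  # both nodes compute concurrently in the real system
    active = next(node for node in pair if node.role is Role.ACTIVE)
    active.commit()


# Hypothetical routines: the primary misbehaves for inputs above 10.
def faulty_primary(x: float) -> float:
    return -1.0 if x > 10 else 2 * x


def alternate(x: float) -> float:
    return 2 * x


def non_negative(result: float) -> bool:
    return result >= 0
```

Because the standby has already computed and checked its own result, promotion only has to release that result to the drivers, which is the source of the short recovery time.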
The monitor 108 associated with the node manager 106 is concerned primarily with generating the heartbeat and determining the state of the companion node. The heartbeat is a ping or other rudimentary signal indicating functionality of the respective node. When an operational node fails to issue a heartbeat, the monitor process of its companion node requests permission from the supervisor to assume control, if the companion is not already in the active role. If the supervisor 103 concurs that a heartbeat is absent, consent is transmitted and the standby node 105b promotes itself to active node.
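The confirmation step between a monitor and the supervisor might be sketched as follows. The classes `Supervisor` and `Monitor`, their method names, and the timeout are hypothetical. The sketch assumes that the supervisor keeps its own record of when it last heard from each node, and that a standby without a reachable supervisor simply remains standby.

```python
from typing import Dict, Optional


class Supervisor:
    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_heartbeat: Dict[str, float] = {}  # node id -> last time heard

    def record_heartbeat(self, node_id: str, now: float) -> None:
        self.last_heartbeat[node_id] = now

    def confirms_absent(self, node_id: str, now: float) -> bool:
        # Consent is given only if the supervisor also sees no recent heartbeat.
        last = self.last_heartbeat.get(node_id)
        return last is None or (now - last) > self.timeout_s


class Monitor:
    def __init__(self, companion_id: str, role: str, timeout_s: float, now: float) -> None:
        self.companion_id = companion_id
        self.role = role  # "active" or "standby"
        self.timeout_s = timeout_s
        self.last_companion_heartbeat = now

    def on_companion_heartbeat(self, now: float) -> None:
        self.last_companion_heartbeat = now

    def check(self, supervisor: Optional[Supervisor], now: float) -> str:
        silent = (now - self.last_companion_heartbeat) > self.timeout_s
        # A standby promotes itself only with the supervisor's consent.
        if silent and self.role == "standby" and supervisor is not None:
            if supervisor.confirms_absent(self.companion_id, now):
                self.role = "active"
        return self.role
```

Passing `None` as the supervisor models a failed supervisor: the node pair keeps running, but a standby cannot take over on a missing heartbeat alone.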
The active node 105a may spuriously decide to become a standby node, or a standby node may make an incorrect decision to assume control. In response, the supervisor node 103 detects the problem from the periodic status reports and sends an arbitration message to the operational nodes in order to restore consistency.
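The arbitration step can be illustrated with a small sketch. The function `arbitrate` and the tie-break rule are assumptions made for illustration only: the source does not say how the supervisor chooses which node should be active, so the caller supplies a hypothetical `preferred_active` node.

```python
from typing import Dict


def arbitrate(reported_roles: Dict[str, str], preferred_active: str) -> Dict[str, str]:
    """Return the roles to impose on a node pair, or an empty dict if consistent.

    reported_roles maps a node id to the role ("active" or "standby") taken
    from that node's latest status report.
    """
    if preferred_active not in reported_roles:
        raise ValueError("preferred_active must be one of the reporting nodes")
    active = [node for node, role in reported_roles.items() if role == "active"]
    if len(active) == 1:
        return {}  # exactly one active node: no arbitration message is needed
    # Both active or both standby: impose a consistent assignment.
    return {
        node: ("active" if node == preferred_active else "standby")
        for node in reported_roles
    }
```

An empty result means the status reports are already consistent; otherwise the returned mapping stands for the content of the arbitration message.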
In many computer networks, particularly in communication systems, the supervisory node 103 is critical, providing frame synchronization and connection routes between the network and users. Thus, the loss of a supervisory node results in loss of the node function. There is therefore a need for a multiple redundant architecture in which not only the nodes but also the network is replicated. In addition, there is a need for implementation of agent-oriented software to facilitate the functionality of such an architecture.