As is known in the art, transaction processing often requires a continuously operating computer environment to ensure that transactions are not lost and data are not corrupted. For example, the loss of computer resources in the banking or commodities industry for any length of time could result in the loss of millions of dollars. When there is a fault in the computer, such as a memory error or a processor error, the delay until the computer is ready to restart processing may range from a few minutes to a few days. Fault tolerant computers, which are able to recover from many common computer faults and continue processing, minimize the loss of computer resources in the event of a computer failure.
Fault tolerant computers use a mixture of hardware redundancy and error checking mechanisms to provide a continuously operating computer environment. For example, dual processing systems, each referred to as a zone and including a processor module, a memory module and an I/O interface module, may function simultaneously on a common set of instructions. Both processing systems are connected to identical I/O devices and perform identical transactions. In the event that one of the processors (zones) fails, the other processor may continue processing the transactions while the first processor is repaired. When the first processor is repaired, it is reinstalled in the dual processor system, and a resynchronization process is initiated to enable the processors to regain simultaneous operation.
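By way of illustration only, the zone behavior described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular system: the names `Zone`, `DualZoneSystem`, `process`, `fail` and `resynchronize` are hypothetical, and resynchronization is modeled simply as copying the surviving zone's state to the repaired zone, which is a detail the description above does not specify.

```python
class Zone:
    """One processing system (zone). The ledger is a stand-in for zone state."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.ledger = []  # transactions this zone has applied

    def apply(self, txn):
        self.ledger.append(txn)


class DualZoneSystem:
    """Two zones performing identical transactions (hypothetical model)."""

    def __init__(self):
        self.zones = [Zone("A"), Zone("B")]

    def process(self, txn):
        # Every online zone performs the same transaction.
        active = [z for z in self.zones if z.online]
        if not active:
            raise RuntimeError("no zone available")
        for zone in active:
            zone.apply(txn)

    def fail(self, index):
        # Take the faulty zone out of service; the other zone continues.
        self.zones[index].online = False

    def resynchronize(self, index):
        # Assumption: the repaired zone regains simultaneous operation by
        # copying state from a surviving zone before going back online.
        repaired = self.zones[index]
        donor = next(z for z in self.zones if z.online)
        repaired.ledger = list(donor.ledger)
        repaired.online = True
```

In this sketch, a transaction processed while one zone is out of service is applied only by the surviving zone, and the repaired zone receives it during resynchronization, after which both zones again hold identical state.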
Further hardware redundancy on the processor module of each processing system also improves the fault tolerance of the dual processor system. For example, U.S. Pat. No. 4,907,228, titled "Dual-Rail Processor With Error Checking at Single Rail Interfaces," issued Mar. 6, 1990 to Bruckert et al., describes a fault tolerant computing system including duplicate systems, called zones, wherein each zone includes duplicate processing systems, herein referred to as "rails," operating simultaneously on the same instructions.
The processing system of Bruckert et al. is a dual zone, dual rail system. The two rails in each zone are referred to as the primary rail and the mirror rail, and each includes a separate CPU, cache controller, cache memory, memory controller, and internal bus. Each internal bus includes lines for carrying data signals, ECC signals, and address and control signals. Each of the CPUs provides address information, error information, and 32 bits of data to its respective internal bus, and these signals are checked for consistency by the memory controllers. One memory module is shared by both rails. The memory controllers provide a dual rail-to-single rail interface to the memory module.
Error checking is provided at the dual rail-to-single rail interface using two techniques. In the first error checking technique (used, for instance, during a microprocessor write to memory), the second memory controller compares the data signals from the second CPU and memory controller with the data signals from the first CPU to ensure that the data on both lines are the same. Data on one of the buses is stored in a memory device on the memory module. The memory module compares the addresses, control signals and ECCs from each of the memory controllers to detect any inequality. The second error checking technique (used, for instance, during a microprocessor read from memory) involves each memory controller generating its own ECC from the memory data.
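The first error checking technique may be illustrated by the following minimal sketch of a checked write. It is offered under stated assumptions and is not the circuit of Bruckert et al.: the names `RailOutput`, `make_rail_output`, `parity_ecc` and `checked_write` are hypothetical, control signals are omitted, and a single even-parity bit stands in for the ECC, whose actual code is not specified above.

```python
from dataclasses import dataclass


def parity_ecc(data: int) -> int:
    """Stand-in check code: one even-parity bit over a 32-bit word.

    Assumption for illustration only; a real ECC is a multi-bit code.
    """
    return bin(data & 0xFFFFFFFF).count("1") & 1


@dataclass
class RailOutput:
    """Signals one rail presents at the dual rail-to-single rail interface."""
    address: int
    data: int
    ecc: int


def make_rail_output(address: int, data: int) -> RailOutput:
    return RailOutput(address, data, parity_ecc(data))


def checked_write(memory: dict, primary: RailOutput, mirror: RailOutput) -> bool:
    """Store the data from one rail only if both rails agree.

    Returns True when the write is performed, False when an inequality
    between the rails is detected.
    """
    # Data comparison, performed by a memory controller in the text above.
    if primary.data != mirror.data:
        return False
    # Address and ECC comparison, performed by the memory module.
    if (primary.address, primary.ecc) != (mirror.address, mirror.ecc):
        return False
    # Only the data from one of the buses is actually stored.
    memory[primary.address] = primary.data
    return True
```

For example, calling `checked_write` with two identical rail outputs stores the word, whereas rail outputs that differ in data or address leave the memory unchanged and report the inequality to the caller.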
A system using replicated hardware and data paths together with error checking mechanisms provides the three basic functions needed to achieve a high degree of fault tolerance. Specifically, such a system allows for the detection and reporting of an error, the removal of the effects of the error, and the return of the system to full redundancy upon recovery from the error.
However, the replication of the hardware and data paths for each rail increases the expense and complexity of the fault tolerant design. A problem is encountered when combining the fault tolerant approach of hardware replication with the latest architectures and processors, many of which utilize wider data buses of 64 or 128 bits to decrease memory latency and thereby increase system performance. The use of a larger data bus increases the problems associated with routing the data bus, increases the signal noise levels, and also increases the number of memory components necessary to buffer data on the data bus. Thus, although replication of hardware ensures fault tolerance, as technology advances, replication becomes undesirable due to the increased complication and expansion associated with the replicated, fault tolerant design.