This invention relates generally to computer systems and more particularly to fault tolerant computer systems.
As it is known in the art, there are computer system applications where it is important to provide a continuously operating computer system even in the presence of some fault in the computer system to ensure that instructions or control are not lost and that data are not corrupted. Examples of such applications include computer processing in the financial industry and critical industrial facilities such as nuclear power plants and the like in which a failure of a computer system will cause serious disruption or catastrophic loss of data or control.
Fault tolerant computer systems have been developed to provide varying degrees of redundancy by providing duplicate systems or system components so that data processing can continue even in the presence of some failure within the computer system.
Several approaches to providing a fault tolerant computer system are known in the art. In one approach, multiple multiprocessors each with its own memory resources conduct independent tasks. In another approach multiple multiprocessors share a common memory and conduct independent tasks with the common memory. A further approach is to use a plurality of multiprocessors each conducting identical tasks in unison. This last approach provides periodic checks of the processing at locations in the computer to verify that the multiprocessors are operating on the identical tasks in unison, and thus that the different multiprocessors are operating in so called "lock step."
A preferred example of the latter approach is described in U.S. Pat. No. 4,907,228 by Bruckert et al. and assigned to the assignee of the present invention. In this approach a fault tolerant computer system includes two computer zones each operating on a common set of instructions. Each zone further includes a pair of processors operating on a common set of instructions. In the event of a fault in one of the zones, as detected by one of the processors in said zone losing lock-step operation, the faulty zone is replaced while the good zone continues with processing. Thus, to a computer user the fault is not noticed. Thereafter, the replaced zone is synchronized with the operating zone to restore fault tolerance to the fault tolerant computer.
It is also known in the art of fault tolerant computers and computer systems in general that random errors occur in memory due to various mechanisms. One particular mechanism which is common is so called radiation induced disruption. Of particular concern is alpha particle disruption in which an alpha particle (a helium nucleus) strikes a circuit component such as a capacitor or a transistor in a memory cell of a semiconductor memory device causing a disruption (i.e. change in logic state) of the data bit stored in the memory cell. These types of errors occur randomly and frequently. Moreover, as semiconductor memory cells decrease in size the probability of any cell being disrupted by alpha particle radiation or other radiation correspondingly increases.
Generally, this type of low-level radiation is distinguished from high level radiation such as gamma and x-ray radiation because in the latter case massive disruption and potential permanent damage to the memory device occurs whereas in the former case random, scattered, non physically damaging disruption occurs.
The former low level radiation induces so called "soft errors" or a change in the logic state of the cell. However, these soft errors can be repaired by rewriting the cell since the cell itself is not permanently damaged.
Generally, these types of soft errors are handled in a computer system by using a single bit error correcting two bit error detecting code.
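The behavior of a single bit error correcting, two bit error detecting code can be illustrated with a small sketch. The example below is not taken from any particular computer system; it assumes an extended Hamming code over 4 data bits (8 bits in total), chosen only because it is small enough to follow by hand. The function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Illustrative assumption: index 0 holds an overall parity bit, and
    indices 1..7 form a Hamming(7,4) code with parity bits at 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    overall = sum(code) % 2                 # 1 means an odd number of bits flipped

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1                 # single error; syndrome gives its position
        status = "corrected"
    else:
        return "uncorrectable", None        # two errors: detected but not correctable

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

In this sketch a single flipped bit anywhere in the codeword is repaired on decode, whereas two flipped bits are reported but cannot be repaired, which is the limitation that makes an accumulation of soft errors at one address significant.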
In fault tolerant computer systems in particular, so called "scrubbing" of memory locations is often used to correct errors in data stored in memory. In a known technique, the central processing unit of the fault tolerant computer initiates a read operation of a particular memory location to transfer data at that location to a target device such as another memory, an input/output device or the CPU itself. As the data from the particular memory location is read, various techniques are used to determine whether the read data has an error and, if there is an error, to correct the data. In any event corrected or good data are provided to the target device. If an error was detected, however, the CPU stores the address of the faulty location and at some later, convenient time the CPU rereads the data, corrects it again and writes the corrected data back into the memory location, thus correcting or scrubbing the error from the memory location.
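The read, correct, and deferred write-back sequence just described can be sketched as follows. This is a toy model under stated assumptions, not the mechanism of any actual system: each location stores three copies of a value and a majority vote stands in for the error correcting hardware, and the class and method names (`ScrubbingMemory`, `inject_error`, `scrub_pending`) are hypothetical.

```python
class ScrubbingMemory:
    """Toy memory model; a three-copy majority vote stands in for ECC hardware."""

    def __init__(self, size):
        self.cells = [[0, 0, 0] for _ in range(size)]
        self.pending = []                   # addresses where a read saw an error

    def write(self, addr, value):
        self.cells[addr] = [value, value, value]

    def inject_error(self, addr, copy, bad_value):
        """Simulate a soft error in one stored copy."""
        self.cells[addr][copy] = bad_value

    def _vote(self, addr):
        a, b, c = self.cells[addr]
        good = a if a in (b, c) else b      # assumes at most one copy is corrupted
        return good, not (a == b == c)

    def read(self, addr):
        """Deliver corrected data; the stored copy is left faulty for now."""
        value, had_error = self._vote(addr)
        if had_error and addr not in self.pending:
            self.pending.append(addr)
        return value

    def scrub_pending(self):
        """At a later, convenient time: reread, correct, and write back."""
        while self.pending:
            addr = self.pending.pop()
            value, _ = self._vote(addr)
            self.write(addr, value)
```

Note that in this model a corrupted location which is never read never reaches the `pending` list, so `scrub_pending` never repairs it.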
One problem with this approach is that faulty memory locations are generally found only if the central processing unit happens to read the faulty location during a normal processing flow of the computer system. If a computer system accesses a portion of a memory relatively infrequently, such portion of memory becomes susceptible to an accumulation of single bit errors at the same memory address. If two or more single bit errors occur at the same memory address the single bit correcting, two bit error detecting code would be inadequate to correct the two errors. This would necessitate providing the computer with a more powerful error correcting code which becomes expensive due to potential slowing of processing speed and the need for increased circuitry in the computer including increasing memory requirements.
An alternative technique uses a software routine to have the central processing unit periodically read every location in the memory as a background task (software memory scrubbing). There are several problems with this approach however, particularly as applied to fault tolerant computer systems.
One problem with this approach is that correcting data as a background task still allows for the accumulation of single bit errors in memory. In particular for fault tolerant memory systems, an accumulation of uncorrected single bit errors increases the probability of having two errors at the same memory address. As mentioned above, fault tolerant computer systems typically use single bit error correcting and 2-bit error detecting error codes. Such an accumulation of errors will therefore cause an interruption in the operation of the fault tolerant computer, since the error codes and hardware associated with the computer data are incapable of correcting more than a single bit error. Depending upon the type of fault tolerant computer, the computer could then enter a complex software routine to attempt to correct the multi-bit error. At best, this approach would waste valuable processor time.
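The effect of leaving a location uncorrected for longer can be estimated with a rough calculation. The sketch below assumes that each bit is upset independently at a constant rate and that a word is 72 bits wide; both the rate and the word width are hypothetical values chosen for illustration, and the function name `p_multi_bit` is likewise an assumption.

```python
import math


def p_multi_bit(bit_upset_rate, interval, bits_per_word=72):
    """Probability that a word collects two or more upset bits in `interval`.

    Assumes independent upsets per bit at a constant rate (hypothetical model);
    `bit_upset_rate` and `interval` must use the same time unit.
    """
    # Probability that a given bit is upset at least once during the interval.
    p = -math.expm1(-bit_upset_rate * interval)
    p_none = (1.0 - p) ** bits_per_word
    p_one = bits_per_word * p * (1.0 - p) ** (bits_per_word - 1)
    return 1.0 - p_none - p_one
```

For small upset probabilities this quantity grows roughly with the square of the interval between corrections, so under these assumptions a location that is corrected ten times less often is about one hundred times more likely to hold an uncorrectable error.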
The potential for soft errors presents a further problem for those types of fault tolerant systems employing multiple central processing units operating on identical tasks, since disruption of one of the CPU's prevents the multiple CPU's from running in so called "lock step." As mentioned above, fault tolerant memory systems relying on replicated components present special problems for single bit errors. The problem is that during a recovery from a fault in the system, it is necessary not to degrade the system or system performance and operation. Accordingly, an accumulation of single bit errors in memory presents a particular problem for fault tolerant computers, since single bit errors caused by alpha particles and the like occur relatively frequently.
Moreover, the use of elaborate machine check algorithms to resolve these errors is undesirable since in replicated fault tolerant systems, each CPU has to perform the same operations at the same time to remain in lock step and thus preserve fault tolerance. Thus, if one CPU receives a single bit error and the others do not it becomes very difficult to maintain all the CPU's in the fault tolerant system at the same state and thus in lock step.
This application contains disclosure also contained in copending applications:
1) Ser. No. 07/998,716 by McCollum et al., filed Dec. 30, 1992, entitled "A Method and Apparatus for Preservation of System Information in a Fault Tolerant System",
2) Ser. No. 07/998,715 by Bruckert et al., filed Dec. 30, 1992, entitled "A Method and Apparatus for Reducing Checking costs in a Fault Tolerant Processors" and issued as U.S. Pat. No. 5,339,408,
3) Ser. No. 07/998,714 by Hawkins et al., filed Dec. 30, 1992, entitled "Apparatus and Method of Data Transfer Between Systems Using Different Clocks" and issued as U.S. Pat. No. 5,347,559,
4) Ser. No. 07/998,977 by Hawkins et al., filed Dec. 30, 1992, entitled "An Improved Phase Detector Apparatus" and issued as U.S. Pat. No. 5,539,345.