Two classes of hardware-related errors are generally recognized in computational systems: hard errors and soft errors. A hard error manifests as improper behavior of a computer system that persists, continuing to cause improper behavior and incorrect results for a significant period after the initial error occurs. A soft error is a non-recurring error generated by a temporary anomaly in a computer hardware device. Soft errors involve improper behavior of the computer system that does not persist beyond a certain period of time; after this time has elapsed, further operation of the system proceeds normally.
As the physical devices that make up computer systems have become smaller and more numerous, many recurring physical phenomena are now more likely to cause temporary faults in these devices. Such faults disrupt the digital logic and state of a computing system and often result in soft errors. Soft errors are generally more difficult to detect than hard errors. They are also assumed to be more frequent than hard errors, and to occur often enough that their effect should be considered in computer system design. Undetected soft errors can cause incorrect results to be reported for a computation, corrupt data to be stored to disk or other persistent media or transmitted over network connections, or anomalous behavior of a program or of the entire computer system. It is desirable to provide error detection coverage for the subsystems of the computer system architecture that have the highest error rates, using techniques that detect soft errors and, optionally, hard errors. These subsystems typically include the system main memory, the various levels of processor caches, the system translation lookaside buffers (TLBs), and the I/O and interconnection ‘fabric’. When an error is detected, it is often desirable to provide a way of correcting it so that the computation can continue and produce a correct result. Where such coverage is provided, an error that occurs in one of these subsystems is detected and corrected before it is delivered to other subsystems, so the other subsystems need not address it. This leaves the uncovered subsystems to be addressed. In many computer system designs, large portions of the central processing unit are not covered by error detection or error correction.
With the continuing development of VLSI processors having ever-increasing component density, the susceptibility of these processors to ‘soft’ errors caused by sources such as cosmic rays and alpha particles is becoming an issue in the design of computational systems. Error detecting and correcting codes are widely applied to the design of computer system memory, caches and interconnection fabric to verify correct operation and to provide correction of the representation of data in the event that either soft or hard errors occur. Protecting the processor electronics is a more difficult task since a processor has many more structures of greater complexity and variety than computer memory devices. Existing hardware techniques for protecting the processor electronics require the design and incorporation of significant logical structures to check, contain and recover from errors which might occur in the core structures that make up the processor.
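As a rough illustration of how an error correcting code lets a memory recover from a single flipped bit, the following is a minimal sketch of the textbook Hamming(7,4) code. The function names are hypothetical and chosen for this example only; the code is not taken from any particular memory design, and production memories typically use wider codes implemented in hardware.

```python
def hamming74_encode(data: list[int]) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(word: list[int]) -> tuple[list[int], int]:
    """Return (corrected data bits, 1-based position of the flipped bit or 0)."""
    c = list(word)  # work on a copy so the caller's word is not modified
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        # A nonzero syndrome is the position of the single flipped bit.
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate a soft error flipping one stored bit
data, flipped_at = hamming74_decode(stored)
```

In this sketch a single flipped bit is located and corrected on read. Two flipped bits in the same word would be miscorrected, which is one reason practical designs add further check bits.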
Other processor-oriented error detection techniques provide multiple processors running the same instructions in ‘lock step’, together with self-checking hardware that verifies that all externally visible results from each processor match those of each (or a majority) of its peers. In implementations of these techniques, when the comparisons do not match, additional complexity is required to limit the propagation of any erroneous state. In addition, special procedures must be performed either to rule the result of the computation invalid or to recover the state of the computation. All of this adds to the cost and complexity of the system design.
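The comparison step of a lock-step arrangement can be pictured in software, although the techniques described above perform it in dedicated hardware. The sketch below is illustrative only: `vote` is a hypothetical helper that takes the externally visible output of each replica and returns the majority value together with the replicas that disagreed.

```python
from collections import Counter
from typing import Hashable, Sequence


def vote(outputs: Sequence[Hashable]) -> tuple[Hashable, list[int]]:
    """Return (majority value, indices of replicas that disagree with it)."""
    if not outputs:
        raise ValueError("no replica outputs to compare")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # No strict majority: the result must be ruled invalid.
        raise RuntimeError("replicas disagree and no majority exists")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


# Three replicas, one of which produced a corrupted result.
result, faulty = vote([7, 7, 9])
```

With only two replicas a mismatch can be detected but not resolved, so the sketch raises an error in that case. This corresponds to ruling the result invalid rather than recovering it.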
Software techniques have also been proposed to address errors in computation. Some of these techniques involve fully executing a program multiple times, comparing the results, and re-executing the computation until the results match. All of the techniques described above multiply the computing resources and time required for a computation to complete. Furthermore, some of them will not detect certain classes of hard errors. Other software fault tolerance techniques assume either that a failing computation will stop (‘fail fast’) or that errors will be detected by the error exception checking logic normally incorporated in processor designs. These techniques often provide inadequate coverage of soft errors.
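The re-execute-and-compare approach mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any specific published technique: `run_until_agreement` and its `max_attempts` parameter are hypothetical names, and the computation is assumed to be deterministic and free of side effects.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_until_agreement(compute: Callable[[], T], max_attempts: int = 5) -> T:
    """Re-run compute() until two consecutive runs return equal results."""
    previous = compute()
    for _ in range(max_attempts - 1):
        current = compute()
        if current == previous:
            return current
        previous = current  # mismatch: keep the newer result and try again
    raise RuntimeError("results did not agree within max_attempts runs")


# Simulate a transient fault that corrupts only the first execution.
calls = {"n": 0}


def flaky_sum() -> int:
    calls["n"] += 1
    return 41 if calls["n"] == 1 else 42


answer = run_until_agreement(flaky_sum)
```

Even in the fault-free case the computation runs twice, which illustrates the multiplied cost. A hard error that corrupts every run in the same way would produce matching but incorrect results and pass the comparison undetected.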
From the foregoing, it can be seen that methods for detecting improper operation of computer systems often require extensive hardware and software to detect improper operation, to minimize damage from the resulting incorrect results, and to minimize the number and extent of special actions needed to recover and continue processing after a fault is detected. Such systems have often employed doubly or triply redundant hardware and extensive checking and correction logic beyond that required for the basic computation environment itself. Alternative software fault tolerance techniques typically either require the adoption of specialized programming techniques, which can impact the design of system and application software, or require multiple executions of a program and subsequent comparison of the results.
The implementation of existing techniques for detecting soft errors, either hardware- or software-based, thus requires significant additional hardware, software, and/or other resources.