1. Field of the Present Invention
The invention is in the field of data processing systems and, more particularly, error detection in data processing systems.
2. History of Related Art
Computer systems represent information in bi-state transistors that can assume the logical values of “1” or “0”. These logical values are implemented by electrical signals, where a certain voltage level is assigned to represent a value of “1”, and a second level, sufficiently different from the first, is assigned to represent a value of “0”. Computer systems are susceptible to a type of error called “soft” errors that occur while the system is in operation. These errors result from electrical noise, cosmic rays, thermal effects, and other factors that may alter an electrical signal that is stored in a transistor. For example, cosmic alpha particles can hit a transistor and change the value of the electrical signal stored in it, such that the logical value stored in the transistor can be altered from “1” to “0” or vice versa.
The effect of soft errors is transient; they do not cause any permanent damage to the machine hardware. However, soft errors corrupt the values stored in transistors used in the computation, and thus the machine may produce incorrect results for the programs that were running when the soft error occurred. Computer designers have recognized these problems since the early days of computing and invented several mechanisms of redundancy to overcome them. Three notable techniques are Error Checking and Correction (ECC) codes, hardware system redundancy, and circuit-level testing and redundancy. ECC codes are used in substantially every computer to detect and possibly recover from the effects of soft errors on the values stored in main memory and machine registers. Depending on the code, one can detect one-bit errors, two-bit errors, etc. Moreover, some codes can be used to undo or correct the effects of soft errors when they alter the values stored in main memory or machine registers. ECC codes are useful in guarding data that is being stored in main memory or machine registers, and can also be used to guard data while it is being transferred (e.g., over a data bus). ECC codes, however, cannot be used in a straightforward manner to protect against soft errors that may affect circuit logic, such as the Arithmetic and Logic Unit (ALU), the Branch Unit (BU), etc. For these components, the hardware system redundancy and circuit-level testing and redundancy techniques are more effective.
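As an illustration of how an error-correcting code can both detect and undo a single flipped bit in stored data, the following is a minimal sketch of the textbook Hamming(7,4) code. It is offered only as an example of the general idea; the function names are hypothetical, and the codes used in actual memory and register hardware are typically wider and implemented in circuitry rather than software.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword laid out as
    positions 1..7 = p1, p2, d1, p3, d2, d3, d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, syndrome). A nonzero syndrome is the 1-based
    position of the single bit that was flipped and has been corrected."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # undo the single-bit soft error
    return [c[2], c[4], c[5], c[6]], syndrome


def flip_bit(codeword, index):
    """Simulate a soft error by inverting one stored bit (0-based index)."""
    corrupted = list(codeword)
    corrupted[index] ^= 1
    return corrupted
```

For example, encoding the data bits 1, 0, 1, 1, flipping any one of the seven stored bits with flip_bit, and decoding the result recovers the original four data bits. This code corrects only single-bit errors; two flipped bits in the same codeword would be miscorrected unless an additional overall parity bit is added.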
Hardware-level redundancy can guard against the effects of soft errors and other types of failures as well. Systems and subsystems are replicated to detect and possibly recover from errors. This technique depends on the reasonable assumption that errors will occur differently in different replicas. The degree of replication can vary. For a degree of replication of 2, one can detect the effects of soft errors if they alter the results of computation in the replica in which the error occurs. This can be done by simply comparing the outputs of both replicas and declaring an error if the results do not match. One can increase the degree of replication to 3, in which case the “correct” result will be determined by voting. Assuming that one error occurs, it will drive one of the 3 replicas to produce an incorrect output that is different from the correct outputs that are generated by the two other replicas. Thus, a 2-out-of-3 vote can determine the correct output. Unfortunately, hardware redundancy requires deterministic execution by the application, which is not always feasible for modern, multithreaded applications that use a thread library such as POSIX Threads (pthread) or applications written in the Java® programming language developed by Sun Microsystems.
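The comparison and voting steps described above can be sketched in software as follows. This is a simplified illustration under stated assumptions, not a description of any particular hardware: the replicas are modeled as ordinary functions, the names dmr_agree, tmr_vote, replica, and faulty_replica are hypothetical, and the soft error is simulated by inverting one bit of a replica's output.

```python
from collections import Counter


def dmr_agree(output_a, output_b):
    """Degree-2 replication: a mismatch reveals an error,
    but not which replica produced the wrong output."""
    return output_a == output_b


def tmr_vote(outputs):
    """Degree-3 replication: return the value produced by at least
    two of the three replicas, or raise if all three disagree."""
    if len(outputs) != 3:
        raise ValueError("expected exactly three replica outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no two replicas agree")
    return value


def replica(x):
    """Hypothetical deterministic computation run on every replica."""
    return x * x + 1


def faulty_replica(x):
    """The same computation with one output bit flipped by a simulated soft error."""
    return replica(x) ^ 0b100
```

With one faulty replica out of three, tmr_vote([replica(6), faulty_replica(6), replica(6)]) returns the value produced by the two healthy replicas. The sketch also shows why deterministic execution matters: if healthy replicas could legitimately produce different outputs, neither the comparison nor the vote would be meaningful.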
Circuit-level testing and redundancy are used to guard against the effects of soft errors as they relate to logical circuits that are used to compute rather than store information. For example, logical AND or OR gates can be affected by soft errors and produce erroneous results. Circuit-level testing and redundancy can guard against these errors by several techniques, all of which fundamentally depend on recomputing the values on the same circuit or similar circuit to produce the results at different times or places. The idea is that a transient error would affect the results in one of the two computations, and thus by comparing the results of the two computations one can detect the effect of the error if a discrepancy exists. This is similar to the system-level replication, except that it is done at the circuit level. As a result, the detection is done within the time span of executing a single instruction. This method is popular as it masks the effects of errors and simplifies the design of the upper system hardware and software layers.
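The recompute-and-compare idea can be illustrated with a short software analogy. This is a sketch only: real circuit-level schemes operate in hardware within a single instruction, whereas here the "circuit" is a hypothetical FlakyAdder object whose result has one bit flipped on a chosen invocation. The retry loop and its max_attempts parameter are assumptions added for illustration; the text above describes only the detection step.

```python
class FlakyAdder:
    """Hypothetical adder whose result has one bit flipped on a chosen call,
    standing in for a transient fault in a logic circuit."""

    def __init__(self, fault_on_call):
        self.calls = 0
        self.fault_on_call = fault_on_call

    def __call__(self, a, b):
        self.calls += 1
        result = a + b
        if self.calls == self.fault_on_call:
            result ^= 1  # simulated soft error in the low-order bit
        return result


def compute_twice_and_compare(op, *args, max_attempts=3):
    """Run op twice at different times and accept the result only if both
    runs agree; a transient fault should disturb at most one of the runs."""
    for _ in range(max_attempts):
        first = op(*args)
        second = op(*args)
        if first == second:
            return first
    raise RuntimeError("results never agreed; the fault may not be transient")
```

If the fault strikes during the first invocation, the first pair of results disagrees, the pair is recomputed, and the agreeing value is returned. Upper layers see only the final, consistent result, which is the masking effect noted above.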
The existing approaches have several shortcomings including cost, inefficiency, and rigidity. With respect to cost, adding redundancy at the hardware level or through system-level replication increases the cost of the design, test, manufacture, and deployment. Cost escalates because of the additional components that are needed to execute the circuit self-test, comparisons, and recomputations. These extra components also reduce the yield that we receive on semiconductor chip fabrication, and thus increase cost further.
Regarding the inefficiency of existing methods, the additional hardware and built-in tests reduce the speed of the machine at the lowest level, forcing circuit designers to use slower components and architectures. Existing methods also fail to exploit new features that can be used to implement redundancy at higher levels, such as simultaneous multi-threading (SMT) and multi-core chip design at the hardware level. It would therefore be desirable to implement more efficient error detection and recovery techniques at a higher level and thereby reduce the implementation overhead at the hardware level.
With respect to the rigidity of existing approaches, conventional error detection techniques do not generally reflect the actual deployment environment. For instance, system-level redundancy requires deterministic execution, which is very difficult to guarantee in real systems. These methods also fail to recognize that errors can occur at different rates in different environments, and that the importance of reliability in an application depends on its criticality. It is recognized that soft errors, for instance, occur more frequently at high altitudes than at sea level. Additionally, one would assume that it is more important to secure mission-critical applications than to secure entertainment programs against soft errors. Thus, it would be desirable if error detection and recovery could be adapted to offer a tradeoff between performance and cost on the one hand and the desired degree of error coverage and recovery on the other.