Use of data processing systems has grown exponentially in recent years with the increased use of computing devices, and users have come to rely on data processing systems in every aspect of business and society. Given this reliance, preventing potentially undetectable errors in a microprocessor during program execution has become increasingly important to a system's overall performance.
As technology feature sizes continue to shrink with semiconductor advancements, microprocessor design can continue to improve performance. At the same time, however, as devices get smaller and smaller, there is a real and growing concern about future-generation computing systems' susceptibility to soft and transient errors. Soft and transient errors are generally caused by alpha particles and cosmic rays and, to a lesser extent, by power and voltage glitches. When alpha particles or neutrons (as in cosmic rays) strike transistors, electron-hole pairs are generated, and the resulting charge may be collected by nearby devices. A soft error occurs when the amount of charge collected by a device exceeds the device's critical charge, often referred to as Qcrit, required to upset the device from its normal operation. Soft errors appear as current pulses in transistors; they can cause errors in combinational logic or bit flips in random access memory (RAM) cells.
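The Qcrit upset condition described above can be sketched as a simple threshold model. The charge values and the notion of comparing a single collected-charge number against Qcrit are illustrative assumptions for exposition, not real device parameters.

```python
# Minimal sketch of the Qcrit upset model: a node suffers a soft error
# when the charge collected from a particle strike exceeds its critical
# charge Qcrit. All values (in femtocoulombs) are hypothetical.

def bit_flips(collected_charge_fc, qcrit_fc):
    """A device upsets (bit flip / current pulse) when collected charge > Qcrit."""
    return collected_charge_fc > qcrit_fc

# Hypothetical particle strikes on a cell with Qcrit = 15 fC.
strikes_fc = [4.0, 9.5, 16.2, 30.1]
upsets = [q for q in strikes_fc if bit_flips(q, 15.0)]
print(upsets)  # only the strikes above Qcrit cause soft errors
```

As the model suggests, shrinking Qcrit widens the set of strikes capable of causing an upset.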
Historically, soft errors were of great concern only for outer-space applications, where cosmic rays are strong and random access memories were designed with very small Qcrit. However, technology projections indicate that the average Qcrit per transistor will drop by a factor of two with each new technology generation as transistors get smaller and smaller. Hence, the failure-in-time ("FIT") rate of the typical microprocessor is expected to rise rapidly as device miniaturization proceeds. Even if the average Qcrit per storage or logic cell were to remain the same, with increasing miniaturization more and more transistors will fit into the space hitherto occupied by one or two transistors. Hence, the incidence of soft failures per fixed circuit area is bound to increase.
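The scaling argument above can be made concrete with back-of-envelope arithmetic: Qcrit halving each generation while transistor density roughly doubles. The starting value and the exact density factor are illustrative assumptions, not projections from any specific roadmap.

```python
# Illustrative sketch of the per-generation scaling trend described in
# the text: Qcrit halves each generation, while density (transistors in
# a fixed area) doubles. The 20 fC starting point is a made-up number.

def qcrit_after(generations, qcrit0_fc=20.0):
    """Average Qcrit after N generations, halving each generation."""
    return qcrit0_fc / (2 ** generations)

def relative_density(generations):
    """Relative transistor count in a fixed circuit area after N generations."""
    return 2 ** generations

for g in range(4):
    print(f"gen {g}: Qcrit ~ {qcrit_after(g):.2f} fC, "
          f"density x{relative_density(g)}")
```

Even under these rough assumptions, after three generations each transistor is roughly eight times easier to upset and eight times as many of them occupy the same area, compounding the soft-failure rate per fixed circuit area.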
To protect computer systems against soft errors, many fault-tolerance approaches have traditionally been used to detect and, where possible, correct errors. These approaches fall broadly into information redundancy and execution redundancy. Data storage structures within a microprocessor chip (e.g., SRAMs, register arrays, and queues), owing to their regular patterns, tend to be protected by well-known information redundancy techniques such as parity protection and error correcting codes (ECC). Combinational logic structures within a processor chip (e.g., ALUs, FXUs, and FPUs), on the other hand, have irregular patterns, which makes it necessary to protect them through execution redundancy.
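The information-redundancy idea can be illustrated with the simplest scheme mentioned above, a single even-parity bit: it detects any one-bit flip in a stored word, though it cannot correct it (correction requires more check bits, as in ECC schemes). This is a minimal sketch, not a description of any particular processor's parity logic.

```python
# Sketch of parity protection: one even-parity check bit is stored
# alongside a data word; a single soft-error bit flip changes the
# word's parity and is therefore detected on readout.

def parity(word):
    """Even parity (XOR of all bits) of a non-negative integer word."""
    p = 0
    while word:
        p ^= word & 1
        word >>= 1
    return p

stored = 0b1011_0010
check = parity(stored)           # kept alongside the data

corrupted = stored ^ (1 << 5)    # a soft error flips one bit
print(parity(corrupted) != check)  # True: the single-bit flip is detected
```

Parity alone cannot say which bit flipped; ECC codes such as SECDED extend this idea with multiple check bits so that single-bit errors can also be corrected.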
Execution redundancy can be further divided into time redundancy and space redundancy. Space redundancy is achieved by executing a task or instruction on multiple disjoint hardware structures and comparing the results for accuracy. Space redundancy generally has low performance overhead but requires hardware in proportion to the number of disjoint computations.
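A common form of space redundancy runs the same operation on three disjoint units and takes a majority vote, so a fault in any single unit is masked. The "units" below are plain functions and the injected fault is hypothetical; this is a sketch of the voting idea, not of any real execution pipeline.

```python
# Sketch of space redundancy with majority voting: the same operation
# is issued to multiple disjoint "hardware units" (modeled as callables)
# and the results are compared; a single faulty unit is outvoted.

from collections import Counter

def vote(units, *args):
    """Run every redundant unit on the same inputs and return the majority result."""
    results = [u(*args) for u in units]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority: uncorrectable error")
    return value

good = lambda a, b: a + b
faulty = lambda a, b: a + b + 1   # models a unit struck by a soft error

print(vote([good, good, faulty], 2, 3))  # 5: the faulty unit is outvoted
```

The hardware cost is visible directly in the call: three units are paid for to tolerate one fault, matching the text's point that space redundancy needs hardware in proportion to the number of disjoint computations.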
Time redundancy is achieved by executing a task or instruction on the same hardware multiple times and comparing the results for accuracy. Time redundancy generally has low hardware overhead but high performance overhead; nevertheless, given the adverse effects of leakage power trends on a microprocessor chip's general health, time redundancy remains a good option for protecting a system against errors. It is not uncommon to see a mix of information redundancy and time redundancy implementations providing reliability protection in high-end microprocessor cores.
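Time redundancy can be sketched as issuing the same computation twice on the same unit and comparing: because a transient fault rarely repeats identically, a mismatch signals an error and the work is re-executed. The retry policy and the function-as-unit model are illustrative assumptions.

```python
# Sketch of time redundancy: execute the same operation twice on the
# same (hypothetical) unit and compare. A mismatch indicates a
# transient fault on one pass, and the pair is retried; a persistent
# mismatch suggests a hard (non-transient) error.

def run_with_time_redundancy(op, *args, retries=3):
    for _ in range(retries):
        first = op(*args)
        second = op(*args)    # re-executed on the same hardware, later in time
        if first == second:   # transient errors rarely strike both passes alike
            return first
    raise RuntimeError("persistent mismatch: likely a hard error")

print(run_with_time_redundancy(lambda a, b: a * b, 6, 7))  # 42
```

The performance cost the text mentions is explicit here: every result is computed at least twice on the same hardware, trading throughput for detection without adding a second execution unit.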
Soft error reliability support in server microprocessors has evolved from the era of entire replication of processing units to the current single-core approach, in which pervasive detection is supported by a multi-stage auxiliary Recovery unit (or R-unit for short) pipeline that stores checkpointed states of the processor's execution. Whereas the former approach suffered about 40% area overhead, the latter shows about 15%. The latter approach, though cheaper in area, still carries that overhead and can be disadvantaged in error coverage, which depends greatly on how much detection support can be provided in the core pervasives. In this emerging leakage- and yield-sensitive design era, both of these approaches appear non-scalable for reliability, availability and serviceability ("RAS") support, especially when one considers the very basic fact that the average number of threads running simultaneously on a core is increasing rapidly.
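The checkpoint-and-recover idea behind an R-unit can be sketched as follows: architected state is checkpointed at known-good boundaries, and when detection flags an error, execution rolls back to the last good checkpoint. The state layout, the single-register example, and the commit/rollback interface are illustrative assumptions, not the design of any actual R-unit pipeline.

```python
# Sketch of R-unit-style checkpointed recovery: the architected state
# is snapshotted when detection passes ("commit"); on a detected error,
# the state is restored from the last good snapshot ("rollback").

class RecoveryUnit:
    def __init__(self, state):
        self.state = dict(state)        # current architected state
        self.checkpoint = dict(state)   # last known-good snapshot

    def commit(self):
        """Detection passed: current state becomes the new checkpoint."""
        self.checkpoint = dict(self.state)

    def rollback(self):
        """Error detected: restore the last checkpointed state."""
        self.state = dict(self.checkpoint)

r = RecoveryUnit({"r1": 0})
r.state["r1"] = 7
r.commit()                 # detection clean up to here
r.state["r1"] = 99         # a later update hit by a detected soft error
r.rollback()               # recover to the last good checkpoint
print(r.state["r1"])       # 7
```

The sketch also makes the coverage point visible: rollback only helps for errors the pervasive detection logic actually catches before the corrupted state is committed.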