Use of data processing systems has grown exponentially in recent years with the increased use of computing devices. Users have come to rely on data processing systems in every aspect of business and society. With this reliance, preventing soft errors has become increasingly important to a system's overall performance.
As technology feature sizes continue to shrink due to semiconductor advancements, opportunities exist for continued performance improvement, for example in microprocessor design. At the same time, however, as devices get smaller and smaller, there is a growing concern about the susceptibility of future-generation computing systems to soft and transient errors. Soft and transient errors are generally caused by alpha particles and cosmic rays and, to some extent, by power and voltage glitches. Alpha particles or neutrons (as in cosmic rays) hitting transistors generate electron-hole pairs, which the devices may collect. A soft error happens when the amount of charge collected by a device exceeds the device's critical charge, often referred to as Qcrit, which is the charge required to upset the device from its normal operation. Soft errors appear as current pulses in transistors. They may cause errors in combinational logic or cause bit flips in random access memory (RAM) cells. Historically, soft errors were of great concern only to outer space applications, where cosmic rays are strong, and to random access memories designed with very small Qcrit. However, technology projections indicate that the average Qcrit per transistor will decrease by a factor of two for each new generation of technology as transistors get smaller and smaller. Hence, it is expected that the failures in time (FIT) rate for the typical microprocessor will increase very quickly as future technology advances toward further device miniaturization. Even if the average Qcrit per transistor for storage and logic cells were to remain the same, it is clear that with increasing miniaturization, more and more transistors will fit into the space that had hitherto been occupied by one or two transistors. Hence, the incidence of soft failures per fixed circuit area is bound to increase.
To protect computer systems against soft errors in general, many fault tolerance approaches have traditionally been used to detect and possibly correct errors. These approaches basically comprise information redundancy and execution redundancy. Data storage structures (e.g. SRAMs, register arrays, and queues) within a microprocessor chip, due to their regular patterns, tend to be protected by well-known information redundancy techniques such as parity protection and error correcting codes (ECC). Combinational logic structures (e.g. ALUs, FXUs, and FPUs) within a processor chip, on the other hand, have irregular patterns, which makes it necessary to protect them through execution redundancy. Execution redundancy can be further divided into time redundancy and space redundancy. Space redundancy is achieved by executing a task or instruction on multiple disjoint hardware structures and comparing the results for accuracy. Space redundancy generally has low performance overhead but requires hardware in proportion to the number of disjoint computations. Time redundancy is achieved by executing a task or instruction on the same hardware multiple times and comparing the results for accuracy. Time redundancy generally has low hardware overhead but high performance overhead. However, given the increasingly adverse effects of leakage power on a microprocessor chip's general health, which weigh against adding redundant hardware, time redundancy remains a good option for protecting a system against errors.
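As a rough software illustration of the two families of techniques described above, the following Python sketch shows a single even-parity bit (information redundancy) and a run-twice-and-compare wrapper (time redundancy). The function names and the retry limit are assumptions made for illustration only; they do not come from any cited work, and real processors implement such checks in hardware rather than software.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def even_parity_bit(word: int) -> int:
    """Return the bit that makes the total count of 1s (word plus parity) even."""
    return bin(word).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """Detect any odd number of flipped bits in `word` (illustrative check)."""
    return even_parity_bit(word) == stored_parity


def run_with_time_redundancy(task: Callable[[], T], max_retries: int = 3) -> T:
    """Execute `task` twice on the same 'hardware' and accept only matching results.

    `max_retries` is a hypothetical limit chosen for this sketch.
    """
    for _ in range(max_retries):
        first = task()
        second = task()
        if first == second:
            return first  # results agree, so assume no transient fault occurred
    raise RuntimeError("results never agreed; fault may not be transient")
```

Note that the time-redundant wrapper needs no extra storage or functional units but at least doubles execution time, which mirrors the hardware/performance trade-off described above.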
It is therefore not surprising that there have been many recent time redundancy-based approaches to microprocessor error detection and correction, for example approaches that in some form use the general concept of redundant multithreading. Generally, redundant threading provides fault tolerance in a microprocessor by executing the given task or application using two separate threads in simultaneous multithreading (SMT), concurrent redundant threading (CRT), or chip multiprocessor (CMP) environments, comparing the two (leading/master and trailing/slave) threads at some specified point in the pipeline, and, where there are disagreements, flushing the microprocessor pipeline and rolling back to a previously verified and saved checkpoint to restart the computation. These various redundant threading mechanisms target idle processing bandwidth either at a fine-grain level, in the form of processing slots left unused in each cycle due to limited instruction level parallelism (ILP), or at a coarse-grain level, due to long-latency events such as level two (L2) cache misses. As expected, hardware implementations of these methods typically result in a high on-chip area overhead (estimated to be about 35%) and could result in a performance overhead of about 40% as well.
These redundant threading soft error protection solutions all require that the processor be able to roll back to a previous checkpoint and redo some computations when a miscomparison occurs. See E. Rotenberg, “AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors”, Proceedings of the 29th Annual International Symposium on Fault-Tolerant Computing, June 1999; S. K. Reinhardt and S. S. Mukherjee, “Transient Fault Detection via Simultaneous Multithreading”, Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000; M. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz, “Transient-Fault Recovery for Chip Multiprocessors”, Proceedings of the 30th Annual International Symposium on Computer Architecture, June 2003; T. N. Vijaykumar, I. Pomeranz, and K. Cheng, “Transient-Fault Recovery via Simultaneous Multithreading”, Proceedings of the 29th Annual International Symposium on Computer Architecture, May 2002.
All of these widely discussed techniques tend to confine error checking to the processor core, since the recovery abilities of a normal processor are often confined to the instruction window. To recover beyond the instruction window, therefore, a form of cache state buffering is necessary. However, if every store or load from the leading thread has to be compared against its counterpart from the trailing thread in a redundant threading execution before it is committed to cache memory, the two threads would have to be very tightly synchronized at all times during execution, causing a high performance overhead and undue contention for microprocessor resources. Given that soft errors do not happen very frequently, it would be useful to execute the program at full speed for a certain number of instructions before stopping for a soft error check. Error checking can be done by comparing the architectural states of two threads or two cores (including the cache state). Once an error is detected, the processor must be able to roll back to the previous checkpoint, which may be thousands of instructions earlier, and that requires buffering all of the intermediate state data.
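The coarse-grain scheme described above (run at full speed for an interval, compare states, and roll back on a mismatch) can be sketched in software as follows. This is a minimal illustration under stated assumptions, not an implementation from any cited work: the architectural state is modeled as a Python dictionary, instructions as functions that mutate it, and `run_redundant`, `fault_hook`, `interval`, and `max_rollbacks` are hypothetical names introduced here.

```python
import copy
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]
Instruction = Callable[[State], None]


def run_redundant(
    program: List[Instruction],
    initial: State,
    interval: int,
    fault_hook: Optional[Callable[[int, State], None]] = None,
    max_rollbacks: int = 10,
) -> Tuple[State, int]:
    """Run two copies of `program`, comparing their states every `interval` instructions.

    `fault_hook` (hypothetical) may corrupt the leading copy to model a soft error.
    Returns the final verified state and the number of rollbacks performed.
    """
    checkpoint = copy.deepcopy(initial)  # last verified state
    leading = copy.deepcopy(initial)
    trailing = copy.deepcopy(initial)
    pc = 0                               # index of the first unverified instruction
    rollbacks = 0
    while pc < len(program):
        end = min(pc + interval, len(program))
        for i in range(pc, end):         # run the whole interval without checking
            program[i](leading)
            if fault_hook is not None:
                fault_hook(i, leading)
            program[i](trailing)
        if leading == trailing:          # states agree: take a new checkpoint
            checkpoint = copy.deepcopy(leading)
            pc = end
        else:                            # mismatch: discard the interval and redo it
            rollbacks += 1
            if rollbacks > max_rollbacks:
                raise RuntimeError("persistent mismatch; fault may not be transient")
            leading = copy.deepcopy(checkpoint)
            trailing = copy.deepcopy(checkpoint)
    return leading, rollbacks
```

In this sketch the deep copies play the role of the buffered intermediate state: nothing computed after the last checkpoint is treated as verified until the two copies agree, so a larger `interval` means fewer comparisons but more work discarded on each rollback.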
The current first level (L1) cache structure and organization in a microprocessor is not conducive to the coarser-grain redundant threading described in the preceding paragraph, since intermediate data must be buffered to avoid corrupting or destroying a saved checkpoint state, and, if there is a need to roll back and recompute, the intermediate data must be squashed efficiently. T. N. Vijaykumar, S. Gopal, J. E. Smith, and G. Sohi, “Speculative Versioning Cache”, Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, February 1998, proposes a speculative cache organization, the speculative versioning cache, for thread-level speculation (TLS) purposes. In TLS, the main requirement is to either commit or squash a speculative version of a cache. Redundant threading and similar techniques for soft error protection, however, further demand an efficient comparison (or checking) among the multiple cache versions in order to detect and correct soft errors during execution.
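To make the three requirements named in this paragraph concrete (buffering intermediate data, squashing it, and comparing versions), the following Python sketch models a cache whose writes are held in a pending buffer until they are verified. It is a toy illustration only and does not describe the speculative versioning cache or any actual hardware organization; the class `BufferedCache` and its method names are hypothetical.

```python
from typing import Dict


class BufferedCache:
    """Toy cache model: committed data plus a buffer of unverified writes."""

    def __init__(self) -> None:
        self.committed: Dict[int, int] = {}  # state as of the last checkpoint
        self.pending: Dict[int, int] = {}    # writes made since that checkpoint

    def write(self, addr: int, value: int) -> None:
        # Buffer the write so the checkpointed data is never overwritten.
        self.pending[addr] = value

    def read(self, addr: int) -> int:
        # The newest buffered value wins; otherwise fall back to committed data.
        # Unwritten addresses read as 0 in this sketch.
        if addr in self.pending:
            return self.pending[addr]
        return self.committed.get(addr, 0)

    def matches(self, other: "BufferedCache") -> bool:
        # Compare the two versions' buffered writes (the error check).
        return self.pending == other.pending

    def commit(self) -> None:
        # Verified: fold buffered writes into the checkpointed state.
        self.committed.update(self.pending)
        self.pending.clear()

    def squash(self) -> None:
        # Error detected: drop buffered writes, restoring the checkpoint.
        self.pending.clear()
```

A leading and a trailing thread would each own one such object; after an interval, `matches` decides whether both call `commit` or both call `squash`. A TLS-style cache needs only the last two operations, whereas the comparison step is the additional requirement that soft error protection introduces.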
In addition, when a soft error occurs and the microprocessor needs to roll back to a previous checkpoint, that is, when the cache state as of that checkpoint must be recovered, known L1 cache memory structures are not capable of doing so, because data in the L1 cache will already have been overwritten by writes from the processor. What is needed, therefore, is an efficient system and method for detecting soft errors with correction or rollback capabilities.