1. Field of the Invention
The present invention relates to error-detection techniques in computer systems. More specifically, the present invention relates to a method and apparatus for using in-situ self-sensing in combination with parity-space detection to distinguish between soft errors and the onset of hardware degradation in computer systems.
2. Related Art
Cosmic neutrons are often responsible for causing transient errors, also called soft errors or correctable errors (CEs), in integrated circuit (IC) logic and memory components. Two trends are causing the incidence of these soft errors to increase with each new generation of IC logic and memory components: (1) the density of memory cells continues to increase exponentially, thereby providing many more “targets” for each cosmic neutron; and (2) supply voltage is decreasing, thereby making these components more susceptible to cosmic neutron events. (Note that the cross-section for cosmic neutron events increases exponentially with the inverse of voltage.)
Changes in soft error rates (SERs) can signify the onset of hardware degradation. To improve the reliability, availability, and serviceability (RAS) of a computer system and to predict the onset of hardware degradation, the SER of a computer system can be monitored using a soft-error rate discrimination (SERD) technique to determine if the SER is increasing. However, a SERD technique that gives too many false alarms can create customer dissatisfaction and lead to excessive “No-Trouble-Found” (NTF) events. Therefore, a technique that facilitates accurately distinguishing between soft errors and the onset of hardware degradation is highly desirable.
One technique for distinguishing between soft errors and the onset of hardware degradation is to compare the cosmic neutron events reported by a neutron detector with CE events reported by a computer system. Unfortunately, a neutron detector is expensive, which makes it impractical to incorporate such a neutron detector into every computer system.
Another technique for distinguishing between soft errors and the onset of hardware degradation is to assign a threshold to CE events. A fixed-threshold SERD technique assumes that the cosmic neutron flux is a stationary process in time. Unfortunately, the cosmic neutron background is not stationary in time, but instead has large dynamic variations (such as peaks and troughs) that are superimposed on long-term variations. The short-term spikes as well as the long-term variations result from variations in sunspot activity and other cosmic events. These events can cause dynamic variations of as much as a factor of six in hourly cosmic neutron flux levels at sea level (and even larger variations at higher altitudes). In addition to short-term fluctuations that are attributable to the “burstiness” of cosmic events, there are also systematic long-term variations that occur over the course of weeks, and an additional 20% long-term variation that correlates with the 11-year sunspot cycle.
These inherent dynamic variations in soft error likelihood impose a fundamental limit on the sensitivity with which changes in SER can be detected. If there is no way to dynamically adjust the likelihood for soft error events, the threshold for SERD must be set above the levels attained by the highest daily peaks in cosmic flux. However, if a change in SER occurs during the “troughs” in cosmic activity, the SERD technique will be insensitive to these changes.
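To make this sensitivity limit concrete, the following sketch computes how large an increase in SER must be before a fixed threshold registers it. The numbers are illustrative assumptions, not measurements: the trough rate of 1 CE event per day is hypothetical, and the peak-to-trough ratio of six is borrowed from the hourly flux variation cited above. The function name `min_detectable_increase` is likewise invented for this example.

```python
def min_detectable_increase(baseline_rate, threshold_rate):
    """Smallest multiplicative rise over the baseline CE rate that reaches the threshold."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return threshold_rate / baseline_rate


# Assumed values for illustration only.
trough_rate = 1.0              # hypothetical CE events per day during a trough
peak_rate = 6.0 * trough_rate  # assumed factor-of-six peak-to-trough variation
threshold = peak_rate          # a fixed threshold must sit at or above the peak

# How much the SER must grow before the fixed threshold is reached.
factor_in_trough = min_detectable_increase(trough_rate, threshold)
factor_at_peak = min_detectable_increase(peak_rate, threshold)
```

Under these assumptions, a change in SER that begins during a trough must grow to roughly six times the background rate before the fixed threshold is reached, whereas at a peak any increase at all reaches it.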
A second challenge that affects both the conventional SERD and Sequential Probability Ratio Test (SPRT) techniques is dealing with acceleration of SER due to altitude. Due to less atmospheric attenuation of cosmic particles at high altitudes, there can be as much as a 70% acceleration in cosmic neutron flux between a datacenter at sea level and a datacenter at higher altitude. A constant threshold that is tuned for sea level can therefore give excessive false alarms at high altitudes. Conversely, if a constant-threshold “leaky bucket” technique is adjusted so as to not give excessive false alarms for datacenters at high altitudes, the technique fails to catch the onset of hardware degradation for customers at sea level.
Yet another technique for distinguishing between soft errors and the onset of hardware degradation is to use an “N over T” (N/T) threshold, also called a “leaky bucket” technique. If there are N events within some time interval T, then the memory is declared faulty and is replaced. Typical values of N/T range from 3 CE events in 24 hours to 24 CE events in 24 hours.
Unfortunately, cosmic events are not stationary with time. More specifically, there can be significant peaks and troughs in cosmic activity. Furthermore, these variations can increase memory NTF events, which are costly in terms of the hardware exchanged, serviceability costs, and customer dissatisfaction. Note that when memory is replaced due to normal cosmic neutron events, the new memory is just as likely to exhibit CEs as the replaced memory.
Hence, what is needed is a method and an apparatus for distinguishing between soft errors and the onset of hardware degradation without the problems described above.