1. Field of the Invention
The disclosures herein relate generally to test systems, and more particularly, to a methodology and apparatus for testing software program susceptibility to soft hardware errors.
2. Description of the Related Art
An information handling system (IHS) may include a processor integrated circuit (IC) for processing, handling, communicating or otherwise manipulating information. Modern IHSs often include ICs that incorporate several components integrated together on a common semiconductor die. Some IHSs operate as test systems or test managers that evaluate the functionality and performance characteristics of IC designs during the development process of the IC design. A device under test (DUT) is another name for an IC design on which a test system conducts tests.
During operation, ICs may experience hard errors or soft errors. Hard errors are IC faults that persist over time. For example, an IC may experience a short circuit or an open circuit that does not go away with time. In contrast, a soft error is an error that may occur once and then not recur over time. For example, a cosmic ray or alpha particle may pass through a latch in the IC and cause the latch to change state or “flip”. Noise in a circuit adjacent the IC may also cause a soft error.
Unfortunately, soft error rate (SER) is increasing in today's ICs due to higher device density in these ICs. Lower IC operating voltage also makes an IC more susceptible to soft errors, thus causing higher SER than in the past. Arrays within ICs, such as memory and caches, are susceptible to soft errors. Combinational logic within ICs is also susceptible to soft errors. A conventional way to deal with increasing SER in memory arrays is to employ error correction code (ECC) memory and scrubbing. However, increasing SER in the logic and data flow paths of ICs is a more complex problem. One approach is to employ redundancy in the logic to decrease or correct for SER. However, redundancy is a difficult and costly solution.
It is frequently hard to determine the SER of an IC or system of ICs. One way to perform an SER determination is to actually fabricate the IC or IC system. After fabrication of the IC system, specialized test apparatus may bombard the IC system with cosmic rays and alpha particles in a laboratory environment to create faults or errors. Test apparatus measures the SER of the IC system while bombardment continues. Unfortunately, this approach requires completion of the IC design and fabrication of the actual hardware of the IC system prior to testing. This approach undesirably limits the amount of controllability and observability of the IC design during experimentation.
Another way to determine SER effects is by fault injection into a software simulation or software model of a particular IC hardware design. Unfortunately, this software simulation model approach may be very slow. The size of the software model is also typically limited such that the software model may include just a portion of the IC design rather than the entire IC design when the IC is very large.
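By way of illustration only, fault injection into a software model may be sketched as follows. The sketch is a minimal example under stated assumptions and is not the method of any particular test system: the names `flip_bit`, `toy_model` and `inject_faults` are hypothetical, and `toy_model` is a trivial stand-in for a model of an IC design in which only the low four bits of one input affect the output.

```python
import random


def flip_bit(value: int, bit: int) -> int:
    """Return value with the given bit position inverted (a single-bit upset)."""
    return value ^ (1 << bit)


def toy_model(a: int, b: int) -> int:
    """Hypothetical stand-in for a design model.

    Only the low 4 bits of `a` influence the 8-bit result, so upsets in the
    upper bits of `a` are masked by construction.
    """
    return ((a & 0x0F) + b) & 0xFF


def inject_faults(a: int, b: int, width: int = 8, trials: int = 1000, seed: int = 0):
    """Flip one randomly chosen bit of `a` per trial and compare against a fault-free run.

    Returns (masked, corrupted): the number of trials whose output matched the
    fault-free ("golden") output, and the number whose output differed.
    """
    rng = random.Random(seed)  # seeded so that a run is repeatable
    golden = toy_model(a, b)
    masked = 0
    corrupted = 0
    for _ in range(trials):
        bit = rng.randrange(width)
        faulty = toy_model(flip_bit(a, bit), b)
        if faulty == golden:
            masked += 1
        else:
            corrupted += 1
    return masked, corrupted


if __name__ == "__main__":
    masked, corrupted = inject_faults(a=0x35, b=0x01)
    print(f"masked={masked} corrupted={corrupted}")
```

In this toy case a flip in any of the upper four bits of `a` leaves the output unchanged, while a flip in any of the lower four bits alters it. A model of an actual IC design is far larger, which is why the simulation approach described above may be very slow.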
Soft error rates in logic have become a threat to the reliable and continuous operation of systems. A characteristic of SER that is both beneficial and challenging is “derating”. Not every flipped bit is hazardous. Many simply vanish without consequence, and others are caught by hardware and software checkers. The number of upsets that become machine checkstops or silent data corruption events can be very small. The derating is the ratio of bit flips to dangerous events. If every bit flip in a latch or combinational logic circuit had to be counted in the system failure rate, the resulting failure rate would make systems unusable. Fortunately, derating can be made large enough in well-constructed designs that SER targets can be met.
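The ratio defined above may be illustrated with a short calculation. This is a sketch only: the function names and the sample counts are hypothetical and are chosen for arithmetic convenience, not taken from any measurement.

```python
def derating_factor(bit_flips: int, dangerous_events: int) -> float:
    """Ratio of bit flips to dangerous events (checkstops or silent data corruption)."""
    if bit_flips < 0 or dangerous_events < 0:
        raise ValueError("counts must be non-negative")
    if dangerous_events == 0:
        # No flip became dangerous; the ratio is unbounded for this sample.
        return float("inf")
    return bit_flips / dangerous_events


def derated_failure_rate(raw_upset_rate: float, derating: float) -> float:
    """Scale a raw upset rate down by the derating to estimate the dangerous-event rate."""
    return raw_upset_rate / derating


if __name__ == "__main__":
    # Hypothetical sample: 1000 bit flips, 4 of which became dangerous events.
    d = derating_factor(1000, 4)
    print(d)                               # 250.0
    print(derated_failure_rate(500.0, d))  # 2.0
```

A larger derating therefore corresponds to a smaller fraction of upsets that contribute to the system failure rate.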
The challenge with derating is that there does not currently exist an accurate means for predicting it (i.e., pre-production prediction of SER). Currently, the best that can be done is to validate SER after the hardware becomes available. This is achieved by accelerated testing, using particle beams to measure cosmic ray effects and derating, and hot underfill (HUF) to measure alpha particle effects. However, these methods do not work before the hardware is available. An additional limitation is that it is difficult to assess whether an event was derated by the hardware, or alternatively by a software application.
It would thus be desirable to understand derating factors for both hardware and software when assessing the SER risk of a micro-architecture hardware design. Commonly assigned patent application Ser. No. 12/022,869, filed on Jan. 30, 2008, which issued as U.S. Pat. No. 8,073,668 on Dec. 6, 2011, provides a system and method for determining the derating factors associated with hardware. The following description sets forth a system and method for determining the derating factors associated with software.