An important element in creating a highly reliable computing system is the handling of errors such as hard errors and soft errors. Soft errors occur when alpha particles or cosmic rays strike an integrated circuit and alter the charges stored on the voltage nodes of the circuit. If the charge alteration is sufficiently large, a voltage representing one logic state may be changed to a voltage representing a different logic state. For example, a voltage representing a logic true state may be altered to a voltage representing a logic false state, and any data that incorporates the logic state may be corrupted. This is also referred to as data corruption.
Soft error rates (SERs) for integrated circuits, such as microprocessors (“processors”), increase as semiconductor process technologies scale to smaller dimensions and lower operating voltages. Smaller process dimensions allow greater device densities to be achieved on the processor die. This greater density increases the likelihood that an alpha particle or cosmic ray will strike one of the processor's voltage nodes. Lower operating voltages mean that smaller charge disruptions may alter the logic states represented by the node voltages. Both trends point to higher SERs in the future. Consequently, soft errors should be handled appropriately to avoid data corruption and other errors that may be caused by soft errors.
Hard errors occur when components or devices in a computer system malfunction. Components or devices in a computer system can be damaged in a number of ways, such as by voltage fluctuations, power surges, lightning and heat. If these hard errors are not discovered and corrected, data corruption along with a complete system failure is likely.
The process of error handling consists of error detection and error recovery. Error detection is typically accomplished in the processor or system logic hardware through the addition of parity check bits in the memory arrays, buses and data paths.
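As an illustration of parity-based detection, the following is a minimal software sketch, assuming a single even-parity bit stored alongside each data word. The function names `parity_bit` and `check` are hypothetical and chosen only for this example; in practice the check is performed in hardware, not software.

```python
def parity_bit(word):
    """Return the even-parity bit for a non-negative integer word.

    The bit is 1 when the word holds an odd number of 1 bits, so that
    the word plus its parity bit always holds an even number of 1 bits.
    """
    return bin(word).count("1") % 2


def check(word, stored_parity):
    """Return True if the word still matches the parity bit stored with it."""
    return parity_bit(word) == stored_parity


# Store a data word together with its parity bit.
data = 0b10110010
stored_parity = parity_bit(data)

# An undisturbed word passes the check.
intact_ok = check(data, stored_parity)

# A single flipped bit (as a soft error might cause) fails the check.
single_flip = data ^ (1 << 3)
single_flip_ok = check(single_flip, stored_parity)

# Two flipped bits cancel out in the parity sum and go undetected.
double_flip = data ^ 0b11
double_flip_ok = check(double_flip, stored_parity)
```

A single parity bit detects any odd number of flipped bits but cannot locate or correct them, and it misses an even number of flips. Detection therefore has to be paired with a separate recovery step.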
Error recovery involves two goals, error containment and system availability, which often conflict with each other. Error containment is preventing an error from propagating to other computer devices, components or system logic. System logic is the portion of the logic in a computer system that enables the processor, memory and input/output (IO) devices to work together.
Computer systems often reboot in an attempt to contain errors. While rebooting, the computer system is not available. Frequent rebooting of personal computers may be somewhat acceptable even though it is highly annoying. However, frequent rebooting of high availability systems, such as system servers, is not acceptable. System servers, such as mail servers and network servers, are generally relied on to run critical applications in a non-stop fashion.
Another consideration in error recovery is the error recovery time. The error recovery time is the time it takes for error recovery to be completed. While error recovery is being performed, operating systems lose control of the computer system. Many modern operating systems, such as Windows NT and Unix, cannot tolerate a loss of control of the system for a significant time while the system is going through error recovery.
Multiple processor (MP) computer systems further complicate the problems of error recovery and error recovery time. In MP computer systems, different processors are executing different processes. One or more of the processors may encounter the error, but all of the processors can be affected. Generally, MP computer systems lack a coordinated approach to error recovery. This lack of appropriate error handling can cause MP computer systems to reboot unnecessarily and data to be corrupted.
Additionally, current error handling provides only limited error information, without any specific format. In many cases, it provides no error information at all. Forcing a computer system to reboot is bad enough, but rebooting without obtaining information about the error that caused the reboot is even worse.
Not all errors encountered in a computer system are recoverable. Even so, current error handling fails to provide enough information about the errors that occur.
For the reasons stated above, and for other reasons stated below which will become apparent to those skilled in the art upon reading and understanding the present specification, there is a need in the art for a computer system that handles errors in a coordinated manner.