The increasing complexity and sophistication of present-day hardware systems have led to an increase in the opportunity for operating errors. Many computing system failures stem from hardware errors. Processors, caches, and memories are becoming larger, faster, and denser, while being increasingly used in adverse environments, such as at high altitudes, in space, and in industrial applications. Hardware errors may be characterized as hard errors and transient (soft) errors. Hard errors are those that require replacement (or relinquished use) of a component. Typically, such errors are the product of physical damage. Transient or soft errors are those that result in an invalid state in the hardware that is normally correctable. A typical processor's silicon can have a soft-error rate of 4000 FIT (1 FIT equals 1 failure in 10^9 hours), of which approximately 50% will affect processor logic and 50% the large on-chip cache. Due to increasing speeds, denser technology, and lower voltages, these errors are likely to become more probable than other single hardware component failures.
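To put the quoted soft-error rate in more familiar terms, the following is a minimal sketch of the FIT arithmetic. The 4000 FIT figure and the 50/50 split come from the text above; the function names, the 365-day year, and the fleet size of 1000 processors are assumptions chosen only for illustration.

```python
# 1 FIT = 1 failure per 10^9 device-hours (definition used in the text).
HOURS_PER_FIT_WINDOW = 1e9
HOURS_PER_YEAR = 24 * 365  # assumption: 365-day year


def fit_to_mtbf_hours(fit: float) -> float:
    """Mean time between failures, in hours, for a single device."""
    return HOURS_PER_FIT_WINDOW / fit


def expected_failures_per_year(fit: float, devices: int = 1) -> float:
    """Expected number of failures per year across `devices` identical units."""
    return fit * devices * HOURS_PER_YEAR / HOURS_PER_FIT_WINDOW


total_fit = 4000              # soft-error rate quoted in the text
logic_fit = total_fit * 0.5   # roughly half affects processor logic
cache_fit = total_fit * 0.5   # roughly half affects the on-chip cache

mtbf_hours = fit_to_mtbf_hours(total_fit)
mtbf_years = mtbf_hours / HOURS_PER_YEAR

# Hypothetical fleet of 1000 processors, for scale only.
fleet_failures = expected_failures_per_year(total_fit, devices=1000)
```

Under these assumptions a single processor sees one soft error roughly every 250,000 hours, but a fleet of 1000 such processors would be expected to see about 35 per year, which is why the rate matters at scale.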
Techniques such as Error Correction Codes (ECC) and Chipkill (as described by Timothy J. Dell, “A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory,” IBM Microelectronics Division, July 1997) have been used in main memories to correct some errors. Unfortunately, such techniques only help reduce visible error rates for semiconductor elements that can be covered by such codes (large storage elements). With raw error rates increasing with technological progress and with more complicated interconnected memory subsystems, ECC is unable to address all soft-error problems. Presently available hardware and software provide little to no support for recovery from errors not covered by ECC, whether detected or not.
One solution for increasing the reliability of hardware systems, and of processors in particular, has been fail-over technology, or lock stepping. In such a system, a second processor checks the progress of a first processor and takes over the operation in the event of a failure. While this approach may provide increased reliability, its cost is that the second processor must be dedicated to fail-over support of the first.
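The fail-over arrangement described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not a description of any particular hardware: the two "processors" are modeled as Python callables, the standby is assumed to be trustworthy, and the names `run_with_failover`, `healthy`, and `make_flaky` are hypothetical.

```python
from typing import Callable, List, Tuple

Step = Callable[[int], int]
Executor = Callable[[Step, int], int]


def healthy(step: Step, state: int) -> int:
    """Model of a correctly working processor: just run the step."""
    return step(state)


def make_flaky(fail_on_call: int) -> Executor:
    """Model of a processor that flips the low bit of one result (a soft error)."""
    calls = {"n": 0}

    def flaky(step: Step, state: int) -> int:
        calls["n"] += 1
        value = step(state)
        return value ^ 1 if calls["n"] == fail_on_call else value

    return flaky


def run_with_failover(steps: List[Step], state: int,
                      primary: Executor, standby: Executor) -> Tuple[int, bool]:
    """Run `steps` on the primary while the standby checks each result.

    Returns (final_state, failed_over). After the first disagreement or
    crash, the standby takes over for all remaining steps.
    """
    failed_over = False
    for step in steps:
        if failed_over:
            state = standby(step, state)
            continue
        expected = standby(step, state)  # standby shadows the same step
        try:
            result = primary(step, state)
        except Exception:
            result = None  # treat a crash like a wrong answer
        if result == expected:
            state = result
        else:
            failed_over = True
            state = expected  # assumption: the standby's result is trusted
    return state, failed_over


steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
ok_run = run_with_failover(steps, 5, healthy, healthy)
faulty_run = run_with_failover(steps, 5, make_flaky(fail_on_call=2), healthy)
```

Note that every step is executed twice while both units are healthy, which mirrors the cost stated above: the second processor contributes no additional throughput.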
Alternatively, an operating system may utilize a multiprocessor mode, whereby the operating system divides tasks among a plurality of processors. The overall processing speed of such a device is increased for a given operation, since the individual arithmetic and logic operations that make up a larger operation may be performed in parallel. Multiprocessing is most effective when the application software being run is designed for multiprocessing. This design preferably involves structuring the software such that it may be broken into smaller routines that can be performed independently. Even where software does not lend itself well to being broken into such discrete units for multiprocessing, the operating system may still make use of the additional processors through multitasking, where the operating system assigns different applications to different processors.
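As a rough illustration of software structured into smaller routines that can be performed independently, the sketch below splits one larger operation into chunks and hands them to a pool of workers. The function names and the sum-of-squares workload are assumptions for illustration. The sketch defaults to Python's standard `ThreadPoolExecutor` so that it runs anywhere; passing `ProcessPoolExecutor` from the same `concurrent.futures` module as `executor_factory` would spread the chunks across separate processes and therefore processors.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split(data: Sequence[int], parts: int) -> List[Sequence[int]]:
    """Break `data` into at most `parts` contiguous chunks of near-equal size."""
    size = max(1, -(-len(data) // parts))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk: Sequence[int]) -> int:
    """An independent routine: needs nothing but its own chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data: Sequence[int], workers: int = 4,
                            executor_factory=ThreadPoolExecutor) -> int:
    """Divide the work among `workers`, then combine the partial results."""
    chunks = split(data, workers)
    with executor_factory(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))


total = parallel_sum_of_squares(list(range(10)), workers=3)
```

The decomposition works here because each chunk can be processed without reference to any other; workloads with dependencies between steps would not divide this cleanly.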
One problem with presently available processing systems, and with the software that drives them, is that they may not be able to switch between a multiprocessor mode and a fail-over processing mode.