Computer systems include a central processing unit (CPU) that provides the processor and the hardware necessary to support the operation of the processor. The processor typically executes a special program called an operating system. The operating system provides an interface used by the user application programs executed on the computer system. This interface is consistent for the user programs and hides the specific configuration details of the computer system from them.
Computer systems do not always operate without error. System error detection, containment, and recovery are critical elements of a highly reliable and fault tolerant computing environment. While error detection is mostly accomplished through hardware mechanisms, system software plays a role in containment and recovery. In prior art computer systems, the operating system software typically attempts to handle errors that are detected in the operation of the computer system. Typically, an error generates an interrupt, which is a hardware jump to a predefined instruction address called an interrupt vector. The software at the interrupt vector uses processor mechanisms to store the state of the processor at the time the interrupt was generated, so that execution may later be resumed at the point where the executing program was interrupted. The interrupt mechanism is not exclusive to error handling. Interrupts are also used to allow higher priority programs to interrupt the execution of lower priority programs; a typical use is to process data transfers as they are needed. The effectiveness of interrupt-based error handling in maintaining system integrity depends upon coordination and cooperation among the system CPUs, the platform hardware fabric, and the system software.
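The interrupt mechanism described above (a jump to a predefined vector, saving the processor state, and later resuming the interrupted program) can be illustrated with a small software model. The following is a minimal sketch in Python, not taken from the source: the `ToyCPU` class, the vector numbers `ERROR_VECTOR` and `IO_VECTOR`, and the handler functions are all hypothetical, and a real processor performs the jump and much of the state saving in hardware rather than in a method call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical vector numbers chosen for illustration only.
ERROR_VECTOR = 2
IO_VECTOR = 5


@dataclass
class SavedState:
    """Snapshot of the processor state at the moment of the interrupt."""
    pc: int
    registers: Dict[str, int]


@dataclass
class ToyCPU:
    """Hypothetical toy processor with an interrupt vector table."""
    pc: int = 0
    registers: Dict[str, int] = field(default_factory=lambda: {"r0": 0, "r1": 0})
    vector_table: Dict[int, Callable[["ToyCPU"], None]] = field(default_factory=dict)
    saved: List[SavedState] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def set_vector(self, vector: int, handler: Callable[["ToyCPU"], None]) -> None:
        """Install the software that runs when `vector` is raised."""
        self.vector_table[vector] = handler

    def interrupt(self, vector: int) -> None:
        """Model a jump to an interrupt vector, then resume the interrupted program."""
        handler = self.vector_table.get(vector)
        if handler is None:
            # No software is installed at this vector: the interrupt cannot be handled.
            raise RuntimeError(f"unhandled interrupt vector {vector}")
        # Store the interrupted program's state so it can be resumed later.
        self.saved.append(SavedState(self.pc, dict(self.registers)))
        handler(self)  # the handler may overwrite pc and registers
        # Restore the saved state: execution resumes where it was interrupted.
        state = self.saved.pop()
        self.pc = state.pc
        self.registers = state.registers


def error_handler(cpu: ToyCPU) -> None:
    # Hypothetical error handler: records the faulting address, then clobbers state.
    cpu.log.append(f"error at pc={cpu.pc:#x}")
    cpu.pc = 0x1000
    cpu.registers["r0"] = -1


def io_handler(cpu: ToyCPU) -> None:
    # Hypothetical non-error handler, e.g. servicing a data transfer.
    cpu.log.append("data transfer serviced")
    cpu.registers["r1"] = 42


cpu = ToyCPU(pc=0x40, registers={"r0": 7, "r1": 3})
cpu.set_vector(ERROR_VECTOR, error_handler)
cpu.set_vector(IO_VECTOR, io_handler)

cpu.interrupt(ERROR_VECTOR)
cpu.interrupt(IO_VECTOR)
print(cpu.pc, cpu.registers, cpu.log)
```

After both interrupts, the program counter and registers are back at their pre-interrupt values even though each handler overwrote them. The `IO_VECTOR` handler shows the same mechanism serving a non-error purpose, and raising a vector with no installed handler models an interrupt that the software cannot handle at all.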
The operating system may not be able to correct, or even handle, some errors that generate interrupts. For example, the operating system may not have access to all the information needed to correct the error, or the error may prevent execution of the operating system software itself. These problems are aggravated by the increasing speed, size, and complexity of computer systems.