When a hardware fault is detected in a digital computer system, the failure condition is often so severe that the only way to recover is to perform a complete system reset. Such a reset usually requires a manual action (reboot, restart, etc.) to bring the system back to normal operation. Any such failure reduces the availability of the computer system and the productivity of its users and their business.
Some proprietary mainframe or mid-range computers that are used in mission-critical computing utilize special hardware and software in conjunction with a separate processor to perform some level of system recovery from a failure. Unfortunately, providing special hardware and software significantly increases the development budget, cycle time, and product cost. Further, these costs hinder the use of special hardware and software in the lower-end segment of the computer market, which relies primarily on "off the shelf" and industry-standard components for system design.
In order to provide restoration of operations in lower-end computer systems, some computer hardware vendors have added "intelligent" hardware that reboots or restarts the system upon detection of certain computer system failures. Some operating system vendors have added restart capabilities in software by branching instruction execution to a specific system firmware address when a system failure occurs. However, these approaches remain independent of one another, and each addresses only a subset of computer system failures.
Accordingly, what is needed is an improvement in computer system availability through a more integrated, comprehensive, and flexible approach to recovery from system failures that is both cost-effective and efficient.