Computer or data processing systems typically comprise a plurality of hardware components such as processors, memory devices, input-output devices and telecommunications devices. Such computer systems also comprise a plurality of software components such as operating systems, application support systems, applications, processes, data structures, and so forth. A fault or an error in any one of these hardware or software components can invalidate the results of a computer system action. Much effort has therefore been invested in discovering and correcting such faults and errors.
When a fault or error is discovered in a computer system, a specific action, or series of actions, is taken in an attempt to restore the system to working order. These actions include restarting a software process, reinitializing a data area, rebooting a central processing unit, resetting a piece of hardware, and so forth. In a complicated system, it is often difficult to determine in real time which basic hardware or software component of the system is at fault and requires attention. Since the availability of the entire system depends on a rapid return to full working status, an efficient strategy is required to minimize system recovery time.
One strategy often used to minimize recovery time for computer systems is to attempt recovery at the level of the simplest, most elementary component which could have caused the observed error or fault. If reinitialization of that lowest level component fails to clear the error or fault condition, a component at the next higher level (having a larger and more comprehensive function) is reinitialized. If the error is still not cleared, components at successively higher levels are reinitialized until the fault or error condition is cleared. If the error condition remains either after a predetermined timeout period has elapsed or after the highest level component possibly involved in the error or fault has been reinitialized, the automatic recovery system is deemed to have failed and an audio or visual alarm is used to alert attendant personnel to take corrective action. This type of staged, multilevel procedure for recovering from errors and faults may be called a multistaged system recovery strategy.
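The escalation described above can be illustrated with a minimal sketch. It is only one possible reading of the strategy, not an implementation taken from any particular system. The function name `multistage_recover`, its parameters, and the injectable `clock` are all hypothetical names introduced for illustration. The sketch assumes that each level is represented by a callable that reinitializes the component at that level, ordered from lowest to highest, and that a separate callable reports whether the fault condition has cleared.

```python
import time
from typing import Callable, Optional, Sequence


def multistage_recover(
    levels: Sequence[Callable[[], None]],
    fault_cleared: Callable[[], bool],
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[int]:
    """Reinitialize components from the lowest level to the highest.

    Returns the index of the level whose reinitialization cleared the
    fault, or None if the timeout expired or every level was tried
    without clearing the fault (that is, automatic recovery failed).
    All names here are illustrative assumptions.
    """
    deadline = clock() + timeout_s
    for index, reinitialize in enumerate(levels):
        # Stop escalating once the predetermined timeout has elapsed.
        if clock() >= deadline:
            break
        # Reinitialize the component at this level.
        reinitialize()
        # If the fault condition is gone, recovery succeeded at this level.
        if fault_cleared():
            return index
    # Timeout reached or highest level exhausted with the fault remaining.
    return None
```

A caller would treat a `None` result as the point at which an audio or visual alarm is raised for attendant personnel, and any integer result as the level at which recovery succeeded.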
The detailed logic necessary to implement multistaged system recovery strategies is complex and expensive, and requires a significant development effort. Moreover, as new fault and error conditions are identified during the life cycle of the system, additions and modifications to the logic of the recovery system become very difficult and expensive. Finally, the actual fault and error conditions, as well as the appropriate corrective actions, may change over the life cycle of the computing system. New faults and improved corrective action sequences may be discovered or may become necessary due to the aging of the components. For all of the above reasons, the design and maintenance of computer system recovery arrangements tend to be costly and unresponsive to new experience gained with the computer system.