1. Field of the Invention
This invention relates to recovery from a software or hardware error in a data processing system. More particularly, the invention relates to an error recovery subsystem which is easily reconfigured, to a method for recovering from an error, and to a program product therefor.
2. Description of the Related Art
Computer or data processing systems typically comprise a plurality of hardware components such as processors, memory devices, input/output devices and telecommunications devices. In addition, such systems also comprise a plurality of software components such as operating systems, application support systems, applications, processes, data structures, etc. A fault or an error in any one of these hardware or software components can invalidate the results of a computer system action. Much effort has therefore been invested in discovering and correcting such errors.
When an error is discovered in a data processing system, a specific recovery action, or series of actions, is generated to restore the system to working order. These actions include restarting a software process, reinitializing a data area, rebooting a central processing unit, resetting a piece of hardware, etc. In a complicated system, it is often difficult to determine in real time which basic hardware or software components of the system are at fault and require recovery actions. Because the availability of the entire data processing system depends upon a rapid return to full working status, an efficient strategy is required to minimize system recovery time.
One known method for recovery from a detected error is to examine all known system variables to precisely determine the state of the data processing system. The actual system state is then compared to all possible system states for which a sequence of recovery actions is known. The possible system states are referred to as "error states" and are retained in system memory. If the actual system state matches an error state, the sequence of recovery actions associated with such error state is invoked.
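The state-matching method just described can be sketched as a lookup from a complete system state to a stored sequence of recovery actions. The following Python sketch is illustrative only: the variable names ("disk", "cpu"), their values, and the action names are hypothetical and do not come from any particular system.

```python
from typing import Dict, List, Optional, Tuple

# A system state is modeled as an order-independent snapshot of all
# system variables. Variable names and values here are hypothetical.
SystemState = Tuple[Tuple[str, str], ...]


def freeze(variables: Dict[str, str]) -> SystemState:
    """Turn a dict of system variables into a hashable, sorted snapshot."""
    return tuple(sorted(variables.items()))


# Known "error states", each mapped to its sequence of recovery actions.
# The action names are placeholders for illustration.
ERROR_STATES: Dict[SystemState, List[str]] = {
    freeze({"disk": "timeout", "cpu": "ok"}): ["reset_disk_controller"],
    freeze({"disk": "ok", "cpu": "hung"}): ["restart_process", "reboot_cpu"],
}


def lookup_recovery(variables: Dict[str, str]) -> Optional[List[str]]:
    """Return the recovery sequence if the actual state matches a known
    error state, or None if no stored error state matches."""
    return ERROR_STATES.get(freeze(variables))
```

Because every combination of variable values is a distinct key, the table in such a scheme grows with the product of the value ranges of all system variables.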
The detailed logic necessary to implement an error recovery subsystem is complex and often requires a significant development effort. The large number of system variables in a data processing system results in an immense number of system states which must be detectable, and in an immense number of error states which must be retained in memory. Moreover, although new error conditions are frequently identified during the life of the data processing system, additions and modifications to the logic of an error recovery subsystem are very difficult and expensive. For example, the logic used to program the system must be redesigned to retain and utilize new error states and their associated sequences of recovery actions as they are discovered. In addition, redesign is necessary as the appropriate sequence of recovery actions for a given error state changes due to aging of the data processing system components. The design and maintenance of error recovery subsystems thus tend to be costly and unresponsive to the experience gained during the life of a data processing system.
One additional strategy used to minimize recovery time for data processing systems is to attempt recovery at the level of the simplest, most elementary component which could have caused the observed error condition. If reinitialization of that lowest level component fails to clear the error condition, a component at a next higher level (having a larger and more comprehensive function) is reinitialized. If the error is still not cleared, components at successively higher levels are reinitialized until the error condition is cleared. If the error condition remains after a predetermined time-out period, or after the highest level component possibly involved in the error has been reinitialized, the error recovery subsystem is deemed to have failed and an alarm is used to alert personnel to take corrective action. This type of multi-level procedural strategy for recovering from errors is known as a multi-staged error recovery system.
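A multi-staged strategy of this kind might be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name, the representation of levels as (name, reinitialize) pairs ordered from lowest to highest, and the error_present probe are all hypothetical.

```python
import time
from typing import Callable, Optional, Sequence, Tuple

# Each level is a (name, reinitialize) pair, ordered lowest level first.
Level = Tuple[str, Callable[[], None]]


def staged_recovery(
    levels: Sequence[Level],
    error_present: Callable[[], bool],
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Reinitialize components from the lowest level upward.

    Returns the name of the level whose reinitialization cleared the
    error, or None if the time-out expired or every level was tried
    without success (the caller would then raise the alarm).
    """
    deadline = clock() + timeout_s
    for name, reinitialize in levels:
        if clock() >= deadline:
            break  # predetermined time-out period has elapsed
        reinitialize()
        if not error_present():
            return name  # error cleared at this level
    return None  # recovery failed; personnel must be alerted
```

A caller would treat a None result as the failure case described above and alert personnel.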
U.S. Pat. No. 4,866,712 discloses an error recovery subsystem which is somewhat modifiable. The error recovery subsystem includes a user editable error table and a user editable action table. The error table has one entry for each possible error state and contains a count increment for each sequence of recovery actions that might be taken to correct that error condition. The action table includes action codes uniquely identifying each sequence of recovery actions and an error count threshold for each possible sequence of recovery actions. The subsystem accumulates error count increments for each possible sequence of recovery actions and, when the corresponding threshold is exceeded, initiates the associated sequence of recovery actions. Because the error table and action table are user editable, the subsystem is easily modified to account for new error states, to associate a different known sequence of recovery actions with a particular error state, and to adjust the error count thresholds. It is unclear, however, how the subsystem copes with the very large number of system variables in determining the system state. Also, although one can change the sequence of recovery actions (from one specified sequence to another specified sequence) associated with an error state by changing the action code, there is no simple way to create a new sequence of recovery actions as the system ages. Instead, the logic must be redesigned. Even if the error recovery system is implemented in software or microcode, the program must be modified and then recompiled as a new code load before installation, thereby slowing system maintenance. In addition, the particular error recovery subsystem disclosed is limited to multi-staged error recovery systems.
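The table-driven scheme summarized above can be illustrated with a short Python sketch. It follows only the description given here and is not the implementation disclosed in the cited patent. The error state names, action codes, increments, and thresholds are hypothetical, and resetting a count once its action is initiated is an assumption made for the example.

```python
from collections import defaultdict
from typing import Dict, List

# Error table: one entry per error state, giving a count increment for
# each recovery sequence (identified by action code). Values are hypothetical.
ERROR_TABLE: Dict[str, Dict[str, int]] = {
    "disk_timeout": {"A1": 1, "A2": 1},
    "parity_error": {"A2": 3},
}

# Action table: one error count threshold per action code (hypothetical).
ACTION_TABLE: Dict[str, int] = {"A1": 2, "A2": 4}


class ThresholdRecovery:
    def __init__(self, error_table: Dict[str, Dict[str, int]],
                 action_table: Dict[str, int]) -> None:
        self.error_table = error_table
        self.action_table = action_table
        self.counts: Dict[str, int] = defaultdict(int)

    def report(self, error_state: str) -> List[str]:
        """Accumulate increments for a reported error state and return the
        action codes whose thresholds were exceeded by this report."""
        triggered = []
        for code, increment in self.error_table.get(error_state, {}).items():
            self.counts[code] += increment
            if self.counts[code] > self.action_table[code]:
                triggered.append(code)
                self.counts[code] = 0  # assumption: restart the count
        return triggered
```

Editing ERROR_TABLE or ACTION_TABLE changes which known sequence is triggered and when, but the sequences themselves remain fixed behind their action codes, which is the limitation noted above.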