1. Field of the Invention
The present invention is directed to operating a digital information processing system and in particular to a system and method for the processing of error conditions arising during the operation of such an information processing system. More particularly, the present invention relates to a system and method for fault recovery through a recovery environment that operates passively with respect to the executing process.
2. Background of the Invention
Modern information processing systems employ operating system software to manage system function. The operating system provides facilities for scheduling tasks in the data processing system, managing the memory, external storage devices, and other resources of the computer system. Operating systems have evolved into complex software structures which are executed as hundreds of processes on the computer system.
The basic operating system functions are typically grouped into a kernel, a name particularly used with UNIX-based operating systems (UNIX is a trademark of UNIX System Laboratories, Inc.). The kernel is created as a single large operating program with numerous processes that are executed as required.
The operation of any computer system will occasionally encounter system failures due to software errors, hardware errors, or equipment failures. When a system failure is encountered, the operating system program must be able to analyze the cause of the failure and, if possible, correct the problem and continue processing. Error recovery is particularly important in large multiuser systems, where it is impractical to stop and restart the entire system to recover from a failure.
Historically, each operating system process has specified and constructed its own unique recovery routine. At the request of the executing process, the kernel establishes a recovery environment based on that routine. In operation, a prior art kernel proceeds through the states shown in FIG. 1. The operating system first selects the process to be initialized 102, establishes a recovery environment 104, and begins executing the process 106. If the execution is successful, the system terminates the recovery environment 108 and initializes the next process 102. The detection of a failure during the execution of process 106 causes invocation of failure processing routine 110, which enters the appropriate recovery environment 112 and attempts to correct the failure. If the recovery is successful and would allow continued execution of the process, control is returned to process execution 106. Otherwise, resources are freed, e.g., by releasing resource locks, computations are restored to a prior safe state, and the routine is terminated. The termination process then terminates the recovery environment 108 and returns to the scheduling of processes 102.
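By way of illustration only, the prior art state sequence of FIG. 1 can be sketched as the following simulation. The sketch is not taken from any actual kernel: the names `PriorArtKernel`, `Process`, and `ProcessFailure` are hypothetical, and a Python exception is assumed as a stand-in for failure detection. The trace labels reuse reference numerals 102 through 112, and the counter shows that a recovery environment is established for every process whether or not a failure occurs.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class ProcessFailure(Exception):
    """Hypothetical stand-in for a failure detected during execution 106."""


@dataclass
class Process:
    name: str
    body: Callable[[], None]     # work performed in state 106
    recover: Callable[[], bool]  # process-specific recovery routine; True = may resume


@dataclass
class PriorArtKernel:
    trace: List[str] = field(default_factory=list)
    environments_established: int = 0

    def run(self, processes: List[Process]) -> None:
        for proc in processes:
            self.trace.append(f"102 select {proc.name}")
            # The environment is established up front for every process.
            self.trace.append("104 establish recovery environment")
            self.environments_established += 1
            self._execute(proc)
            self.trace.append("108 terminate recovery environment")

    def _execute(self, proc: Process) -> None:
        while True:
            self.trace.append("106 execute")
            try:
                proc.body()
                return  # successful execution
            except ProcessFailure:
                self.trace.append("110 failure processing")
                self.trace.append("112 enter recovery environment")
                if proc.recover():
                    continue  # recovery succeeded: return to execution 106
                # Recovery failed: release locks, restore safe state, terminate.
                self.trace.append("cleanup: free resources, restore safe state")
                return


def make_fail_once() -> Callable[[], None]:
    """Build a body that fails on its first call and succeeds afterwards."""
    state = {"failed": False}

    def body() -> None:
        if not state["failed"]:
            state["failed"] = True
            raise ProcessFailure("simulated failure")

    return body


def always_fail() -> None:
    raise ProcessFailure("simulated unrecoverable failure")


kernel = PriorArtKernel()
kernel.run([
    Process("A", lambda: None, lambda: False),      # never fails
    Process("B", make_fail_once(), lambda: True),   # fails once, then resumes
    Process("C", always_fail, lambda: False),       # fails, cannot be recovered
])
```

In this sketch process A never fails, yet states 104 and 108 are still traversed on its behalf; that per-process establishment and termination is the cost the surrounding text attributes to the prior art.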
The prior art suffers from the technical problem of high system overhead, because a recovery environment must be established and then terminated for the execution of each process. Because failures occur infrequently, establishing a recovery environment for every process creates a large and non-productive overhead. In some operating systems this overhead can consume as much as 20 percent of the computer system capacity.
It would, therefore, be desirable to provide a recovery mechanism that is established and invoked only when required by a failure in the computer system. This would result in a large reduction in operating system overhead without the loss of recovery function. The technical problem to be solved is to provide for processing failure recovery without incurring the high processor overhead currently consumed.