1. Technical Field
The present invention relates generally to error/exception recovery. Specifically, the present invention is directed to a method, computer program product, and data processing system for providing optional failure recovery for desired routines in an operating system kernel.
2. Description of the Related Art
Computers are generally, by nature, deterministic machines, but they must operate in a non-deterministic world. Hardware malfunctions, invalid data or instructions, unpredictable user input, and even cosmic radiation from the farthest reaches of outer space can influence the behavior of a computer system in undesirable ways. Ultimately, any truly useful computer system is capable, whether by programming, user input, or hardware malfunction, of producing an undesired result (or, as is often the case, no result). For example, one of the fundamental results of computability theory (the undecidability of the halting problem) is that it is, in the general case, impossible to determine with certainty whether a given program of instructions will terminate (“halt”) or enter into an infinite loop (“hang” or “diverge”) on a given input.
Thus, all useful computers must react at some level to asynchronous, non-deterministic, or otherwise unpredictable events, even if such reaction takes the form of a system crash or hang condition. One of the aims of most operating systems and other runtime environments is to avoid the occurrence of crashes and hangs. For example, most modern operating systems can terminate an application process in the event that the application executes an invalid or illegal instruction or attempts an invalid memory access. In these instances, the computer hardware will generally detect the offending instruction or memory operation and raise an exception, causing an interrupt handling routine in the operating system to take notice of the exception and deal with it accordingly, often by terminating the application.
Of course, an operating system kernel is itself a computer program and is capable of experiencing the same malfunctions and other problems as any other computer program. The main distinguishing trait of an operating system kernel is that once the kernel crashes or hangs, usually the entire computer system will crash or hang. Thus, it is imperative for the stability of a computer system that kernel crashes and hangs be avoided at all costs.
Some operating systems, such as the AIX operating system (a product of International Business Machines Corporation), allow certain locations in kernel code to be designated as re-entry points in the event of certain types of failure. In AIX, for example, a call to the function “setjmpx()” allows the current location in the kernel code to be designated as the re-entry point on failure. Such facilities allow some errors to be addressed within the kernel code by re-entering the kernel code at the designated point with a failure code, but they are limited in the types of failure from which recovery can be performed. In particular, the “setjmpx()” approach is not appropriate for failures whose recovery requires significant state information. An example is a failure that occurs after system resources have been obtained subsequent to setting the re-entry point, since those resources should be released during the recovery process.
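The limitation described above can be illustrated with a minimal user-space sketch. It uses the standard C “setjmp()” and “longjmp()” functions as an analogy only; it does not use or reproduce the AIX “setjmpx()” kernel interface. The names “reentry_point”, “risky_operation”, “run_with_reentry”, and “live_buffers”, and the failure code 7, are hypothetical and chosen purely for illustration.

```c
#include <setjmp.h>
#include <stdlib.h>

static jmp_buf reentry_point;  /* saved context for the re-entry point */
static int failure_code = 0;   /* code reported by the failing routine */
int live_buffers = 0;          /* buffers acquired and not yet released */

/* Hypothetical routine: acquires a resource, then optionally fails. */
static void risky_operation(int should_fail)
{
    void *buf = malloc(64);    /* obtained after the re-entry point was set */
    if (buf == NULL)
        return;
    live_buffers++;

    if (should_fail) {
        failure_code = 7;           /* arbitrary illustrative failure code */
        longjmp(reentry_point, 1);  /* jumps back, skipping the release below */
    }

    free(buf);
    live_buffers--;
}

/* Sets the re-entry point, runs the routine, and returns the failure
 * code seen at the re-entry point (0 if the routine completed). */
int run_with_reentry(int should_fail)
{
    failure_code = 0;
    if (setjmp(reentry_point) != 0) {
        /* Re-entered after a failure. Only the failure code is known here;
         * nothing records which resources were obtained in the meantime. */
        return failure_code;
    }
    risky_operation(should_fail);
    return 0;
}
```

In this sketch, calling run_with_reentry(0) returns 0 and leaves live_buffers at 0, whereas calling run_with_reentry(1) returns the failure code and leaves live_buffers at 1. The buffer is never released because the re-entry point has no record that it was obtained, which is the kind of state-dependent recovery a bare re-entry mechanism does not address.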
Moreover, it would be desirable to allow kernel recovery features to be implemented gradually across an operating system kernel, so that, in early and intermediate versions of the operating system, some operating system routines may support kernel recovery, while others do not. This would allow operating system updates to be made more easily, so that kernel recovery features may be integrated into an operating system without the necessity of a major rewrite or re-release of the software. Since operating system updates are easily distributed electronically through the Internet (and often automatically installed by the operating system itself), this ability would be immediately advantageous. It would also be advantageous, from a performance standpoint, to provide an ability to disable recovery (and the computing overhead associated therewith) when it is not needed or desired.
What is needed, therefore, is a method for providing more comprehensive recovery from kernel failures, in which optional recovery features may be gradually added to an operating system kernel. The present invention provides a solution to this and other problems, and offers other advantages over previous solutions.