1. Field of the Invention
The present invention relates generally to computer system software for handling faults, resulting from logic and coding errors, corrupted states in memory, and other hardware failures, that can cause a computer system to crash. More specifically, the invention relates to a virtual machine used for the diagnosis of and recovery from such faults.
2. Discussion of Related Art
Ever since computers came into use in commercial and non-commercial settings on any scale, devising fault-tolerant computer systems has been an important and constantly evolving area in computer science. As computers are used more and more in environments where failures must be avoided as much as possible, fault-tolerant systems have been further developed to best handle unexpected system failures. With current fault-tolerant systems, fault diagnosis and fault recovery have generally been separated or isolated from each other. Determining that a fault occurred because of a logic error or a corrupted memory state is a distinct task from actually recovering the system to a normal state so processing can continue. At one end of the spectrum of fault-tolerant systems, recovery and restart are emphasized. At the other end of the spectrum, system testing and diagnosis emphasize system modeling, simulation, and analytical methods to obtain reliability estimates, such as proof of correctness and Mean Time To Failure metrics.
Between these two extremes, many software systems react to faults by taking a snapshot of all available state information at the time of the fault. In these systems, fault diagnosis is done after crash recovery by applying human intelligence to the state snapshot. Future recovery from occurrences of the same problem depends on the human analyst providing a fix for the problem which may require a new release of the software.
A common approach to fault tolerance is a checkpoint/restart mechanism with or without redundant hardware. The redundant hardware is used as a standby when the normal system fails. Test/diagnostic equipment depends on simulation and verification of some abstract model of the system. These methods are not always a practical solution for legacy systems, which are cost-sensitive and change due to market forces. These methods add cost and complexity to the system, making the system harder to debug and maintain. Furthermore, the redundant hardware adds to the overall costs of the system.
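The checkpoint/restart mechanism mentioned above can be pictured with a minimal sketch. The class name `CheckpointedSystem` and its methods are hypothetical and chosen only for illustration; the sketch assumes the system state fits in an in-memory dictionary and ignores redundant hardware entirely.

```python
import copy


class CheckpointedSystem:
    """Hypothetical system whose state can be saved and rolled back."""

    def __init__(self, state):
        self.state = state
        # Keep an independent copy so later mutations do not alter the checkpoint.
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Record the current state as the new restart point.
        self._checkpoint = copy.deepcopy(self.state)

    def restart(self):
        # Discard the current (possibly corrupted) state and resume from the checkpoint.
        self.state = copy.deepcopy(self._checkpoint)
```

Note that a restart of this kind restores operation but records nothing about why the fault occurred, which is the separation of recovery from diagnosis described earlier.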
Systems not designed for fault tolerance have tools for fault diagnosis. One such technique involves taking a snapshot of the system where the snapshot is more complete and is taken at the precise time the fault or crash occurred or is detected. This type of complete snapshot typically provides a wealth of raw system state data that can be used for pure diagnosis and is in a human readable or accessible form, normally with a debugger or crash analyzer. Human intelligence is needed to get from symptoms to root causes and, as such, is labor-intensive and is done off-line, i.e., after unrecoverable damage has been done and the system has crashed. Although the snapshot is more complete, diagnostic information is still limited to the static snapshot of the system. A dynamic response to the fault cannot be determined since the dynamic response is gratuitously altered to capture the static snapshot and to then crash and reboot the system.
When a fault occurs in a system, system state information is unreliable. This makes implementing a sophisticated fault handler problematic since it must work under conditions where correctness of operation is suspect. Fault handlers are software systems and, thus, prone to the same types of failures they are designed to handle. The problem is exacerbated by difficulty in testing the fault handler for the various scenarios it must handle. If the scenarios were known, the fault could have been avoided. Methods to handle faults must consider not only the specifics of the fault but also the context in which the fault occurs. For example, the effect of a fault in an application level process context will differ from the effect of a similar fault in an interrupt handler. It is difficult to test for all possible scenarios. Thus, there is the risk of inadequately tested software attempting to diagnose and recover from an unknown and unexpected state at a time when system operation is unreliable, making diagnosis and recovery more difficult than they would otherwise be. Consequently, it is common to keep the fault handler as simple as possible.
Another method of diagnosing a fault involves using analytical methods, an expert system, or some type of modeling and simulation. These techniques may generate test vectors which are applied to the target system to study its response or to generate measures of reliability or stability. For numerous reasons, such methods are impracticable in applications where there is a rapidly evolving code base, typically in response to market forces. Such methods, used typically in academic settings, require a very stable software code base since much time and effort must go into formulating a model, setting up a test rig, and collecting and analyzing data. These methods are off-line and are performed with reference to a model of the system and, thus, are limited to that model, which rapidly becomes obsolete.
FIG. 1 is a flow diagram of a generic or abstract process of handling system faults used in the techniques described above and known in the field of fault handling software systems. A system fault handler (typically a component or module in a normally operating computer system), executing concurrently with other processes during normal operation of the computer system, begins with determining whether a fault that has occurred is a fault from which the system can recover at step 102. Recoverable faults are those that the system fault handler has been explicitly designed to handle. If the fault is recoverable, the system fault handler addresses the fault and returns the system to normal operation at step 106.
The emphasis here is on recovery and restart rather than diagnosis and analysis. In a checkpoint/restart system, the fault handler will use a checkpoint snapshot to return the system to a previous state, with the primary goal of simply getting the system back up and running, the goal with the highest priority in most commercial scenarios. If the fault is not recoverable, control goes to step 104 in which a current snapshot of the system is used. This static snapshot is of the system at the time the fault occurred (i.e., snapshot of current system state) and is used to diagnose the problem off-line. The system is then brought back up by rebooting, a significant step that is typically the least desirable way of resuming normal operations.
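The generic flow of FIG. 1 can be sketched as follows. This is only an illustration of steps 102, 104, and 106 as described above; the class `GenericFaultHandler`, its `recoverers` table, and the string return values are assumptions introduced here, and clearing the state dictionary merely stands in for a reboot.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class GenericFaultHandler:
    # Hypothetical table: fault kind -> routine the handler was explicitly designed with.
    recoverers: Dict[str, Callable[[dict], None]] = field(default_factory=dict)
    snapshots: List[dict] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def handle(self, fault_kind: str, system_state: dict) -> str:
        # Step 102: is this a fault the handler was explicitly designed to handle?
        recover = self.recoverers.get(fault_kind)
        if recover is not None:
            # Step 106: address the fault and return to normal operation.
            recover(system_state)
            self.log.append(f"recovered:{fault_kind}")
            return "resumed"
        # Step 104: keep a static snapshot for off-line diagnosis.
        self.snapshots.append(dict(system_state))
        self.log.append(f"snapshot:{fault_kind}")
        # Stand-in for the reboot: all dynamic state is lost.
        system_state.clear()
        return "rebooted"
```

In this sketch the unrecoverable path leaves only the static snapshot behind, which reflects the limitation discussed above: nothing is learned about how the system would have behaved after the fault.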
Therefore, it would be desirable to have a fault tolerant system that is capable of performing system recovery and restart and real-time diagnosis of the fault so that the same fault does not occur repeatedly. It would be desirable if the system fault handler consumed a minimal amount of resources by executing only when a fault occurs and not at all times. This also has the benefit of keeping the hardware and software less complex. In such a system, the degree of human analysis and effort spent on a current system state snapshot would be minimized since much of the diagnosis would be performed by the fault handler. It would also be desirable to be able to self-test and monitor the fault handler for various scenarios so that it can more efficiently restart the system and diagnose the fault and its context. It would be desirable for a fault handler process to permit the system to continue operation after an otherwise catastrophic failure in order to get more data on the dynamic effects of the fault or to recover from the fault.
To achieve the foregoing, methods, apparatus, and computer-readable media are disclosed for analyzing and recovering from severe faults in a computer system. In one aspect of the invention, a method of detecting and fixing a normally unrecoverable fault in a computer system is described. An initial fault caused by the computer system operating in a particular and typically expected sequence is recognized in the computer system. This fault is one that could not be handled by the computer system's normal fault handling processes. Once the fault is recognized as an unrecoverable fault, an alternative mode, or shadow mode, of operation for the computer system is invoked. This mode is used to run a fault handling virtual machine. The alternative mode is used to track and analyze behavior and performance of the computer system once the fault has occurred. Through this process, system state data can be gathered for fault diagnosis and system recovery. The alternative mode then attempts to recover from the fault by dynamically using the system state data to cause the computer system to operate in a different sequence thereby potentially avoiding the fault.
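One way to picture the method just described is the sketch below. It is not the disclosed implementation: the names `UnrecoverableFault`, `run_sequence`, and `run_with_shadow_mode` are hypothetical, system state is modeled as a plain dictionary, and the candidate alternative sequences are assumed to be supplied by the caller rather than derived from the gathered state data.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Step = Callable[[Dict[str, int]], None]


class UnrecoverableFault(Exception):
    """Hypothetical marker for a fault that normal fault handling cannot resolve."""


def run_sequence(steps: Sequence[Step], state: Dict[str, int]) -> None:
    for step in steps:
        step(state)


def run_with_shadow_mode(
    primary: Sequence[Step],
    alternatives: Sequence[Sequence[Step]],
    state: Dict[str, int],
) -> Tuple[str, List[dict]]:
    trace: List[dict] = []  # state data gathered after the initial fault
    try:
        run_sequence(primary, state)
        return "normal", trace
    except UnrecoverableFault as fault:
        # Enter the alternative (shadow) mode instead of crashing.
        trace.append({"fault": str(fault), "state": dict(state)})
    for index, sequence in enumerate(alternatives):
        trial = dict(state)  # work on a copy so a failed attempt is discarded
        try:
            run_sequence(sequence, trial)
        except UnrecoverableFault as fault:
            trace.append({"fault": str(fault), "state": dict(trial)})
            continue
        # A different sequence avoided the fault: adopt its resulting state.
        state.clear()
        state.update(trial)
        return f"recovered:{index}", trace
    return "unrecovered", trace
```

The returned trace plays the role of the additional state data: it records every fault observed after the initial one, so diagnosis can draw on the system's behavior after the fault and not only on a single static snapshot.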
In another aspect of the present invention, a fault handling virtual machine is installed on a computer system upon detection of an unrecoverable fault. The fault handling virtual machine extends the capabilities of the computer system to fault diagnosis and recovery by applying expert knowledge of the computer system. One of the components is a post-fault stable state constructor that constructs a normal operating state for the computer system after a fault occurs. A fault data collector collects specific information on the state of the computer system at the time of the fault. The fault handling virtual machine also includes a fault data examination component for examining the specific information on the state of the computer system after a fault occurs.
In one embodiment, the fault handling virtual machine includes a persistent fault handler that is capable of processing and handling persistent faults that occur in the system once the fault handling virtual machine is invoked. In another embodiment, the fault handling virtual machine includes a fault severity measuring component for determining the severity of a fault by looking at expert knowledge of the computer system and a current fault state.
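The components named in the preceding two paragraphs can be laid out as a minimal sketch. Everything here is an assumption made for illustration: the class `FaultHandlingVM`, the rule that a negative value marks a suspect state entry, the multiplicative severity score, and the weight tables standing in for expert knowledge of the computer system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FaultRecord:
    kind: str
    context: str  # e.g. "process" or "interrupt"
    state: Dict[str, int]


@dataclass
class FaultHandlingVM:
    # Hypothetical expert knowledge: weights per fault kind and per context.
    kind_weight: Dict[str, int]
    context_weight: Dict[str, int]
    records: List[FaultRecord] = field(default_factory=list)

    def collect(self, kind: str, context: str, state: Dict[str, int]) -> FaultRecord:
        """Fault data collector: copy the state at the time of the fault."""
        record = FaultRecord(kind, context, dict(state))
        self.records.append(record)
        return record

    def examine(self, record: FaultRecord) -> List[str]:
        """Fault data examination: names of entries that look corrupted (assumed: negative)."""
        return sorted(k for k, v in record.state.items() if v < 0)

    def severity(self, record: FaultRecord) -> int:
        """Fault severity measure from expert knowledge and the current fault state."""
        return self.kind_weight.get(record.kind, 1) * self.context_weight.get(record.context, 1)

    def is_persistent(self, record: FaultRecord) -> bool:
        """Persistent fault check: the same kind of fault was collected before."""
        return sum(1 for r in self.records if r.kind == record.kind) > 1

    def construct_stable_state(self, record: FaultRecord, defaults: Dict[str, int]) -> Dict[str, int]:
        """Post-fault stable state constructor: replace suspect entries with known-good defaults."""
        suspect = set(self.examine(record))
        return {k: (defaults.get(k, 0) if k in suspect else v) for k, v in record.state.items()}
```

The context weight reflects the earlier observation that a fault in an interrupt handler has a different effect from a similar fault in an application level process.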
In yet another aspect of the present invention, a computer-readable medium contains computer programming instructions for detecting and fixing a normally unrecoverable fault in a computer system. The programming instructions include computer code for recognizing that an initial fault has occurred as a result of the computer system operating in a particular sequence. The programming instructions also include computer code for invoking an alternative mode of operation for the computer system upon recognizing the initial fault. The programming instructions also cause the use of the alternative mode to track performance of the system after the initial fault thereby gathering additional state information for fault diagnosis and system recovery. The computer programming instructions also prevent a subsequent fault from occurring as a result of recovery from the initial fault. This is done by using a dynamic state of the computer system to cause it to operate in another sequence such that the initial fault and the subsequent fault are potentially avoided.