This invention relates to fault recovery in software systems.
A stored program controlled computer system comprises a basic processor with an operating system (OS) that provides a first-level user interface. Examples of operating systems are UNIX (a trademark of UNIX System Laboratories) and WINDOWS (a trademark of Microsoft). Programming languages such as C and C++ and their runtime environments provide a second-level user interface to facilitate programming on a computer system. Sometimes, even higher-level languages such as Java (a trademark of Sun Microsystems) and their runtime environments form a third-level user interface to further ease application programming. Each higher-level user interface attempts to reduce the expertise required of the computer system's programmer while (typically) reducing the programmer's control over the computer system.
Increasingly, server systems—which are computer systems that serve a plurality of clients—are partially or completely implemented in the Java programming language.
Most such servers have stringent availability requirements, which translate to a need for very reliable software. One reason for Java's popularity in server embodiments is the perception that Java's features help in writing reliable code quickly. A variety of mechanisms are available in Java that help avoid many programming errors that are common in other languages. These include, for example, pointerless programming, automatic garbage collection, type safety, the integration of threads as objects, and an extensive library of convenience classes. Other mechanisms, such as Java's exception handling, encourage writing code that recovers from errors, thereby adding to the reliability of the resulting program.
Even a very reliable server, however, is likely to encounter an execution error at some point in time. The behavior of a server in response to an error is another factor that strongly influences its availability, and here, too, Java has certain advantages. For example, if an error that is not caused by a virtual machine fault, and that violates the semantics of the Java language, occurs in a Java program, then the Java Virtual Machine (JVM) raises an exception and either invokes the appropriate user-supplied exception handler or, in the absence of such an exception handler, terminates the failed thread. (A thread is a concurrent unit of execution inside a process. Numerous threads can execute in a particular process, numerous processes can be part of an application, and a thread can depend on another thread.) This type of controlled reaction to an error often isolates the error to the failed thread. This stands in contrast to languages such as C and C++, where errors can spread without immediate detection and containment, and are often caught only by the operating system, which can result in the complete failure of the parent application.
To illustrate this point, consider a server implemented in C or C++ with no additional fault tolerance provisions. Suppose that this server has a dedicated client handler thread for each current client and that one of the client handler threads modifies an array element beyond the allocated range of the array. There are several possible outcomes of this error. One is the corruption of data in the server, which may lead to serious failures later on. Another possible outcome is the access of a memory location outside the address space of the server process. The operating system will detect this memory access attempt and shut down the entire server application (fail-stop behavior), potentially resulting in the loss of work that the server had performed on behalf of other clients that depended on the application that is shut down, and the temporary loss of availability of the application for all of the server's clients.
On the other hand, when the server is implemented in Java and one of its client handler threads attempts to modify an array element that is beyond the allocated range of the array, the JVM will immediately detect the error and raise an ArrayIndexOutOfBoundsException. There is a good chance that the server does not include a programmer-supplied exception handler for this exception, and, in such a case, the JVM terminates the offending client handler thread. Thus, the JVM contains the failure in the offending thread, and all of the other client handler threads are unaffected, continuing to be available to their clients. In other words, the failed thread is terminated, but neither the parent application nor the other clients of that application are affected.
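The containment behavior described above can be illustrated with a minimal sketch. The class name ThreadIsolationDemo, the method runHandlers, and the thread names are hypothetical and chosen only for illustration; the sketch assumes a server reduced to a handful of handler threads, one of which writes one element past the end of its array and has no exception handler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demonstration class; it is not part of any server described in the text.
class ThreadIsolationDemo {

    // Starts `handlers` client handler threads. The handler whose index equals
    // `faulty` writes one element past the end of its array; no handler catches
    // the resulting exception. Returns the number of handlers that finished.
    static int runHandlers(int handlers, int faulty) {
        AtomicInteger completed = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < handlers; i++) {
            final int id = i;
            Thread t = new Thread(() -> {
                int[] buffer = new int[4];
                int index = (id == faulty) ? buffer.length : 0;
                // In the faulty handler this line raises
                // ArrayIndexOutOfBoundsException and the thread terminates.
                buffer[index] = 42;
                completed.incrementAndGet();
            }, "client-handler-" + id);
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join(); // wait for every handler, failed or not
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return completed.get();
    }

    public static void main(String[] args) {
        // Three handlers, the second one faulty.
        System.out.println("handlers completed: " + runHandlers(3, 1));
    }
}
```

With three handlers and one faulty handler, the program reports that two handlers completed, and the JVM writes a stack trace for the failed thread to the standard error stream. The process itself keeps running, which is the graceful degradation discussed next.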
Although this Java failure behavior, which can be called graceful degradation, is superior to the C/C++ failure behavior with respect to server availability, it is not ideal. Graceful degradation does not restore the availability of a server. It merely limits the damage done by a failure. If the failed thread is pivotal, in the sense that the proper functioning of a server depends on it, graceful degradation will not enable the server to continue providing its intended service. Even if the failed thread is non-pivotal, graceful degradation will prevent immediate disastrous results, but the availability of the server will degrade, and may gradually worsen to a point where the server will cease to function. This type of failure can be particularly insidious because graceful degradation, unlike fail-stop behavior, allows the server to operate with reduced availability, which can be difficult to detect by the users and administrators of the server.
There is also a wide range of failures that the Java runtime environment (JRE), i.e., the combination of JVM and Java runtime library, does not detect at all. These include thread starvation, thread deadlock, excessive garbage collector activity, machine or process crashes, and thread hangs. The reason is that essentially all that the JRE does is log a stack trace of a failed thread. While this is certainly helpful, it is often insufficient for debugging purposes.
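A thread deadlock illustrates why such failures go unnoticed: no exception is raised and no stack trace is logged, so the condition becomes visible only if something explicitly looks for it. The following is a minimal sketch, assuming a JVM that provides the standard java.lang.management package (Java 5 or later), which the text itself does not discuss. The class name DeadlockProbeDemo and its methods are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical demonstration class; names are chosen for illustration only.
class DeadlockProbeDemo {

    // Creates a daemon thread that locks `first`, waits until both threads hold
    // their first lock, and then blocks forever trying to lock `second`.
    static Thread locker(Object first, Object second, CountDownLatch bothHoldFirst) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await();
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // never reached once the deadlock has formed
                }
            }
        });
        t.setDaemon(true); // lets the JVM exit although the threads never finish
        return t;
    }

    // Deadlocks two threads, then polls the JVM until it reports threads that
    // are deadlocked on object monitors or the timeout expires.
    // Returns the number of deadlocked threads reported (0 on timeout).
    static int countDeadlockedThreads(long timeoutMillis) {
        Object a = new Object();
        Object b = new Object();
        CountDownLatch latch = new CountDownLatch(2);
        locker(a, b, latch).start();
        locker(b, a, latch).start(); // opposite lock order

        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            long[] ids = bean.findMonitorDeadlockedThreads(); // null if none
            if (ids != null) {
                return ids.length;
            }
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + countDeadlockedThreads(5000));
    }
}
```

The two deadlocked threads produce no output of their own. Only the explicit polling loop reveals them, which suggests why detecting this class of failure calls for supervision beyond what the JRE does by default.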
The art has attempted to assist programmers with failure analysis and recovery tools. One such tool is described by Huang et al. in “NT-SwiFT: Software Implemented Fault Tolerance on Windows NT,” Proceedings of the 2nd USENIX Windows NT Symposium, pp. 47–55, Seattle, Wash., USA, August 1998. This tool can detect the crash of a watched target application and restart the crashed application. However, it can only catch failures with a fail-stop behavior, and such failures are relatively rare in Java programs because of Java's graceful degradation behavior. They occur only due to a fault in the Java virtual machine, due to a fault in a native code attachment to a Java application, or due to external influences such as a machine crash or a forceful shutdown of the target application through an operating system command, e.g., a kill signal.
Bernhard Plattner in “Real-Time Execution Monitoring,” IEEE Transactions on Software Engineering, SE-10(6), pp. 750–764, November 1984, describes a supervisor for real-time applications called a real-time monitor. However, the functionality of, and the real-time constraints for, this real-time monitor require special hardware support and two machines, as well as access to the source code of the target application.
An application supervisor for the ADA programming language is described by DiMaio et al. in “Execution Monitoring and Debugging Tool for ADA Using Relational Algebra,” Proceedings of the ADA International Conference on ADA in Use, pp. 109–123, 1985. Like the real-time monitor, the ADA supervisor lets the user specify failure detection and recovery policies. However, the ADA supervisor requires source code editing of the target program and subsequent recompilation.
The Meta system described by Marzullo et al. in “Tools for Distributed Application Management,” IEEE Computer, 24(8), pp. 42–51, August 1991, is a toolkit for writing external application supervisors for distributed target applications, written in C and running, for example, on UNIX. Meta allows the placement of powerful generic and customized sensors (probes placed in the target application that can be queried to deliver information about the program or the program state) and actuators (code fragments that affect program execution). While Meta allows the coordinated supervision of all remote components of a distributed application, it requires detailed knowledge of, changes and additions to, and recompilation of the target source code.