The invention relates to a system and method for protecting a computer operating system from unexpected errors, and more particularly to a system and method for improving application stability under the Microsoft WINDOWS operating system.
Multitasking, graphics-based operating systems such as Microsoft WINDOWS 95 demand a high degree of expertise from an application programmer. The difficulties inherent in writing synchronized program code in an event-driven, multitasking environment, coupled with a vast and changing system application program interface ("API") consisting of thousands of functions, inevitably result in the production of software programs that contain errors, or "bugs," at several points. Even if an application program is tested relatively thoroughly, some portions of the program code may not be sufficiently exercised to locate the errors. And even if the erroneous portion is executed during testing, it may cause seemingly benign errors that pass undetected.
User input to software, through the keyboard, mouse, etc., is frequently unpredictable. Because of this, an application may attempt to process a combination of parameters that was not anticipated by the programmer. In this case, too, the program may respond in a benign manner, or in some circumstances may cause certain regions of memory to be inadvertently altered, or "corrupted." Those memory regions might "belong" to the program being executed, or might belong to the operating system or another loaded program. Similarly, the corrupted regions might include important data, or they might be unallocated storage. It is generally not possible to determine in advance which regions of memory a defective program might attempt to access.
In some circumstances, a programming error may trigger a CPU exception if the program attempts to perform an illegal operation. A CPU exception is the central processing unit's response to an error condition, whether expected or unexpected. For example, an attempt to perform an undefined mathematical operation (such as dividing by zero), an attempt to access a memory location that does not exist, or an attempt to execute code that does not satisfy the CPU's syntax requirements, will typically result in a CPU exception. However, not all CPU exceptions result in a "crash" of the system. A CPU exception will cause a software interrupt. That is, when a CPU exception is encountered, processing immediately stops and is transferred to another program location.
That other program location can contain a segment of program code designed to take whatever action is intended by an operating system programmer. For example, an error message can be presented to the operator. Alternatively, if the CPU exception was expected, then other processing can be performed. Such an exception-handling scheme is used in Microsoft WINDOWS and other operating systems to handle "virtual memory," in which disk storage is used to virtually increase the amount of system memory. Some of the contents of system memory are "swapped out" to disk and removed from memory. Upon a later attempt to access those contents, a CPU exception will occur because the contents sought do not exist within system memory. The operating system will then handle that expected CPU exception condition, bring the contents back into system memory, and allow the operation to proceed.
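The handling of an expected exception for virtual memory can be illustrated with a small simulation. The following is only a sketch and is not WINDOWS code: the names `PageFault` and `PagedMemory`, the dictionary standing in for disk storage, and the oldest-page eviction rule are all assumptions made for illustration.

```python
class PageFault(Exception):
    """Stands in for the CPU exception raised when a page is not resident."""

    def __init__(self, page):
        super().__init__(f"page {page} not resident")
        self.page = page


class PagedMemory:
    """Toy model: a few resident pages backed by a larger 'disk' store."""

    def __init__(self, resident_limit):
        self.resident_limit = resident_limit
        self.resident = {}   # page number -> contents held in "system memory"
        self.disk = {}       # page number -> contents that were swapped out
        self.faults = 0      # number of expected exceptions handled

    def _raw_read(self, page):
        # The "CPU" access: fails if the page is not in system memory.
        if page not in self.resident:
            raise PageFault(page)
        return self.resident[page]

    def _handle_fault(self, page):
        # The "operating system" handler: make room, then swap the page in.
        self.faults += 1
        if len(self.resident) >= self.resident_limit:
            victim = next(iter(self.resident))        # oldest resident page
            self.disk[victim] = self.resident.pop(victim)
        self.resident[page] = self.disk.pop(page, 0)  # unseen pages read as 0

    def read(self, page):
        try:
            return self._raw_read(page)
        except PageFault as fault:
            self._handle_fault(fault.page)
            return self._raw_read(page)   # retry the interrupted operation

    def write(self, page, value):
        if page not in self.resident:
            self._handle_fault(page)
        self.resident[page] = value
```

In this model the caller of `read` never sees the exception: the handler restores the swapped-out contents and the access is retried, which mirrors the "allow the operation to proceed" step described above.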
Most complex operating systems, including Microsoft WINDOWS 95, use CPU exception handling techniques in performing a wide variety of operations. Even so, in many cases, a CPU exception will reflect an error or malfunction. In such cases, the operating system will typically not be able to correct the malfunction, and can only present an error message (typically a cryptic one, useless to all but the most experienced and knowledgeable programmers) to the computer operator.
Depending on the nature of the malfunction, and the action, if any, that the operating system takes in an attempt to block or remedy the malfunction, the offending program can perform in one of numerous ways. The system may stop executing and appear to be deadlocked. The application may continue executing despite the possibility that important data has been corrupted. The application may be shut down by the operating system, or may so adversely affect the operating system itself that the computer must be restarted with an accompanying loss of data.
One goal of operating system design is to minimize the possibility of data loss, and the general trend for the most advanced operating systems, such as Microsoft WINDOWS NT, has been to shield (as far as possible) the memory regions containing the operating system's code and data from the reach of an application program. In other words, an application program can alter itself and its own data, but would be entirely unable to affect any other portion of the system, including other application programs and the operating system itself.
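The protection model described in this paragraph can be sketched as an ownership check on every write. The sketch below is a simplified illustration, not a description of how WINDOWS NT is implemented; the names `ProtectedMemory` and `AccessViolation`, and the per-address ownership table, are hypothetical.

```python
class AccessViolation(Exception):
    """Raised when a program touches memory it does not own."""


class ProtectedMemory:
    """Toy model in which each address has exactly one owner."""

    def __init__(self):
        self.owner_of = {}   # address -> name of the owning program
        self.cells = {}      # address -> stored value

    def allocate(self, owner, addresses):
        for address in addresses:
            self.owner_of[address] = owner
            self.cells[address] = 0

    def write(self, program, address, value):
        # Only the owner may alter a cell; every other write is blocked,
        # including writes to unallocated addresses.
        if self.owner_of.get(address) != program:
            raise AccessViolation(f"{program} may not write {address:#x}")
        self.cells[address] = value
```

Under this model an application can alter its own data, but a stray write aimed at the operating system's region raises `AccessViolation` and leaves that region unchanged.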
However, a rigorous implementation of this architecture may not be feasible in a mass-market operating system designed to run on lower-cost systems, which typically have slower CPUs and tighter system memory constraints. Therefore, the Microsoft WINDOWS 95 operating system, which substantially retains the memory architecture of earlier versions of WINDOWS, remains highly susceptible to many types of program errors. In fact, it is relatively easy to write code that will crash the operating system.
One program of this kind is discussed in Schulman, Unauthorized Windows 95 (IDG Books 1994), and is available from //ftp.ora.com/pub/examples/windows/win95.update/unauthw.html. This program, RANDRW, purports to measure the susceptibility of various operating systems to serious program errors. According to its author, RANDRW makes random memory accesses across the memory range of the system. An access is deemed a "hit" if it is allowed to proceed without being blocked by the operating system. In the WINDOWS 95 environment, Schulman reported a hit rate of approximately 1.5%, indicating that improper accesses were being allowed to occur. It should be noted that the 4 gigabyte address space in which WINDOWS 95 runs is generally about 90% unused and uncommitted, so that the 1.5% hit rate within the 4 gigabyte range translates into a much larger percentage of wrongful memory access and data corruption.
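The scaling from the overall hit rate to the rate within committed memory can be worked through using the figures given above. The calculation below is a sketch under two assumptions not stated in the source: that the random accesses are uniformly distributed over the address space, and that an access to an uncommitted address is never a hit. The function name is hypothetical.

```python
from fractions import Fraction


def hit_rate_within_committed(overall_hit_rate, committed_fraction):
    """Rescale a hit rate measured over the whole address space.

    Assumes accesses are uniformly random and that only committed
    addresses can ever produce a hit.
    """
    if not 0 < committed_fraction <= 1:
        raise ValueError("committed_fraction must be in (0, 1]")
    return overall_hit_rate / committed_fraction


# Figures taken from the text: 1.5% overall hit rate, about 90% unused.
overall = Fraction("0.015")
committed = 1 - Fraction("0.90")
rate = hit_rate_within_committed(overall, committed)
print(f"{float(rate):.0%} of accesses to committed memory succeed")
```

Under those assumptions, the 1.5% figure corresponds to roughly 15% of accesses that land in committed memory being allowed to proceed.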
A breakdown of RANDRW memory accesses by address has shown that almost all of the core WINDOWS system components are susceptible to being corrupted in this way. The ease with which a 32-bit application program can affect critical system memory is especially alarming because the entire address range of the processor, including the address ranges occupied by critical system components, is within the accidental reach of the program. Older 16-bit programs are able to reach a narrower extent of system resources, but are still able to cause serious damage.
Unfortunately, it is practically impossible to predict the manner of a malfunction. When one occurs, it is correspondingly difficult to remedy the malfunction so that the program that caused it is able to proceed. If there is an isolated stray access, it may be possible to block the access with no appreciable effect on the program. More likely, an application program was attempting to perform a certain operation when it went awry, and its failure to accomplish the operation will affect further operations. Hence, one fault results in another, and the entire course of the program is altered. In certain circumstances, the CPU context of the program may become damaged. For example, an unbalanced stack may cause the stack pointer to be reset, thereby making continued execution of the program impossible and a haphazard restoration of the CPU context unavailing. A side effect of this latter kind of error is that fault handlers built into the program (even those outside of the application program but executing at the same CPU privilege level as the program) will probably also be unable to execute or will themselves malfunction in the attempt.
In addition, one further type of application failure can be identified, in which the application appears to be deadlocked because it is improperly executing an infinite loop. A failure of this kind will not result in a CPU fault and may not cause any data to be corrupted. However, because the program is essentially deadlocked, it might not accept any further input, necessitating a forced shutdown with data being lost.
One prior attempt to address these issues is embodied in the software utility called FIRST AID, various versions of which have been available from Cybermedia, and similarly in subsequent products such as NORTON CRASH GUARD from Symantec and PC MEDIC from McAfee Associates. In FIRST AID, an assumption is made that the architecture of almost all WINDOWS programs is founded on a core piece of program code called the "message loop." In general, after an application program is initialized by creating one or more windows to be displayed on the desktop, it enters the message loop, from which it exits only when the program is terminated. The message loop itself consists of a series of prescribed WINDOWS API function calls that pick up user input and other messages from a system-managed queue, associate them with one of the application's windows, and dispatch them to the message handling procedure of the appropriate window for processing.
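The structure of a message loop as described above can be sketched in a few lines. This is a simulation only and calls no WINDOWS API functions; the names `Window`, `message_loop`, and the `QUIT` sentinel are hypothetical stand-ins for the system-managed queue, the window procedures, and the termination message.

```python
from collections import deque

QUIT = "QUIT"   # hypothetical sentinel, playing the role of a quit message


class Window:
    """A window with its own message-handling procedure."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def window_procedure(self, message, param):
        # Most of an application's code would live here.
        self.log.append((message, param))


def message_loop(queue, windows):
    """Pick up messages, associate each with a window, and dispatch it.

    A real loop would block while the queue is empty; this sketch simply
    stops when the queue runs out or a QUIT message is picked up.
    """
    dispatched = 0
    while queue:
        target, message, param = queue.popleft()   # pick up the next message
        if message == QUIT:
            break                                  # exit only at termination
        window = windows[target]                   # associate with a window
        window.window_procedure(message, param)    # dispatch for processing
        dispatched += 1
    return dispatched
```

For example, a queue holding a key message for one window, a click message for another, and then `QUIT` results in two dispatches, with any messages queued after `QUIT` left unprocessed.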
The majority of an application's program code is contained in its window procedures, and is executed either indirectly, when a message is dispatched from the message loop, or directly, when the WINDOWS operating system bypasses the message loop and calls the window procedure itself. Although there are certain other means by which an application's program code can be executed, these are in a minority. Therefore, when a program malfunctions, it is likely to be executing code contained in its window procedures in response to some message.
FIRST AID makes the assumption that the specific message input that caused the error may not be repeatable, and that it may not be necessary to complete processing of the specific message input. Instead, FIRST AID attempts to enter a new message loop at the point that otherwise the program would have been terminated. For this purpose it installs a driver that gains control whenever a CPU fault occurs. Executing within the context of the faulting application, the driver alerts the user to the error condition, and allows him to decide to terminate the application, as would happen normally, or to reactivate it. Reactivating the application consists of a series of steps intended to ensure that certain abnormal conditions are reset, such as enabling input to the application's visible windows. The driver then enters its own message loop, which is probably fundamentally similar to that contained in the faulting program. Ideally, this will restore the appearance of activity to the application, and the user will be able to access the application's menus and controls at least long enough and well enough to save the application's data to disk.
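The recovery strategy described in this paragraph, abandoning the faulting message and entering a substitute message loop, can be sketched as follows. This is a simulation under stated assumptions and does not reproduce the FIRST AID implementation: a Python `RuntimeError` stands in for the CPU fault, and the function names, the message names, and the `user_wants_reactivate` callback are hypothetical.

```python
from collections import deque


def window_procedure(state, message):
    """Hypothetical application code; the 'BAD' message triggers a fault."""
    if message == "BAD":
        raise RuntimeError("simulated CPU fault")
    if message == "SAVE":
        state["saved"] = True
    state["handled"].append(message)


def run_loop(queue, state):
    """A plain message loop: dispatch until a QUIT message arrives."""
    while queue:
        message = queue.popleft()
        if message == "QUIT":
            return
        window_procedure(state, message)


def run_with_recovery(queue, state, user_wants_reactivate):
    """Run the application's loop; on a fault, optionally enter a new loop."""
    try:
        run_loop(queue, state)
        return "exited normally"
    except RuntimeError:
        # The faulting message is abandoned; its processing is not completed.
        if not user_wants_reactivate():
            return "terminated"
        state["input_enabled"] = True   # reset an abnormal condition
        run_loop(queue, state)          # substitute message loop
        return "reactivated"
```

In this sketch, a queue of `"A"`, `"BAD"`, `"SAVE"`, `"QUIT"` with reactivation chosen leaves `"A"` and `"SAVE"` handled, so the data is saved even though the faulting message was never completed. The sketch handles only a single fault; a second fault raised inside the substitute loop would propagate to the caller.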
In less than ideal conditions, however, the method of FIRST AID and subsequent products may be limited to a certain class of application errors, may crash the program by offering to recover it from an error that would not have turned out to be fatal, or may cause the operating system itself to become deadlocked, requiring a system restart. Furthermore, by assuming that the error occurred while the program was executing its own code, FIRST AID ignores the possibility that the error may have occurred within the WINDOWS graphical user interface ("GUI") subsystem. Consequently, when FIRST AID creates a user interface element (such as a "dialog box") by which the user can choose to recover from the error, and issues WINDOWS API calls from within the new message loop, the WINDOWS subsystem may be reentered and further corrupted. The Microsoft documentation for the WINDOWS API function "InterruptRegister" notes in this regard that a fault callback procedure may "execute a nonlocal goto to a known position in the application . . . . This type of interrupt handling can be hazardous; the system may be in an unstable state and another fault may occur. Applications that handle interrupts in this way must verify that the fault was a result of the application's code." FIRST AID, however, performs no such verification.
In addition, FIRST AID and the other known products utilize WINDOWS Kernel services, such as those contained in the "ToolHelp" library, in order to trap the error conditions, and therefore the error handling and recovery code in these products executes at the same CPU privilege level and in the same CPU context as the faulting program. However, as discussed above, depending on the nature of the error (e.g. if the program's stack pointer has been corrupted), it may be impossible or inadvisable to perform any significant operation from within the fault handling procedure, including attempting to reactivate the program by reentering its message loop. Alternatively, stack fault errors may cause the fault handling code to be entered using a separate stack from the one used by the faulting program, in which case FIRST AID will not attempt to return to the original stack prior to resuming the program.
Moreover, certain faults do not cause the fault handling procedure to be executed at all, for example if the original fault ultimately results in another fault occurring within the WINDOWS Kernel as it is attempting to call the fault handling procedure. Finally, neither FIRST AID nor other crash protection implementations provide any safeguards that prevent a malfunctioning program from corrupting the WINDOWS Kernel or other system components.
Another known protection method, embodied in Symantec's NORTON CRASH GUARD product for WINDOWS 95, provides crash recovery as generally described above, and also allows deadlocked applications executing in infinite loops to be reactivated. NORTON CRASH GUARD accomplishes this by providing in its interface an option to reactivate a program that NORTON CRASH GUARD has adjudged to be deadlocked. However, in order to activate the NORTON CRASH GUARD interface and hence reactivate the deadlocked program, the WINDOWS GUI subsystem must be able to perform a focus switch away from the deadlocked program to the NORTON CRASH GUARD interface. Depending on the nature of the deadlock, this may not be possible. For example, it may not be possible to invoke the NORTON CRASH GUARD interface when the deadlocked program causes the system itself to appear deadlocked because of holding certain resources that the system must acquire in order to activate another program.
Consequently, in view of the known limitations of prior crash protection utilities used in the Microsoft WINDOWS environment, it would be desirable to have a utility that is not so limited. Specifically, such a protection utility would allow applications to safely recover from most unanticipated CPU exceptions, at least long enough to save any data. Such a protection utility would also safeguard the operating system from being corrupted by an errant application program, thereby enhancing overall system stability.