A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever. The following notice shall apply to this document: Copyright © 1999, Microsoft, Inc.
The present invention pertains generally to fault testing of computer software, and more particularly to fault testing system software for handling of uncommonly occurring conditions.
System software has traditionally been complex to test. The operational system must be able to handle a variety of exceptional conditions that, while occurring only occasionally under typical operating conditions, are potentially serious. Some of these conditions include, for example, disk I/O failure, network communications timeout or failure, and out-of-memory failure. Because these conditions occur so rarely, conventional strategies for testing the ability of the system software to handle these conditions have involved large scale stress testing or artificial fault induction. In large scale stress testing, the operating system is used for extended periods of time so that the exceptional conditions are likely to occur naturally. A significant drawback to this approach is the substantial time required for testing.
Artificial fault induction avoids this problem to some extent by simulating faults. For example, random requests to low level routines can be failed. Alternatively, a routine can be failed after a predetermined number of calls to the routine. While fault induction approaches reduce the time involved in testing system software, they are susceptible to a phenomenon known as call path skew. For example, a routine that can throw a low level exception can be called from two other routines A and B, of which routine A is called much more frequently than routine B. If routine B cannot properly handle the exception thrown by the low level routine, conventional fault induction approaches may fail to throw the exception when the low level routine is called by routine B, because routine B is called so infrequently. Some approaches use a random number generator to limit the frequency with which the called routine throws exceptions, allowing the system to proceed effectively while under the test workload. Even with this limiting measure, however, most exceptions are thrown in the context of the routines that are called most frequently. Functions that are not called often receive few, if any, exceptions to handle.
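Call path skew can be illustrated with a small simulation. The following Python sketch is not taken from the disclosure; the routine names, call counts, failure rate, and the helper `simulate_random_induction` are assumptions chosen only for illustration. It fails a hypothetical low level routine at random and counts how many induced faults land in each caller.

```python
import random


def low_level(caller, rng, fail_rate, hits):
    """Hypothetical low level routine that fails at random with probability fail_rate."""
    if rng.random() < fail_rate:
        hits[caller] += 1  # record which caller received the induced fault


def simulate_random_induction(calls_a=10_000, calls_b=10, fail_rate=0.001, seed=0):
    """Interleave calls from routines A and B and count induced faults per caller."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    hits = {"A": 0, "B": 0}
    schedule = ["A"] * calls_a + ["B"] * calls_b
    rng.shuffle(schedule)
    for caller in schedule:
        low_level(caller, rng, fail_rate, hits)
    return hits
```

Under these assumptions the expected number of faults delivered to each caller is proportional to its call count, so with the default arguments routine A would be expected to receive on the order of ten faults while routine B would be expected to receive far fewer than one.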
Thus, conventional fault induction approaches may fail to detect the inability of an infrequently called routine to handle exceptional operating conditions. Accordingly, a need continues to exist for a system that can detect such errors in even infrequently called program modules.
According to various example implementations of the invention, there is provided an efficient system for fault testing system software, as described herein below. In particular, the invention provides, among other things, for the use of a hash table or other data structure for tracking routines that have been subjected to induced faults and exceptions during testing of the system software. The hash table or other tracking mechanism is consulted as routines are encountered while running a test workload to determine which routines have not yet been subjected to induced exceptions. These routines are then subjected to induced exceptions.
Because the routines that have been subjected to induced exceptions are tracked, the system can induce a more uniform distribution of exceptions for all routines, especially those that are encountered only rarely under typical operating conditions. Thus, the system is able to ensure proper exception handling by all routines.
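The tracking approach described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `InducedFault`, `FaultTracker`, and `run_workload` are hypothetical, routines are identified by simple string labels, and a Python `set` (itself a hash-based container) stands in for the hash table.

```python
class InducedFault(Exception):
    """Hypothetical exception standing in for an induced low level fault."""


class FaultTracker:
    """Tracks which routines have already been subjected to an induced fault."""

    def __init__(self):
        self._faulted = set()  # hash-based record of routines already tested

    def check(self, routine_id):
        """Induce a fault the first time a routine is seen; do nothing afterwards."""
        if routine_id in self._faulted:
            return
        self._faulted.add(routine_id)
        raise InducedFault(routine_id)


def run_workload(tracker, schedule):
    """Run the schedule and count the induced faults delivered to each routine."""
    delivered = {}
    for routine_id in schedule:
        try:
            tracker.check(routine_id)
        except InducedFault:
            delivered[routine_id] = delivered.get(routine_id, 0) + 1
    return delivered
```

In this sketch, a workload that calls routine A a thousand times and routine B once delivers exactly one induced fault to each, regardless of how often each routine is called.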
In a particular implementation, a ring buffer is used to track routine paths and to store associated parameters, so that the events preceding a failure to properly handle an induced exception or fault are preserved for debugging.
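One way such a ring buffer might look is sketched below. The class name `PathRingBuffer`, its capacity, and the shape of the recorded entries are assumptions made for illustration; the sketch relies on the standard `collections.deque` with a `maxlen`, which discards the oldest entry when a new one is appended to a full buffer.

```python
from collections import deque


class PathRingBuffer:
    """Fixed-size history of recently executed routines and their parameters."""

    def __init__(self, capacity=4):
        # A deque with maxlen drops the oldest entry once capacity is reached.
        self._events = deque(maxlen=capacity)

    def record(self, routine, **params):
        """Store the routine name together with the parameters it was called with."""
        self._events.append((routine, params))

    def dump(self):
        """Return the retained events, oldest first, for inspection after a failure."""
        return list(self._events)
```

After a routine fails to handle an induced fault, `dump()` would return only the most recent entries, which bounds memory use while keeping the events that immediately preceded the failure.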
In another particular implementation, program addresses are used to uniquely identify each path that has been subjected to an induced exception.
Another implementation is directed to a method for testing fault tolerance by executing the computer program, maintaining a record of paths in the computer program that have been subjected to an induced exception, and inducing an exception if a path currently being executed has not already been subjected to an induced exception.
Still other implementations are directed to computer-readable media and computer arrangements for performing these methods.