1. Field of the Invention
The present invention generally relates to computer software testing and, more particularly, to a run time self-testing probe that provides a mechanism to detect and reveal failed software modules and assist in system recovery. The invention has applications in both single system and multi-system environments.
2. Description of the Prior Art
Software failure remains a major concern in system reliability because it can cause loss of availability of either the entire system or specific subsystems. However, if these failures are restricted to a subset of the system, it is possible for certain services to remain available while others are not. If the failure of subsystems, services and/or modules is detected promptly, the overall availability of the system can be improved via takeover, workload distribution or other recovery mechanisms. Detection of failed components is therefore an essential part of designing systems for high availability, and early detection can limit the damage done to other parts of the system by reducing the propagation of errors.
Software diagnostic systems are known in the prior art. For example, U.S. Pat. No. 4,595,981 to Leung discloses a method for the automatic testing of large, incrementally developed programs. This method monitors variables passed between modules and compares them to specified inputs. Program execution is suspended at the point where one module calls another so that the input values can be verified. Because it suspends execution, the Leung method is not intended for run time failure detection.
Brian Randell, in "System Structure for Software Fault Tolerance", IEEE Trans. on Software Engineering, Vol. SE-1, No. 2, June 1975, pp. 220-232, discusses acceptance tests that detect software errors within a recovery block. An acceptance test is local to a module: it checks only the local variables and logic of the part of the program it guards. It therefore does not capture the system level service information that is necessary to identify software faults of the type that occur in the field. Furthermore, field faults are often caused by interactions among multiple modules and by timing problems that arise at increased workloads, neither of which the acceptance test is designed to identify and isolate.
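For illustration only, the recovery block structure discussed by Randell can be sketched roughly as follows. This is a minimal sketch in Python, not code from the cited paper or from the present invention; the names `recovery_block`, `faulty_sort`, `fallback_sort` and `accept_for` are hypothetical, and the sorting task is an assumed example. Note that the acceptance test inspects only the result produced by the module, which is the locality limitation described above.

```python
import copy


class RecoveryBlockFailure(Exception):
    """Raised when every alternate fails the acceptance test."""


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in order; return the first accepted result."""
    for alternate in alternates:
        # Each attempt works on a fresh copy so a failed alternate
        # cannot corrupt the state seen by the next one.
        checkpoint = copy.deepcopy(state)
        try:
            result = alternate(checkpoint)
        except Exception:
            continue  # an exception counts as a failed alternate
        if acceptance_test(result):
            return result
    raise RecoveryBlockFailure("all alternates were rejected")


def faulty_sort(values):
    """Hypothetical primary alternate with a deliberate bug."""
    return sorted(values)[:-1]  # silently drops the largest element


def fallback_sort(values):
    """Hypothetical secondary alternate."""
    return sorted(values)


def accept_for(original):
    """Build an acceptance test that checks only the local result."""
    def test(result):
        same_length = len(result) == len(original)
        ordered = all(a <= b for a, b in zip(result, result[1:]))
        return same_length and ordered
    return test


data = [3, 1, 2]
result = recovery_block(data, [faulty_sort, fallback_sort], accept_for(data))
```

In this sketch the primary alternate is rejected because its result is one element short, and the secondary alternate is accepted. Nothing in the test observes other modules, timing, or workload, so a fault arising from those sources would pass unnoticed.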
What is needed is a way to detect software failures, some of which may be incipient or hidden, without interrupting program execution. In other words, a technique is needed that continually monitors a software system comprising many components, as in a large mainframe system. These components run asynchronously, and failures in this type of software are known to have large latencies. Ram Chillarege and Ravishankar K. Iyer, in "Measurement-Based Analysis of Error Latency", IEEE Transactions on Computers, Vol. C-36, No. 5, May 1987, pp. 529-537, report measured latencies ranging from a few hours to a few days. Furthermore, these latent errors are known to surface with changes in workload, such as those that occur between batch and on-line transaction processing, as reported by Ravishankar K. Iyer, Steven E. Butner and Edward J. McCluskey in "A Statistical Failure/Load Relationship: Results of a Multicomputer Study", IEEE Transactions on Computers, Vol. C-31, No. 7, July 1982, pp. 697-706. Ram Chillarege and Nicholas S. Bowen, in "Understanding Large System Failures--A Fault Injection Experiment", Dig. 19th Int. Symp. on Fault Tolerant Comp., June 1989, pp. 356-363, describe the use of fault injection to understand the failure characteristics of a large system. This paper reports the discovery of errors, termed "potential hazards", that remain dormant (latent) until a major shift in the workload occurs. These errors are likely to cause the workload dependent failures reported in the literature.