The present invention relates to computers, and more specifically, to the detection of a damaged software system within a computing system.
In many applications, such as those involving mainframe computers (or servers), resiliency with respect to crashes is highly important. Accordingly, systems employed in such applications are designed to be able to manage multiple failures related to the software applications being run on the system without affecting the system as a whole.
An example of such a highly resilient system may be found in the software stack of a mainframe computer. A “software stack” is a set of programs that work together to produce a result, for example, an operating system and its applications. The term may also refer to any group of applications that work in sequence toward a common result or to any set of utilities or routines that work as a group. Of course, such resiliency could exist in other contexts as well, such as a personal computer.
When a highly resilient system like a mainframe software stack is damaged by a software defect, it frequently generates a high rate of critical failures, caused by either recurring or recursive failures, leading to abnormal ends (abends). Such systems, however, can often survive multiple failures without any failure being visible to the operations team or to the users of the services provided by the stack. Because these highly resilient systems can survive a significant number of failures, operations teams and system users have come to accept some number of these failures as normal behavior. However, the combination of these failures and some other event can cause the stack to fail. Moreover, if the number of failures is excessive (i.e., abnormal behavior), the stack could fail due to the cumulative effects of all these failures.
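The distinction drawn above between a tolerated number of failures and an excessive number can be illustrated with a minimal sketch. It is offered only as an illustration under stated assumptions and is not the detection method of the invention. The class name `FailureRateMonitor`, the parameters `window_seconds` and `normal_max_failures`, and the use of a sliding time window are all hypothetical choices made for this example; timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class FailureRateMonitor:
    """Hypothetical sketch: flags a failure count above an assumed normal baseline."""

    def __init__(self, window_seconds, normal_max_failures):
        # Both values are assumptions an operations team would have to choose.
        self.window_seconds = window_seconds
        self.normal_max_failures = normal_max_failures
        self._timestamps = deque()

    def record_failure(self, timestamp):
        """Record one failure (e.g., an abend); return True if the count is excessive."""
        self._timestamps.append(timestamp)
        # Drop failures that have aged out of the sliding window.
        while timestamp - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        # Counts up to the baseline are treated as normal behavior.
        return len(self._timestamps) > self.normal_max_failures
```

With a 60-second window and a baseline of three failures, for instance, the first three failures inside the window are treated as normal and a fourth failure inside the same window is flagged as excessive.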