The present invention relates to operating system resource management, and more specifically, to systems and methods for detecting soft failures affecting operating system resource managers.
Soft failures, or “sick, but not dead incidents,” are a class of failure in complex computing systems, such as a multi-mainframe operating system, that cannot be detected by the stateless individual components. Soft failures are typically caused by rare combinations or sequences of legal events. Current techniques use existing historical data and mathematical modeling to predict normal behavior, and issue an exception when there is a significant deviation from normal, thereby indicating a soft failure.
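By way of illustration only, the deviation-from-normal approach described above may be sketched as follows. The sketch assumes a simple mean and standard-deviation model of a single metric; the function name `is_significant_deviation`, the `threshold` parameter, and the sample values are hypothetical and are not taken from any particular existing tool.

```python
from statistics import mean, stdev


def is_significant_deviation(history, observed, threshold=3.0):
    """Return True when `observed` lies more than `threshold` sample
    standard deviations away from the mean of `history`.

    Assumption: "normal behavior" is modeled only by the mean and
    standard deviation of previously captured samples.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return observed != baseline
    return abs(observed - baseline) / spread > threshold


# Hypothetical historical samples of one metric (e.g., requests per interval).
history = [100, 102, 98, 101, 99, 100, 103, 97]
print(is_significant_deviation(history, 101))  # within the normal range
print(is_significant_deviation(history, 140))  # far outside the normal range
```

A sketch of this kind is only as good as the granularity of the historical data supplied to it.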
The problem with the current approach to detecting soft failures is that the tools either use existing historical data that was captured for some other purpose and lacks the necessary granularity, or require a change in the behavior of a component by inserting an agent that destroys the statelessness of the component. Furthermore, there is a need to detect not only when the component is experiencing a soft failure, but also when specific transactions being processed by the component experience a soft failure. A soft failure within a transaction can create several issues, including but not limited to: 1) a class of transactions that never completes; 2) a class of transactions that never runs to successful completion; and 3) a class of transactions with excessive or unusual numbers of failures. Conventionally, there is no way to externally monitor for soft failures related to transaction processing without changing existing resource managers.
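For illustration only, the three transaction-level issues enumerated above may be expressed as checks over per-class counters. The sketch assumes that counts of started, completed, and successfully completed transactions are available for each transaction class; the `ClassStats` type, the `soft_failure_symptoms` function, and the `max_failure_rate` value are hypothetical, and the sketch does not address how such counts would be obtained.

```python
from dataclasses import dataclass


@dataclass
class ClassStats:
    """Hypothetical counters for one class of transactions."""
    started: int
    completed: int   # finished, whether successfully or not
    succeeded: int   # finished successfully


def soft_failure_symptoms(stats, max_failure_rate=0.2):
    """Return the symptoms, if any, that a transaction class exhibits.

    Assumption: a failure rate above `max_failure_rate` counts as an
    "excessive" number of failures; the value is illustrative only.
    """
    symptoms = []
    if stats.started == 0:
        return symptoms  # nothing has run, so nothing can be inferred
    if stats.completed == 0:
        symptoms.append("never completes")
        return symptoms
    failed = stats.completed - stats.succeeded
    if stats.succeeded == 0:
        symptoms.append("never runs to successful completion")
    elif failed / stats.completed > max_failure_rate:
        symptoms.append("excessive failures")
    return symptoms


print(soft_failure_symptoms(ClassStats(started=10, completed=0, succeeded=0)))
print(soft_failure_symptoms(ClassStats(started=10, completed=8, succeeded=0)))
print(soft_failure_symptoms(ClassStats(started=10, completed=10, succeeded=5)))
```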