The present invention generally relates to computers, and more particularly relates to detecting soft failures within a computing system.
The next critical resiliency challenge is soft failures where the complex system (cloud, containers within a hybrid, an operating system, middleware, or customer application) continues to work but does not provide the needed service. When this type of problem occurs it has a major impact on the customer's IT solution. The component experiencing the failure is unable to detect that the failure is occurring because most of these problems are caused by legal, but abnormal behavior. Conventional soft failure systems can detect certain abnormal behaviors in real time usually before the operations team has observed or been notified about the problem. However, these conventional systems generally depend on the process being monitored emitting too many artifacts (e.g., message identifiers, LOGREC records or records that include information about an abnormal occurrence within a given computing system, using too many processor resources, etc.). Therefore, in many situations these conventional systems can mistakenly classify a process as “normal” based on a “too many” threshold.