Known information technology (IT) systems that attempt to make services continuously available (i.e., highly available), regardless of hardware and software failures, include (1) fault tolerant systems and (2) fault resilient systems. A fault tolerant system tolerates any software and/or hardware fault within a system boundary and continues to provide services without interruption. Every critical component in the fault tolerant system is duplicated, so the redundant components sit idle as standby components, which makes the system costly. A fault resilient system, also known as a high availability (HA) cluster, replicates only a few of the critical software and hardware components to increase the overall availability of the system compared to a standalone system. By replicating only some of the critical components, the fault resilient system is an economical alternative to the fault tolerant system.
Failover in an HA system is mostly event-based. Whether a failover will succeed or fail can be predicted in real time by monitoring the individual events that occur during the failover. Existing tools, however, cannot predict or investigate the actual root cause of a failure of the component(s) associated with one or more of those events.
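As an illustration only, the event-based monitoring described above might be sketched as follows. The event names in `EXPECTED_SEQUENCE`, the `FailoverEvent` type, and the `predict_failover` function are all hypothetical and are not taken from any particular HA product. The sketch assumes a failover is a fixed, ordered sequence of events and that each observed event reports whether it succeeded.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence

# Hypothetical ordered failover steps, assumed for illustration only.
EXPECTED_SEQUENCE = (
    "fault_detected",
    "resources_released",
    "standby_promoted",
    "services_restarted",
)


@dataclass
class FailoverEvent:
    name: str        # event identifier (hypothetical naming)
    succeeded: bool  # whether the step reported success


def predict_failover(
    events: Iterable[FailoverEvent],
    expected: Sequence[str] = EXPECTED_SEQUENCE,
) -> str:
    """Return 'success', 'failure', or 'in_progress' for the events seen so far."""
    position = 0
    for event in events:
        if event.name not in expected:
            continue  # ignore events unrelated to the failover sequence
        if not event.succeeded:
            return "failure"  # a failover step reported an error
        if position >= len(expected) or event.name != expected[position]:
            return "failure"  # step arrived out of order or was repeated
        position += 1
    # All expected steps seen in order means success; otherwise still running.
    return "success" if position == len(expected) else "in_progress"
```

A monitor of this kind can flag which event failed, but it says nothing about why the underlying component failed, which is the limitation of existing tools noted above.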