The growing complexity of large infrastructures, such as datacenters, makes system behavior increasingly difficult to understand. System administrators routinely analyze metrics extracted from components of the system, relationships between those components, and the overall system itself. Root cause analysis (RCA) systems enable system administrators to identify the cause of a particular failure in a monitored system. An RCA system performs calculations (e.g., evaluations of diagnostic models for specific components or devices) that enable it to detect a failure and identify its cause.
Failures of a component may have different characteristics. For example, a failure could be a single-component failure, a multi-component failure, a local failure, or another type of failure. When a failure prevents the RCA system from performing its calculations, the RCA system is unable to output a diagnosis of the monitored system. Examples of such failures include a device being out of network coverage, a device having insufficient power, a regional or global network overload, a lack of resources (which may cause complex calculations to fail), and the like.
Even when the failure is not located in a particular component or device (a “resource”), the diagnostic model for that resource may depend on input from another device or component that may have failed. In current RCA systems, the entire diagnostic process may be restarted to enable the missing calculation to be performed. However, a full restart is inefficient because it repeats calculations that already completed successfully and therefore consumes excessive computing resources (e.g., computer memory, CPU time, or energy).
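The cost of a full restart can be illustrated with a minimal sketch. It is not taken from any particular RCA system: the resource names, the `run_diagnosis` function, and the toy diagnostic model (a resource's own metric plus the results of the resources it depends on) are all hypothetical and chosen only for illustration. The sketch shows how one unreachable resource leaves the calculations for its dependents missing, and how a second full pass repeats calculations that had already succeeded.

```python
from typing import Dict, List, Optional, Tuple


def run_diagnosis(
    order: List[str],
    depends_on: Dict[str, List[str]],
    metrics: Dict[str, Optional[float]],
) -> Tuple[Dict[str, Optional[float]], int]:
    """Evaluate every (toy) diagnostic model in dependency order.

    A result is None when the resource's own metric is unavailable or when
    any resource it depends on has no result. Returns the results and the
    number of model evaluations attempted.
    """
    results: Dict[str, Optional[float]] = {}
    evaluations = 0
    for name in order:
        evaluations += 1
        inputs = [results[dep] for dep in depends_on[name]]
        own = metrics[name]
        if own is None or any(value is None for value in inputs):
            results[name] = None  # calculation could not be performed
        else:
            results[name] = own + sum(inputs)  # toy model, assumption only
    return results, evaluations


# Hypothetical resources, listed so that dependencies come first.
ORDER = ["sensor", "switch", "server", "storage"]
DEPENDS_ON = {"sensor": [], "switch": ["sensor"], "server": ["switch"], "storage": []}

# First pass: the sensor is unreachable (e.g., out of network coverage).
first_pass, first_count = run_diagnosis(
    ORDER, DEPENDS_ON,
    {"sensor": None, "switch": 1.0, "server": 2.0, "storage": 3.0},
)
missing = [name for name, value in first_pass.items() if value is None]

# Full restart once the sensor is reachable again: every model is re-evaluated.
second_pass, second_count = run_diagnosis(
    ORDER, DEPENDS_ON,
    {"sensor": 0.5, "switch": 1.0, "server": 2.0, "storage": 3.0},
)

# Evaluations in the restart that had already succeeded in the first pass.
repeated = second_count - len(missing)
print(missing, second_count, repeated)
```

In this toy setup only the storage calculation is repeated unnecessarily, but in a large infrastructure the share of already-completed calculations repeated by a full restart can be far greater than the share that was actually missing.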