Failure diagnosis is the process of discovering the root cause of occurred failures based on a set of observed failure indications in the system. Fast and accurate diagnosis is essential to maintain the high availability of current computing systems.
The study of failure diagnosis in computing systems has gone on for quite a long time. Traditional approaches rely on profound understandings of the underlying system architecture and operational principles to build system models or a set of rules for the diagnosis. As the increasing complexities of current computing systems, however, it becomes hard to build a meaningful model or precise rules to facilitate the failure diagnosis. As an alternative, statistical learning based approaches received more attentions in recent years. Those methods identify the failure root cause by analyzing and mining a large amount of monitoring data collected from the failure system to characterize the failure behavior. However, those methods only output some prioritized failure symptoms such as the high CPU consumptions or disk usages. They do not provide further root causes of failures such as the broken hardware, configuration errors, and so on.
Accordingly, a need exists for a method and system that provides the root cause of failures occurring in large scale computing systems.