Failures are unhandled conditions in an application. Failure diagnosis is the process of discovering the root cause of a failure from a set of failure indications observed in the system. Fast and accurate diagnosis is essential to maintaining the high availability of current computing systems. Failure diagnosis in computing systems has been studied for a long time. Traditional approaches rely on a deep understanding of the underlying system architecture and operational principles to build system models or sets of rules for diagnosis. With the increasing complexity of current computing systems, however, it has become hard to build a meaningful model or precise rules to support failure diagnosis. As an alternative, statistical learning based approaches have received more attention in recent years. These methods aim to identify the failure root cause by analyzing and mining large amounts of monitoring data collected from the failing system to characterize the failure behavior. However, they output only a prioritized list of failure symptoms, such as high CPU consumption or disk usage.
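To make the notion of "prioritized failure symptoms" concrete, the following minimal sketch ranks monitored metrics by how far they deviate from a healthy baseline. It is an illustration only: the function name `rank_symptoms`, the metric names, the sample values, and the use of a simple z-score are all assumptions chosen for this example, not a method taken from any particular approach discussed here.

```python
from statistics import mean, stdev


def rank_symptoms(baseline, failure_window):
    """Rank metrics by deviation of the failure window from the baseline.

    Both arguments map a metric name to a list of numeric samples.
    Returns (metric, z_score) pairs sorted by descending |z_score|.
    The z-score is a stand-in for whatever statistic a real system uses.
    """
    scores = []
    for metric, normal in baseline.items():
        observed = failure_window.get(metric)
        if not observed or len(normal) < 2:
            continue  # not enough data to compare
        mu, sigma = mean(normal), stdev(normal)
        if sigma == 0:
            continue  # constant baseline: z-score is undefined
        scores.append((metric, (mean(observed) - mu) / sigma))
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical monitoring samples (illustrative values only).
baseline = {
    "cpu_percent": [20, 22, 19, 21, 20],
    "disk_used_percent": [50, 51, 50, 52, 51],
    "net_mbps": [10, 12, 11, 9, 10],
}
failure_window = {
    "cpu_percent": [95, 97, 96],
    "disk_used_percent": [52, 53, 52],
    "net_mbps": [10, 11, 10],
}

ranking = rank_symptoms(baseline, failure_window)
for metric, z in ranking:
    print(f"{metric}: z = {z:.2f}")
```

The output of such a procedure is a ranked list of anomalous metrics, with CPU usage first in this toy data. It says which symptoms stand out, not why they occurred, which is the limitation noted above.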
If an error condition arises but is detected and handled gracefully by the computer system, it is not categorized as a failure. Errors that are not handled gracefully form the failures of the system. Many of these are explicitly observed in console or field failure logs as crashes, hangs, or errors, while others are silent failures that can lead to silent data loss, silent data corruption, or incorrect execution of a workflow. Any technique used for field failure analysis should therefore not just indicate anomalies in the system but also help human operators localize these problems for faster diagnosis.