1. Field of the Invention
The present invention relates generally to root cause and corrective action, and more particularly to automated systems to facilitate root cause and corrective action.
2. Description of Related Art
With advances in technology, computer system product life cycles are getting shorter whereas the systems themselves are becoming more and more complex. At the same time, there is ever-increasing pressure to reduce the time and cost involved in isolating and fixing any faults that are detected during the productive life of the system.
A defect/fault refers to a deviation of the functioning of a component from its desired behavior. Stimulation of a fault results in a failure, and the manifestation of a failure is called an error. In other words, an error is a symptom of a fault and is seen on the occurrence of a failure. As such, a fault can exist in a system and stay undetected (absence of errors) until a particular kind of stimulus is applied that causes the failure.
Correlating errors to a fault is called diagnosis. Analyzing the cause of a defect/fault is called root cause analysis (RCA), an important first step in taking corrective action for improving product quality. Oftentimes, identifying the factors (stimuli) that impact the time to failure can lead to timely root cause and corrective action (RCCA).
The need for timely root cause and corrective action is well understood. Those who are tasked with conducting root cause and corrective action also know how important it can be to reproduce a failure to identify the stimuli of interest. At the same time, it can be difficult to reproduce certain transient faults that may have been stimulated by a specific (yet unknown) set of stimuli. Such stimuli can consist of specific test algorithms, data-patterns, addressing sequences, environmental conditions etc., or a combination thereof.
To complicate matters, most of the hardware errors in today's systems are reported asynchronously. This is true for correctable and non-fatal errors, both of which may be indications of an incipient fault. For example, in the case of ECC (Error Correction Code) protected memory, every time the processor reads a memory location, the processor checks the correctness of the data against the ECC previously stored during the write operation. On detecting a single-bit upset, the processor may transparently provide the corrected data to the application (possible stimulus) that requested the data. While the processor may also generate a trap to report the error event, by default this is totally transparent to the application. As such, the application that stimulated the failure is not even aware of the error.
One can see how such mechanisms are required and useful from a normal user/customer application's point of view. At the same time, this means that special test applications, designed to stimulate errors, have to deal with an extra hurdle to detect the occurrence of an error and log information about the activity that might have stimulated the fault. For failures that are stimulated by a combination of different stimuli, such as temperature, voltage, and signal noise, the process of error duplication and root cause analysis can be even more complicated.
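The transparency described above can be illustrated with a minimal sketch. It assumes a toy Hamming(7,4) code over 4-bit values rather than the ECC scheme of any real memory controller, and the names `EccMemory`, `flip_bit`, and `trap_log` are hypothetical. The point is only that the reader of the data receives the corrected value, while the record of the error goes to a separate log that the reader never sees.

```python
def encode(nibble):
    """Encode a 4-bit value as a Hamming(7,4) codeword (list of 7 bits)."""
    code = [0] * 8  # index 0 unused; positions 1..7
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4,5,6,7
    return code[1:]


def decode(bits):
    """Return (value, syndrome); a nonzero syndrome is the corrected bit position."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    if syndrome:
        code[syndrome] ^= 1  # correct the single-bit upset
    data = (code[3], code[5], code[6], code[7])
    return sum(bit << i for i, bit in enumerate(data)), syndrome


class EccMemory:
    """Hypothetical memory model: reads are corrected silently, errors are logged aside."""

    def __init__(self):
        self.cells = {}
        self.trap_log = []  # stands in for the asynchronous error report

    def write(self, addr, nibble):
        self.cells[addr] = encode(nibble)

    def flip_bit(self, addr, position):
        """Inject a single-bit upset at codeword position 1..7 (for illustration)."""
        self.cells[addr][position - 1] ^= 1

    def read(self, addr):
        value, syndrome = decode(self.cells[addr])
        if syndrome:
            self.trap_log.append({"addr": addr, "bit": syndrome})
        return value  # the caller cannot tell that a correction happened
```

In this sketch the application calling `read` has no way to learn that it stimulated a failure unless it separately inspects `trap_log`, which mirrors the hurdle faced by test applications.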
Efforts to facilitate root cause and corrective action are not new, and various tools have been developed for this purpose. With the introduction of a Fault Management Architecture that provides Predictive Self-Healing in the Solaris 10 operating system, available from Sun Microsystems, Inc. of Santa Clara, Calif., the operating system has taken on the onus of fault diagnosis and management.
With the Fault Management Architecture, a fault or defect in software or hardware can be associated with a set of possible observed symptoms, called errors. When an error is observed, an error report is generated. Error reports are encoded as a set of name-value pairs, described by an extensible protocol, forming an error event.
Error events and other data that can be gathered to facilitate automated repair of the fault are dispatched to diagnosis engines designed to diagnose the underlying problems corresponding to these symptoms. Diagnosis engines run in the background silently consuming telemetry until a diagnosis can be completed or a fault can be predicted. After processing sufficient telemetry to reach a conclusion, a diagnosis engine produces another event, called a fault event, which is broadcast to any agents deployed on the system that know how to respond.
A software component known as a Fault Manager in the Fault Management Architecture, which is implemented as a daemon, manages the diagnosis engines and agents; provides a simplified programming model for these clients as well as common facilities such as event logging; and manages the multiplexing of events between producers and consumers.
Thus, the Fault Management Architecture does the fault diagnosis based on the error reports (e-reports). The Fault Manager collects the e-report(s) and utilizes the diagnosis engines to identify the action needed to “isolate” the impact of the fault from the rest of the working system. This action is based on a set of pre-defined rules that may involve diagnosing the error to find the exact fault. The primary goal is to deduce an actionable conclusion.
Whether or not the exact fault is identified, the action in general consists of identifying an Automated System Recovery Unit (ASRU) that can be disabled to isolate the impacts of the fault. The Fault Management Architecture does not attempt to collect telemetry about the stimuli that may have instigated the fault.
The action that the Fault Management Architecture takes is the right first response: it contains the problem and keeps the system running. However, the more time-consuming root cause and corrective action phase comes later and is not dealt with by the Fault Management Architecture.
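The flow of error events to diagnosis engines, and of fault events to agents, can be sketched as follows. This is a simplified illustration and not the actual Fault Manager programming interface: the classes `FaultManager` and `ThresholdEngine`, the event class strings, and the `detector` and `asru` keys are all hypothetical. The diagnosis rule used here (declare a fault after a fixed number of error reports from one unit) is likewise only an assumed example of a pre-defined rule.

```python
from collections import defaultdict


class ThresholdEngine:
    """Hypothetical diagnosis engine: N error reports from one unit imply a fault."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = defaultdict(int)

    def consume(self, ereport):
        unit = ereport["detector"]
        self.counts[unit] += 1
        if self.counts[unit] == self.threshold:
            # A fault event is itself a set of name-value pairs.
            return {"class": "fault.example.too-many-errors", "asru": unit}
        return None  # not enough telemetry yet to reach a conclusion


class FaultManager:
    """Hypothetical manager: logs events and multiplexes them to engines and agents."""

    def __init__(self):
        self.engines = []
        self.agents = []
        self.event_log = []

    def publish(self, ereport):
        self.event_log.append(ereport)
        for engine in self.engines:
            fault = engine.consume(ereport)
            if fault is not None:
                self.event_log.append(fault)
                for agent in self.agents:
                    agent(fault)  # e.g. disable the named ASRU


manager = FaultManager()
manager.engines.append(ThresholdEngine(threshold=3))
disabled_units = []
manager.agents.append(lambda fault: disabled_units.append(fault["asru"]))

for _ in range(3):
    manager.publish({"class": "ereport.example.correctable", "detector": "dimm0"})
```

Note that nothing in this flow records what the system was doing when each error report arrived; the events describe the symptom and the unit to isolate, not the stimulus.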
A Continuous System Telemetry Harness (CSTH) from Sun Microsystems, Inc. records system environmental data (component temperatures, voltage levels, etc., typically available via the “showevn” command) as continuous time series signals. In addition, under certain circumstances CSTH aims to analyze the archived data and predict failures that may occur in the future.
For example, predictions may be based on voltage or temperature fluctuations historically known to be indicative of an incipient fault. Such prediction is typically possible (and useful) in cases where the degradation may appear hours or sometimes days in advance of failure.
For crashes that occur with no predictive warning, the CSTH is still often valuable because the archived telemetry data may be mined to identify signatures from variables that showed anomalies just prior to the crash, thereby helping to mitigate No-Trouble-Found (NTF) events. As CSTH keeps a circular file of the captured telemetry, the information in the file can also be used to validate the functionality of various sensors (that monitor voltage and temperature) of the system.
Although CSTH implements an excellent ‘Black Box Flight Recorder’ for computer systems, CSTH does not automatically correlate asynchronous error events with the stimuli. Root cause analysis requires post-processing of significant amounts of data and a manual correlation of the stimulus with the telemetry readings (say, based on time stamps). While the circular log of all telemetry information is useful in some situations (as explained above), such a file is not well-suited for correlating error occurrences with test stimuli because some of the needed stimulus information may have been overwritten.
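The overwriting problem can be shown with a small sketch. It assumes a circular file behaves like a fixed-capacity buffer that discards its oldest record on each new write; the record layout, the capacity, and the helper names `capture` and `samples_near` are hypothetical and do not describe the actual CSTH file format.

```python
from collections import deque


def capture(samples, capacity):
    """Keep only the newest `capacity` samples, as a circular file would."""
    ring = deque(maxlen=capacity)
    for sample in samples:
        ring.append(sample)  # silently drops the oldest entry when full
    return list(ring)


def samples_near(records, error_time, window):
    """Time-stamp correlation: records taken within `window` before the error."""
    return [r for r in records if error_time - window <= r["t"] <= error_time]


# Ten samples; the stimulus of interest was applied at t = 3.
samples = [
    {"t": t, "note": "stimulus applied" if t == 3 else "idle"}
    for t in range(10)
]
retained = capture(samples, capacity=4)           # only t = 6..9 survive
candidates = samples_near(retained, error_time=9, window=10)
```

With an error reported at t = 9, the search over the retained records finds nothing but idle samples, because the record of the stimulus at t = 3 has already been overwritten.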
Test suites, like the SunVTS™ diagnostic tool, can have multiple test processes running concurrently on the system, and different tests may start and complete asynchronously. Furthermore, such tests can generate massive amounts of messages (telemetry) about test progress, patterns, and the algorithm/logic being executed. This can lead to huge logs in a short amount of time and makes it difficult to manually correlate exactly which test processes were running at the time of an error. Thus, while these various systems represent significant advances, root cause and corrective action still requires manually combing through extensive logs that may or may not contain the information about the stimuli that resulted in the error.
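The manual correlation step described above amounts to an interval lookup, sketched below under stated assumptions. The `RunRecord` type, the test names, and the numeric time stamps are hypothetical; a real test suite log would first have to be parsed into start and end times, which is itself part of the manual effort.

```python
from typing import NamedTuple, Optional


class RunRecord(NamedTuple):
    """Hypothetical summary of one test process extracted from a log."""
    name: str
    start: float
    end: Optional[float]  # None means the test was still running


def active_at(runs, error_time):
    """Return the names of test processes that were running at `error_time`."""
    return [
        run.name
        for run in runs
        if run.start <= error_time and (run.end is None or error_time < run.end)
    ]


runs = [
    RunRecord("memtest", 0.0, 50.0),
    RunRecord("cputest", 20.0, 30.0),
    RunRecord("disktest", 40.0, None),
]
suspects = active_at(runs, error_time=45.0)
```

Even in this simplified form, the lookup only narrows the error down to a set of concurrently running tests. It does not say which data pattern or addressing sequence within those tests was in progress, which is the information root cause analysis usually needs.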