In software systems that interact with real-world hardware components, errors are often difficult to trace to a root cause. An error encountered by software interacting with one hardware component may cause a number of follow-on errors in other components or in software modules interacting with those components. In a software/hardware system that is capable of recording or displaying errors, multiple errors may be generated across a variety of internal subsystems in response to a single root problem in one hardware component or software module. Displaying these multiple related errors and associating them in a meaningful way can be a difficult problem.
In order for an operator or service engineer to successfully diagnose and resolve the overall problem in the system, the operator needs to determine the originating root cause error and treat it. The follow-on errors that may be generated from a root cause error are generally not important when diagnosing and treating the overall problem. However, in many prior art systems, the resulting errors are displayed to the operator, regardless of whether they are a likely root cause or not, in an unarranged, undistinguished fashion alongside the original root cause error. This problem may even exist where the systems have a means for associating errors.
There is often no simple way for the operator or engineer to distinguish which error is the root cause and which errors are follow-on errors (i.e., those that are often less important from the standpoint of solving the problem, being merely symptoms that result from the root cause error). The operator often needs to guess the root cause error based on other information such as error timing, experience, complex log files, or luck. This guesswork can be time-consuming and costly for the operator and/or service engineer and is often detrimental to the reliability of the hardware/software system.
In some systems, certain errors may be considered more critical than others. For example, an error that results in a total system shutdown and significant loss of productivity may be considered more severe than a simple timing error that results in a minimal loss of productivity. However, in many prior art systems, the overall severity of the root cause error may not be known until the follow-on errors are generated and manually correlated by the operator back to the root cause error. Particularly in systems where errors of lesser initial severity can be automatically hidden from the operator, it may be difficult to determine that an error that was hidden due to low severity is actually a root cause error of significance if it is not associated with its more severe follow-on errors. For example, the instrument may not display a particular root cause error to the operator at all if the system does not determine that the error is severe enough. Many prior art systems lack knowledge about the causal connections within a group of errors, and this frequently prevents the true severity of the originating error from being known.
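By way of illustration only, the severity problem described above can be sketched in a few lines of Python. The sketch assumes that each error record already carries a link to the error that caused it; the names `ErrorRecord`, `parent_id`, and `effective_severity`, as well as the numeric severity scale, are hypothetical and are not taken from any particular system. Under those assumptions, the true severity of a root cause error may be taken as the highest severity found anywhere in its chain of follow-on errors.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ErrorRecord:
    """Hypothetical error record; field names are illustrative assumptions."""
    error_id: int
    message: str
    severity: int                    # assumed scale: higher means more severe
    parent_id: Optional[int] = None  # id of the error that caused this one


def effective_severity(errors: List[ErrorRecord]) -> Dict[int, int]:
    """Map each root error id to the highest severity in its causal chain."""
    by_id = {e.error_id: e for e in errors}

    def root_of(e: ErrorRecord) -> int:
        # Follow parent links upward; the 'seen' set guards against cycles.
        seen = set()
        while (e.parent_id is not None and e.parent_id in by_id
               and e.error_id not in seen):
            seen.add(e.error_id)
            e = by_id[e.parent_id]
        return e.error_id

    result: Dict[int, int] = {}
    for e in errors:
        root = root_of(e)
        result[root] = max(result.get(root, 0), e.severity)
    return result


# Example data (invented for illustration): a low-severity timing error
# leads, through follow-on errors, to a system halt.
errors = [
    ErrorRecord(1, "servo timing drift", severity=1),
    ErrorRecord(2, "arm positioning failed", severity=3, parent_id=1),
    ErrorRecord(3, "system halted", severity=5, parent_id=2),
    ErrorRecord(4, "unrelated warning", severity=2),
]
print(effective_severity(errors))
```

In this example, error 1 would be hidden by a display filter that considers only its own severity, yet its causal chain contains the most severe error in the group. The sketch presupposes that the causal links are known, which is precisely the knowledge that many prior art systems lack.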
There is a need for a system that automatically determines the causality between errors in a group. Building a system that reliably solves this problem has previously been considered extremely difficult, if not impossible. While conventional systems can easily determine causal links in some cases, other cases were considered too difficult or had no known solution. In order for error-causality systems to be useful, they must be able to determine the causal relationships that commonly occur. This may be difficult if the software in the system has not been designed with this goal in mind. In particular, establishing the causality of software thread time-out errors (e.g., a software thread timing out on a lock held by another thread that is processing a different error) has been a difficult problem in system design.
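The thread time-out case mentioned above can be made concrete with a minimal Python sketch. It is offered only as one possible illustration, not as the mechanism of any particular system: the names `TrackedLock`, `holder_error_id`, `report_error`, and `do_work` are hypothetical. The sketch assumes that a thread records, on the lock it holds, the error it is currently processing, so that a second thread that times out waiting for the lock can report its time-out as a follow-on of that error.

```python
import itertools
import threading

_ids = itertools.count(1)  # simple error id generator for the sketch


class TrackedLock:
    """Hypothetical lock wrapper that remembers its holder's current error."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.holder_error_id = None  # error the current holder is processing

    def acquire(self, timeout, current_error_id=None):
        if self._lock.acquire(timeout=timeout):
            self.holder_error_id = current_error_id
            return True
        return False

    def release(self):
        self.holder_error_id = None
        self._lock.release()


def report_error(log, message, parent_id=None):
    """Append an error to the log and return its id."""
    error_id = next(_ids)
    log.append({"id": error_id, "message": message, "parent_id": parent_id})
    return error_id


def do_work(lock, log, timeout):
    """Try to take the lock; on time-out, link the error to the holder's error."""
    if not lock.acquire(timeout=timeout):
        # Simplification: holder_error_id is read without extra synchronization.
        return report_error(log, f"timed out waiting for {lock.name}",
                            parent_id=lock.holder_error_id)
    try:
        return None  # normal work would happen here
    finally:
        lock.release()


# Demonstration: the main thread holds the lock while handling a root error,
# and a worker thread times out waiting for the same lock.
log = []
lock = TrackedLock("servo_lock")
root_id = report_error(log, "servo motor stalled")
lock.acquire(timeout=1.0, current_error_id=root_id)
worker = threading.Thread(target=do_work, args=(lock, log, 0.05))
worker.start()
worker.join()
lock.release()
```

After the demonstration runs, the time-out entry in `log` carries the id of the "servo motor stalled" error as its `parent_id`, so the two errors are associated rather than appearing as unrelated entries.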
An example of a system where this need exists can be found in U.S. Pat. No. 7,381,370, assigned to the assignee of the present application and incorporated herein by reference. In complicated instruments, such as chemical analyzers, which may include a number of precisely moving parts, a root cause error may be simple in the real world but difficult to determine in software. For example, if a servo motor has become worn or stuck, it may result in errors in other mechanical portions of the instrument that interact with the motor. There is a specific need in chemical analyzers, medical devices, or other complex software/mechanical instruments to provide software mechanisms that simplify diagnosis and repair when errors occur in the system.
The present invention provides software mechanisms that facilitate the association of errors to assist in distinguishing root cause errors from follow-on errors. The present invention also provides software mechanisms that facilitate a simplified display of errors to assist operators in determining root cause errors.