The present invention relates generally to diagnosis of fault conditions in computer systems, and specifically to fault diagnosis using error log analysis.
Because of the increasing complexity of computers and computer-based systems, system administrators and maintenance personnel generally do not have sufficient knowledge and expertise to diagnose all of the faults that can occur in these systems. A variety of diagnostic tools have been developed in order to help in identifying the cause of such faults and determining the corrective action that must be taken. These tools generally receive and analyze error reports from different system components. In their most basic embodiments, the analysis is based on simple, pre-programmed “if-then” rules. More sophisticated tools have been developed that use techniques such as artificial intelligence, expert systems, neural networks and inference engines. Tools of this sort are described, for example, in U.S. Pat. Nos. 4,633,467, 4,964,125 and 5,214,653, whose disclosures are incorporated herein by reference.
In many computer systems, a system error log stores a record of all of the error reports that are received from system components. The error log is supposed to be used by the system administrator or maintenance engineer in tracing and understanding faults that have occurred. The number of errors in the log can be very large, however, and with the exception of a few patterns that the system administrator may recognize from experience, the error log generally provides no clue as to the source of the error or how to solve it. At best, an enterprising system administrator may be able to find faults that are relatively straightforward by looking up error codes from the error log in a system maintenance manual. In more complex cases, the system administrator may not even be able to determine whether the entries in the error log are due to a hardware fault or to a software problem.
U.S. Pat. No. 5,463,768, whose disclosure is incorporated herein by reference, describes a method and system for automatic error log analysis. A training unit receives historical error logs, generated during abnormal operation or failure of machines of a given type, together with the actual repair solutions that were applied to fix the machines in these circumstances. The training unit identifies and labels sections, or blocks, within the error logs that are common to multiple occurrences of a given fault. These blocks are assigned a weight indicative of their value in diagnosing the fault. A diagnostic unit receives new error logs associated with abnormal operation or failure of a similar machine, and compares the new error logs to the blocks identified by the training unit. The diagnostic unit uses similarities that it finds between blocks in the new error log and the identified historical blocks to determine a fault diagnosis and suggested solution. The solution receives a score, or similarity index, based on the weights of the blocks.
It is an object of the present invention to provide improved methods and apparatus for diagnosing faults in a computer system.
It is a further object of some aspects of the present invention to provide methods and apparatus that assist the operator of a computer system in understanding and repairing faults that occur in the system.
It is still a further object of some aspects of the present invention to provide improved methods and apparatus for analysis of an error log generated by a computer system.
In preferred embodiments of the present invention, an error log analyzer (ELA) scans error logs generated by a computer system. The logs are preferably generated whenever the system is running and are analyzed by the ELA at regular intervals and/or when a fault has occurred. The ELA typically comprises a software process running on a node of the computer system. Alternatively, the ELA may comprise dedicated computing hardware.
The ELA processes error log data in three stages:
A selection stage, in which the ELA determines, for each error in the log, whether the error is of relevance to fault conditions of interest. Relevant errors are held for further processing, while irrelevant errors are discarded.
A filtering stage, in which certain errors are composed, i.e., filtered and grouped together, into events, which are known to be associated with particular fault conditions.
An analysis stage, in which the events are checked in order to decide whether their numbers and types are such as to indicate that a fault exists that requires service attention. If so, the problem and, preferably, suggested solutions are reported to a system operator.
At each stage, the ELA processes the errors or events in accordance with predetermined decision criteria. The criteria are expressed in terms of parameters, which are preferably held in suitable tables. Unlike diagnostic systems known in the art, such as expert systems and neural networks, the tables can be edited and updated by development and support personnel, based on field experience with the system and on the particular operating conditions and requirements to which a given system is subjected. The tables can also be copied from one computer system to another. Thus, the present invention provides a tool for fault diagnosis that can be made to identify and offer solutions to an essentially unlimited range of errors appearing in the error log, based on decision criteria that are accessible for adjustment and modification by users in a straightforward manner.
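By way of illustration only, the three table-driven stages described above might be sketched as follows. The table names, error kinds, event names and thresholds are hypothetical and are not taken from any actual embodiment; the sketch also ignores timing and severity for brevity.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRecord:
    kind: str         # error identifier as logged (hypothetical naming)
    timestamp: float  # seconds since an arbitrary reference


# Hypothetical, editable tables holding the decision parameters.
RELEVANT_KINDS = {"DISK_ERR", "SCSI_TIMEOUT"}            # selection stage
FILTER_TABLE = {"DISK_ERR": "disk_event",                 # filtering stage:
                "SCSI_TIMEOUT": "bus_event"}              # error kind -> event
ANALYSIS_TABLE = {"disk_event": 2, "bus_event": 3}        # event -> count needed


def select(log):
    """Keep only errors relevant to the fault conditions of interest."""
    return [e for e in log if e.kind in RELEVANT_KINDS]


def compose(errors):
    """Map each selected error to the event it contributes to."""
    return [FILTER_TABLE[e.kind] for e in errors if e.kind in FILTER_TABLE]


def analyze(events):
    """Report events whose count reaches the threshold in the analysis table."""
    counts = Counter(events)
    return sorted(name for name, n in counts.items()
                  if name in ANALYSIS_TABLE and n >= ANALYSIS_TABLE[name])


def run_ela(log):
    return analyze(compose(select(log)))
```

Because the decision criteria live in plain data structures rather than in the code, changing the behavior of such a sketch amounts to editing or copying the tables.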
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for diagnosing faults in a computer-based system, including:
reading a log of errors of different kinds that have been recorded in the system;
selecting from the log errors of those kinds that are relevant to one or more predetermined types of faults that can occur in the system;
filtering the selected errors so as to compose one or more events, each event including one or more occurrences of one or more of the relevant kinds of the errors; and
analyzing the composed events to reach an assessment that at least one of the predetermined types of faults has occurred.
Preferably, selecting the errors includes providing a respective callback function for each relevant kind of error, wherein the callback function analyzes data in the error log associated with the error in order to determine whether the error should be selected.
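A minimal sketch of callback-based selection is given below for illustration. The error kinds, the layout of a log entry as a dictionary, and the "retries" field are assumptions made for the example only.

```python
def accept_always(entry):
    """Callback for kinds that are relevant whenever they appear."""
    return True


def disk_error_is_hard(entry):
    """Callback that inspects the data logged with the error.

    Assumes (hypothetically) that the entry carries a retry count and that
    only errors persisting after three or more retries are of interest.
    """
    return entry.get("detail", {}).get("retries", 0) >= 3


# One callback per relevant kind of error; kinds not listed are discarded.
SELECTION_CALLBACKS = {
    "DISK_ERR": disk_error_is_hard,
    "FAN_FAIL": accept_always,
}


def select_errors(log):
    selected = []
    for entry in log:
        callback = SELECTION_CALLBACKS.get(entry["kind"])
        if callback is not None and callback(entry):
            selected.append(entry)
    return selected
```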
Further preferably, filtering the selected errors includes filtering the errors according to filtering conditions specified in a filtering table, each filtering condition specifying a set of errors required in order to compose one of the events. Most preferably, selecting the errors includes selecting from the log those errors that are known to belong to the set of errors associated with one or more of the filtering conditions. In a preferred embodiment, the set of errors required in order to compose one of the events includes multiple occurrences of one of the kinds of errors or, additionally or alternatively, one or more occurrences of each of a plurality of the kinds of errors. Preferably, the filtering condition specifies a maximum time lapse during which all of the plurality of the errors must occur in order for the condition to be satisfied. Additionally or alternatively, the filtering table further specifies a level of severity for at least some of the filtering conditions, and filtering the selected errors includes applying the filtering conditions to the errors in the error list in order of the level of severity of the conditions.
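For illustration, a filtering condition of this kind might be checked as sketched below. The field names are hypothetical, and the example assumes that a higher severity number means a more severe condition and that more severe conditions are applied first; the text above does not fix either choice.

```python
from dataclasses import dataclass


@dataclass
class FilterCondition:
    event: str        # event composed when the condition is satisfied
    required: dict    # error kind -> required number of occurrences
    max_lapse: float  # maximum time span covering all required errors
    severity: int     # assumed: larger value = more severe


def match(condition, errors):
    """Return indices of errors satisfying the condition, or None.

    `errors` is a list of (timestamp, kind) tuples sorted by timestamp.
    Each error is tried as the start of a time window of length max_lapse.
    """
    for start in range(len(errors)):
        t0 = errors[start][0]
        needed = dict(condition.required)
        picked = []
        for i in range(start, len(errors)):
            t, kind = errors[i]
            if t - t0 > condition.max_lapse:
                break  # outside the allowed time lapse
            if needed.get(kind, 0) > 0:
                needed[kind] -= 1
                picked.append(i)
            if not any(needed.values()):
                return picked  # every required occurrence was found
    return None


def first_event(conditions, errors):
    """Apply conditions in order of severity; return the first event composed."""
    for cond in sorted(conditions, key=lambda c: c.severity, reverse=True):
        if match(cond, errors) is not None:
            return cond.event
    return None
```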
Preferably, filtering the selected errors includes removing errors that have been used in composing one of the events from the error list, whereby any given error is not used to compose more than a single event. Most preferably, removing the errors from the error list includes removing both errors specified as being required to compose a given one of the events and errors specified as being associated with the given one of the events but not required to compose it.
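The removal of used errors can be sketched as follows. This is a simplified illustration that represents each error only by its kind, ignores timing, and assumes that all associated errors remaining in the list are removed each time an event is composed; the condition entries are hypothetical.

```python
def compose_events(error_list, conditions):
    """Compose events, removing the errors consumed by each one.

    `error_list` is a list of error kinds (strings). Each condition is a
    dict with 'event', 'required' (set of kinds, one occurrence each) and
    'associated' (set of kinds removed along with the event but not
    needed to compose it).
    """
    remaining = list(error_list)
    events = []
    for cond in conditions:
        # Repeat while every required kind is still present in the list.
        while cond["required"] <= set(remaining):
            events.append(cond["event"])
            for kind in cond["required"]:
                remaining.remove(kind)  # removes a single occurrence
            remaining = [k for k in remaining
                         if k not in cond["associated"]]
    return events, remaining
```

In this sketch, an error that accompanies a more significant event cannot later be used to compose a separate, misleading event of its own.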
Further preferably, analyzing the composed events includes assigning the events to event sets specified in an event sets table, wherein each event set is associated with at least one of the predetermined types of faults. Most preferably, the event sets table specifies a number of instances of one or more of the events that must occur within a given time frame in order for the event set to be complete, and analyzing the composed events includes reaching an assessment that the type of fault associated with a given one of the event sets has occurred if the event set is complete.
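As an illustrative sketch, the completeness test for an event set might look like the following. The layout of the table entries, the event and fault names, and the numerical values are assumptions; each entry here involves a single kind of event, and the required count is assumed to be at least one.

```python
def event_set_complete(event_times, required_count, time_frame):
    """True if required_count events fall within some window of time_frame."""
    times = sorted(event_times)
    for i in range(len(times) - required_count + 1):
        # Span between the i-th event and the one required_count - 1 later.
        if times[i + required_count - 1] - times[i] <= time_frame:
            return True
    return False


def assess(events, event_sets):
    """Return the faults whose event sets are complete.

    `events` is a list of (timestamp, event_name) tuples; `event_sets`
    is a list of dicts with 'fault', 'event', 'count' and 'time_frame'.
    """
    faults = []
    for entry in event_sets:
        times = [t for t, name in events if name == entry["event"]]
        if event_set_complete(times, entry["count"], entry["time_frame"]):
            faults.append(entry["fault"])
    return faults


# Hypothetical event sets table.
EVENT_SETS = [
    {"fault": "failing disk", "event": "disk_event",
     "count": 3, "time_frame": 3600.0},
]
```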
Preferably, analyzing the composed events includes outputting a message to a user with the assessment that one of the predetermined types of faults has occurred with a specified probability that the assessment is correct. In a preferred embodiment, outputting the message includes indicating two or more of the predetermined types of faults that may have occurred, each indicated type with a respective, specified probability. In a further preferred embodiment, outputting the message includes indicating a component of the system that should be replaced. Preferably, reading the log of errors includes reading the error log automatically at predetermined time intervals, and outputting the message includes reporting the assessment to the user automatically, responsive to reaching the assessment that the fault has occurred.
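A message of the kind described might be assembled as in the following sketch. The message wording, the fault names, the component names and the probability values are hypothetical and serve only to show the shape of the output.

```python
def format_assessment(candidates):
    """Build an operator message from candidate faults.

    `candidates` is a list of (fault, probability, component) tuples, where
    `component` names a part suggested for replacement, or is None.
    Candidates are listed from most to least probable.
    """
    lines = ["Fault assessment:"]
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for fault, probability, component in ranked:
        line = f"  {fault}: probability {probability:.0%}"
        if component:
            line += f"; suggested replacement: {component}"
        lines.append(line)
    return "\n".join(lines)
```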
There is also provided, in accordance with a preferred embodiment of the present invention, apparatus for diagnosing faults in a computer-based system, including an error log processor, adapted to read a log of errors of different kinds that have been recorded in the system, to select from the log errors of those kinds that are relevant to one or more predetermined types of faults that can occur in the system, to filter the selected errors so as to compose one or more events, each event including one or more occurrences of one or more of the relevant kinds of the errors, and to analyze the composed events to reach an assessment that at least one of the predetermined types of faults has occurred.
Preferably, the apparatus includes a storage device, in which the log of errors is recorded, wherein the error log processor is coupled to read the log from the storage device substantially automatically. Additionally or alternatively, the apparatus includes a memory, in which the processor stores one or more tables containing conditions according to which the error log is processed. Further additionally or alternatively, the apparatus includes a display, wherein the processor is coupled to output a message to the display with the assessment that one of the predetermined types of faults has occurred along with a specified probability that the assessment is correct.
There is further provided, in accordance with a preferred embodiment of the present invention, a computer program product for diagnosing faults in a computer-based system, the product including a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to read a log of errors of different kinds that have been recorded in the system, to select from the log errors of those kinds that are relevant to one or more predetermined types of faults that can occur in the system, to filter the selected errors so as to compose one or more events, each event including one or more occurrences of one or more of the relevant kinds of the errors, and to analyze the composed events to reach an assessment that at least one of the predetermined types of faults has occurred.
The present invention will be more fully understood from the following detailed description of the preferred embodiments thereof, taken together with the drawings in which: