1. Field of the Invention
The present invention generally relates to a causal model for problem determination and to a method of capturing the relationship between causes and error messages for local area network (LAN) systems to be managed and analyzed. More particularly, the invention relates to a causal model which represents problem solving knowledge and the relationships between causes and error messages using a limited multi-fault approach. The causal model can be used by an inference engine in an expert system for diagnostic reasoning and for the analysis and correlation of error messages.
2. Description of the Prior Art
It is common for computer systems, in particular local area networks (LANs), to have numerous error events, the majority of which require different messages to be sent to the user and many different complex actions to be performed for recovery. These errors result from a variety of conditions, including configuration errors, hardware errors and communication errors.
At present, error analysis and problem resolution are often handled manually by LAN administrators. There are two problems with this approach. The first is that the error messages often contain vague or incomplete information. An example of this would be the error message "internal software error". The administrator must then decipher the error message or perform additional work to determine the actual cause of the error. The second problem with manual error resolution is that one problem can often generate multiple error messages, especially in a LAN system. As a result, the LAN administrator is often overwhelmed by the number of errors that need to be analyzed. Furthermore, the analysis and review of errors is knowledge intensive. For these reasons, it has been difficult to implement a non-manual method or system for managing error messages.
Some attempts have been made in the past to implement an error manager; however, these have been unsuccessful due to the large amount of information which must be stored and the knowledge required. In some cases, error managers have been implemented with complicated in-line code which is called after an error event is recognized. Other implementations have used "table driven" error management. However, since each error event can have many action codes and each unique error event/action code pair must be represented, this approach is inefficient in representation and storage. Furthermore, none of these methods provides a system which enables users to modify the error handling method.
Error management requires that problems and causes be correlated so that information regarding the error can be analyzed and provided. At present, most problem determination systems use the single-fault assumption, wherein only a single fault can exist in a system at one time and that fault is associated with a single cause. The single-fault assumption cannot be used with a complex network system or computer system. Other complex systems use the multi-fault assumption; however, this is computationally too expensive for a real-time system and therefore cannot provide error information in real time.
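The difference in computational cost between the two assumptions can be illustrated by counting the candidate diagnoses each one admits. The following is a minimal sketch and not part of the described system: the function names are hypothetical, and the bounded variant assumes, purely for illustration, that a "limited" multi-fault approach caps the number of simultaneous faults at a small constant.

```python
from math import comb


def single_fault_hypotheses(n_causes: int) -> int:
    """Single-fault assumption: exactly one cause is active at a time."""
    return n_causes


def multi_fault_hypotheses(n_causes: int) -> int:
    """Unrestricted multi-fault assumption: any non-empty subset of causes."""
    return 2 ** n_causes - 1


def limited_multi_fault_hypotheses(n_causes: int, max_faults: int) -> int:
    """Assumed illustration: at most `max_faults` causes active at once."""
    upper = min(max_faults, n_causes)
    return sum(comb(n_causes, k) for k in range(1, upper + 1))


if __name__ == "__main__":
    n = 20  # hypothetical number of possible causes
    print(single_fault_hypotheses(n))            # 20
    print(multi_fault_hypotheses(n))             # 1048575
    print(limited_multi_fault_hypotheses(n, 2))  # 210
```

With 20 possible causes, the unrestricted multi-fault search space is already over a million candidate cause sets, while a bound of two simultaneous faults keeps it at 210. This shows why an unrestricted multi-fault analysis is difficult to perform in real time.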