The complexity of computer networks continues to grow, and the levels of reliability, availability and service required of these networks continue to rise, as well. These factors place an increasing burden on diagnostic systems that are used in computer networks to identify and isolate network faults. In order to avoid failures that may severely interfere with network activity, it is important to detect intermittent and sporadic problems that are predictive of incipient failures and to pinpoint the devices that are causing the problems. To maintain high availability of the network, these problems should be identified while the network is on-line and running in a normal activity mode. Service personnel can then be instructed to replace faulty elements before they fail completely.
Modern networks typically provide large volumes of diagnostic information, such as topology files, system-wide error logs and component-specific trace files. Analyzing this information to identify network faults is beyond the capabilities of all but the most skilled network administrators. Most automated approaches to network diagnostics attempt to overcome this problem by framing expert knowledge in the form of if-then rules, which are automatically applied to the diagnostic information. Typically, the rules are heuristic, and must be crafted specifically for the system to which they are to be applied. As a result, the rules are difficult to devise, cannot be applied generally to all error conditions that may arise, and must be updated whenever the system configuration is changed.
Model-based diagnostic approaches begin from a functional model of the system in question and analyze it to identify faulty components in the case of malfunction. Functional models (also known as forward or causal models) are often readily available as part of system specifications or reliability analysis models. Developing such a model is typically a straightforward part of the system design or analysis process. Thus, creating the model does not require that the designer be expert in diagnosing system faults. Rather, automated algorithms are applied to the functional model in order to reach diagnostic conclusions. As long as the system model is updated to reflect configuration changes, these algorithms will automatically adapt the diagnostics to the changes that are made.
Switched computing and communication networks, such as System Area Networks (SANs), pose particular challenges for diagnostic applications in terms of their complexity and inherent uncertainties. The complexity has to do with the large numbers of components involved, the existence of multiple, dynamic paths between devices in the network, and the huge amount of information that these networks carry. The uncertainties stem, inter alia, from the fact that alarm messages are carried through the network in packet form. As a result, there may be unknown delays in alarm transmission, leading to alarms arriving out of order, and even loss of some alarm packets.
One paradigm known in the art for model-based diagnostics in the presence of uncertainty is Bayesian Networks. Cowell et al. provide a general description of Bayesian Network theory in Probabilistic Networks and Expert Systems (Springer-Verlag, N.Y., 1999), which is incorporated herein by reference. A Bayesian Network is a directed, acyclic graph having nodes corresponding to the domain variables, with conditional probability tables attached to each node. When the directions of the edges in the graph correspond to cause-effect relationships between the nodes, the Bayesian Network is also referred to as a causal network. The absence of an edge between a pair of nodes represents an assumption that the nodes are conditionally independent. The product of the probability tables gives the joint probability distribution of the variables. The probabilities are updated as new evidence is gathered regarding co-occurrence of faults and malfunctions in the system under test. When the diagnostic system receives a new alarm or set of alarms, it uses the Bayesian Network to automatically determine the most probable malfunctions behind the alarm.
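By way of illustration only, the following minimal Python sketch shows how the product of the probability tables gives the joint distribution, and how the most probable malfunction behind an alarm can be found by enumeration. The three-node network (two fault nodes and one alarm node), the names `P_SWITCH`, `P_LINK`, `P_ALARM_GIVEN`, `joint` and `posterior_given_alarm`, and all numerical values are hypothetical assumptions chosen for the example; they are not taken from Cowell et al. or from any particular diagnostic system.

```python
from itertools import product

# Hypothetical prior probabilities of each fault (illustrative numbers only).
P_SWITCH = {True: 0.01, False: 0.99}
P_LINK = {True: 0.05, False: 0.95}

# Hypothetical conditional probability table attached to the alarm node:
# P(alarm = True | switch fault, link fault).
P_ALARM_GIVEN = {
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.70,
    (False, False): 0.01,
}


def joint(switch, link, alarm):
    """Joint probability as the product of the three node tables."""
    p_alarm = P_ALARM_GIVEN[(switch, link)]
    p_alarm_term = p_alarm if alarm else 1.0 - p_alarm
    return P_SWITCH[switch] * P_LINK[link] * p_alarm_term


def posterior_given_alarm():
    """Return (P(switch fault | alarm), P(link fault | alarm)) by enumeration."""
    states = [True, False]
    # Probability of the evidence: sum the joint over all fault combinations.
    evidence = sum(joint(s, l, True) for s, l in product(states, repeat=2))
    p_switch = sum(joint(True, l, True) for l in states) / evidence
    p_link = sum(joint(s, True, True) for s in states) / evidence
    return p_switch, p_link


if __name__ == "__main__":
    p_switch, p_link = posterior_given_alarm()
    print(f"P(switch fault | alarm) = {p_switch:.3f}")
    print(f"P(link fault | alarm)   = {p_link:.3f}")
```

With these assumed tables, the link fault is the more probable cause of the alarm, because its higher prior outweighs the stronger coupling between the switch fault and the alarm. Exhaustive enumeration of this kind scales exponentially with the number of nodes, so it is suitable only for very small networks.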
U.S. Pat. No. 6,076,083, whose disclosure is incorporated herein by reference, describes an exemplary application of Bayesian Networks to diagnostics of a communication network. The communication network is represented as a Bayesian Network, such that devices and communication links in the communication network are represented as nodes in the Bayesian Network. Faults in the communication network are identified and recorded in the form of trouble tickets, and one or more probable causes of the fault are given based on the Bayesian Network calculations. When a fault is corrected, the Bayesian Network is updated with the knowledge learned in correcting the fault. The updated trouble ticket information is used to automatically update the appropriate probability matrices in the Bayesian Network. The Bayesian Network of U.S. Pat. No. 6,076,083 is static and makes no provision for changes in the configuration of the communication network. Furthermore, because the Bayesian Network models the entire communication network, it quickly becomes computationally intractable when it must model a large, complex switched network.
Another approach to the application of Bayesian Networks to fault diagnosis in computer systems is described by Pizza et al., in “Optimal Discrimination between Transient and Permanent Faults,” Proceedings of the Third IEEE High Assurance System Engineering Symposium (1998), which is incorporated herein by reference. The authors suggest applying the principles of reliability theory to discriminating between transient faults and permanent faults in components of a computer system. Reliability theory predicts the probability of failure of a given device in terms of failure rate or failure distribution over time (using measures such as Mean Time Between Failures, or MTBF). Standard reliability theory techniques are based on sampling device performance under known conditions. In the scheme proposed by Pizza et al., on the other hand, the probabilities of permanent and transient faults of the system components are estimated and updated by inference using a Bayesian Network. This scheme is of only limited practical applicability, however, since in order to arrive at an exact and optimal decision as to failure probability, it looks at each module in the computer system in isolation, without accounting for error propagation from one module to another. This is not an assumption that can reasonably be made in real-world switched networks.
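The contrast between a reliability-theory estimate and a Bayesian update for a single, isolated module may be illustrated by the following Python sketch. It is not the algorithm of Pizza et al.; it is a simplified example under stated assumptions. The constant-failure-rate (exponential) model, the function names `failure_probability` and `update_permanent_belief`, and the per-observation error likelihoods of 0.9 and 0.1 are all hypothetical choices made for the illustration.

```python
import math


def failure_probability(t_hours, mtbf_hours):
    """P(failure by time t), assuming a constant failure rate of 1/MTBF."""
    return 1.0 - math.exp(-t_hours / mtbf_hours)


def update_permanent_belief(prior, error_seen,
                            p_err_permanent=0.9, p_err_transient=0.1):
    """One Bayes-rule update of P(fault is permanent) for an isolated module.

    The likelihoods are assumed values: a permanently faulty module is taken
    to show an error on most observations, a transiently faulty one rarely.
    """
    like_perm = p_err_permanent if error_seen else 1.0 - p_err_permanent
    like_trans = p_err_transient if error_seen else 1.0 - p_err_transient
    numerator = like_perm * prior
    return numerator / (numerator + like_trans * (1.0 - prior))


if __name__ == "__main__":
    # Reliability-theory view: failure probability after one MTBF interval.
    print(f"P(fail within MTBF) = {failure_probability(1000.0, 1000.0):.3f}")

    # Bayesian view: belief in a permanent fault after a run of observations.
    belief = 0.5
    for error_seen in [True, True, False, True]:
        belief = update_permanent_belief(belief, error_seen)
        print(f"error={error_seen!s:5}  P(permanent) = {belief:.3f}")
```

Because the update considers only errors observed at the module itself, an error that originates elsewhere and propagates to the module would be wrongly counted as evidence against it, which is the limitation noted above.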