The approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
Network administrators need effective tools for determining what problems have occurred in computer networks, especially large, complex networks of the type owned and operated by service providers. Many existing solutions for event correlation and generating network diagnostics are not suitable for real-time monitoring and on-line diagnosis, because they require complex computational models and are computationally expensive. Operations support systems (OSS) commonly incorporate topology models and event models that add to the computational costs, which can lengthen the time for the OSS to provide feedback to an administrator and thereby delay corrective action.
Further, many existing systems inadequately track event information structures that are communicated by network elements. For example, a network element may emit multiple events associated with a single problem, but due to routing complexities and network latency, the events arrive at the OSS at different times or out of order. Accounting for temporal dependencies and considering event reordering issues imposes challenging responsibilities on such systems. In a typical approach, when an OSS receives one event, a plurality of problem diagnoses may be possible, and the OSS determines a single diagnosis only when a specified set of events arrives in a specified order. Thus, a single event {e1} may lead to numerous diagnoses, while a full set {e1, e2, e3, e4} precisely identifies the faulty element. Until the full set of events is received, the OSS cannot be used to isolate a network problem.
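The order-dependent behavior described above can be illustrated with a minimal sketch. The diagnosis names, the event patterns other than {e1, e2, e3, e4}, and the helper functions are hypothetical and serve only as an illustration; they are not taken from any particular OSS.

```python
from typing import Dict, Sequence, Set, Tuple

# Hypothetical table: each diagnosis requires its events in exactly this order.
PATTERNS: Dict[str, Tuple[str, ...]] = {
    "faulty_element_A": ("e1", "e2", "e3", "e4"),
    "faulty_element_B": ("e1", "e2", "e5"),
    "faulty_element_C": ("e1", "e6"),
}


def candidates(received: Sequence[str]) -> Set[str]:
    """Diagnoses whose ordered pattern begins with the events received so far."""
    n = len(received)
    return {
        name
        for name, pattern in PATTERNS.items()
        if pattern[:n] == tuple(received)
    }


def confirmed(received: Sequence[str]) -> Set[str]:
    """Diagnoses whose complete ordered pattern has been received."""
    return {
        name
        for name, pattern in PATTERNS.items()
        if pattern == tuple(received)
    }


if __name__ == "__main__":
    print(sorted(candidates(["e1"])))                  # several diagnoses remain possible
    print(sorted(confirmed(["e1", "e2", "e3", "e4"])))  # one diagnosis is isolated
    print(sorted(candidates(["e2", "e1"])))            # reordered arrival matches nothing
```

In this sketch, a single event leaves every diagnosis open, the full ordered set isolates one, and the same events arriving out of order match no pattern at all, which is the weakness of approaches that depend on a specified arrival order.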
The failure of these systems to keep track of temporal dependencies and appropriately handle event reordering is one of the main reasons for developing customized event correlation in an OSS. Solutions that do not require complex or rich models and that provide a lightweight implementation, potentially suitable for implementation inside the network element, would be preferable.
Further, the geographic distribution of network elements in a network may introduce a variable delay, making event patterns that depend tightly on time unreliable for real-time monitoring and diagnosis. By contrast, in networks that guarantee clock synchronization for validating the temporal relationship of events, time-based relationships can be used effectively for event correlation.
Delay and Internet Protocol (IP) routing mechanisms may introduce event reordering, because event packets may follow different paths to reach their destination. In such networks, the relative ordering of the events is no longer guaranteed. For networks that guarantee bounded time delays and correct event ordering, the concept of progressive patterns for event correlation can be used.
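As a simple illustration of how differing path delays can reorder events, consider the following sketch. The event names, emission times, and per-path delays are assumed values chosen only for the example.

```python
from typing import List, Tuple

# Each event is (name, emit_time_seconds, path_delay_seconds); all values are hypothetical.
Event = Tuple[str, float, float]


def arrival_order(events: List[Event]) -> List[str]:
    """Return event names in the order they reach the receiver (emit time + path delay)."""
    return [name for name, emit, delay in sorted(events, key=lambda e: e[1] + e[2])]


if __name__ == "__main__":
    # e1 is emitted first but travels over a slower path than e2 and e3.
    unequal_paths = [("e1", 0.0, 0.30), ("e2", 0.1, 0.05), ("e3", 0.2, 0.05)]
    print(arrival_order(unequal_paths))

    # With the same delay on every path, emission order is preserved.
    equal_paths = [("e1", 0.0, 0.05), ("e2", 0.1, 0.05), ("e3", 0.2, 0.05)]
    print(arrival_order(equal_paths))
```

With unequal path delays, the first event emitted is the last to arrive, so a receiver that relies on arrival order sees a sequence the network element never produced.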
However, many networks have desynchronized sub-network behavior, uncontrollable delays, and event reordering. For these networks, approaches based on temporal relationships and progressive patterns are no longer useful; other mechanisms are needed to evaluate and diagnose network behavior.
In some approaches, dependencies among symptoms and diagnoses are captured through policies. Policies express a logical diagnosis under known conditions of topology, event delivery, and network transport properties. Because the topology (or the configuration of the logical interactions) may change, the rules that map symptoms to diagnoses must be revised whenever it does.
A network diagnosis is a hypothesis about faulty components in the network. A diagnosis may be passive or active. Model-based passive diagnosis systems collect information and analyze it. Many approaches have been used to analyze the information, e.g., Bayesian networks, Petri nets, artificial neural networks, rule-based methods, and model-based networks. Active diagnosis systems apply additional tests to the results of the passive diagnosis.
Diagnosing network problems is time-consuming. Therefore, performance-oriented, knowledge-based methods and mechanisms that speed up diagnosis would be beneficial.
In one class of prior approaches to this problem, topology-dependent and model-based correlation and diagnosis processes, using root-cause analysis, have been implemented. For example, InCharge from Smarts, NetCool from MicroMuse, and OpenView from Hewlett-Packard implement these approaches. These solutions are mainly based on dependency models and topology definition and discovery of network elements and/or applications. These mechanisms are intended for out-of-the-box processing and require considerable CPU power and memory.
In another approach, network problems, symptoms and diagnoses are defined in a rule-based markup language (RBML). The markup language is also used to define rules that specify when a particular diagnosis is indicated by one or more symptoms. RBML is described in co-pending U.S. application Ser. No. 10/714,158, filed Nov. 13, 2003, of Keith Sinclair et al. RBML is primarily a language, together with an environment for executing rules written in that language. It operates on knowledge of network behavior expressed as a set of rules, but it does not impose any specific model of network behavior. With RBML it is not possible to map a set of events to multiple possible diagnoses. RBML does not account for events that may arrive in any order. Further, all events defined in a rule must occur to trigger the action defined in the rule.