Root cause analysis in communication networks typically involves determining the actual fault or problem that causes a network outage, alarm or event. A single fault in a network usually generates a plurality of event or alarm messages relating to a plurality of links connecting network devices and to the devices themselves. A network monitoring device typically receives a plurality of messages and tries to determine from the messages the location of one or more faults in the network. In addition, an effort is made to associate the messages received with the faults that are identified. In this way, an engineering decision can be made to prioritize faults by severity; for example, the type and number of messages associated with a particular fault typically indicate its severity.
Known root cause analysis methods typically determine the ultimate cause or fault in a network based on a known network topology. For example, a map of the network that includes all the nodes in the network and the links between the nodes is typically maintained. When messages are received by a network monitoring device, the device then performs a root cause analysis based on the network topology and the messages received. U.S. Pat. No. 6,604,208 to Gosselin, et al., (“the '208 patent”) is exemplary of such schemes. In the '208 patent, the hierarchical nature of a network is used to correlate alarm events. Over time, alarms are processed in view of an historical context to determine instances of correlation, such that alarms are partitioned into correlation sets where the alarms within one set have a high probability of being caused by the same network fault. Schemes such as that employed in the '208 patent, however, lose much of their utility in a network where hierarchical relationships within the network do not remain constant.
More particularly, in networks where hierarchical relationships do not exist between network devices, or where the hierarchical relationships in the network change dynamically in response to faults or other events, the meaning of the alarm or event messages that are generated also changes dynamically.
A network employing multi-protocol label switching (MPLS) is exemplary of networks where network topology or hierarchy alone cannot be relied on to perform root cause analysis. In an MPLS network, data transmission occurs on label-switched paths (LSPs). LSPs are defined by a sequence of labels that are distributed at each node along a path from a source to a destination. LSPs may be established either prior to data transmission (control-driven) or upon detection of a certain flow of data (data-driven). The labels, which are underlying protocol-specific identifiers, may be distributed using label distribution protocol (LDP) or resource reservation protocol (RSVP), or piggybacked on routing protocols such as border gateway protocol (BGP) and open shortest path first (OSPF). The labels are of fixed length and are inserted at the very beginning of a packet or cell. The labels may then be used by nodes in the network to switch the packets or cells between links coupled to the switching node.
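By way of illustration only, the label-based switching described above may be sketched as follows. The sketch assumes a simplified model in which each node holds a table mapping an incoming label to a next node and an outgoing label; the node names, label values and the `forward` function are hypothetical and are not drawn from any particular MPLS implementation.

```python
# Hypothetical per-node label tables: node -> {incoming label: (next node, outgoing label)}.
# An outgoing label of None means the label is removed and the packet leaves the LSP.
LABEL_TABLES = {
    "A": {17: ("B", 22)},
    "B": {22: ("C", 35)},
    "C": {35: ("D", None)},
}


def forward(tables, ingress, first_label):
    """Follow the label swaps from the ingress node and return the (node, label) hops."""
    node, label = ingress, first_label
    hops = []
    while label is not None:
        hops.append((node, label))
        # Each node switches on the label alone; no address lookup is modeled here.
        node, label = tables[node][label]
    hops.append((node, None))  # egress: no label remains
    return hops


path = forward(LABEL_TABLES, "A", 17)
```

In this simplified model, the sequence of labels (17, 22, 35) together with the nodes that hold them is what defines the LSP from node A to node D.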
An LSP may be established using either hop-by-hop or explicit routing. Hop-by-hop routing is similar to that used in IP (Internet Protocol) networks. In particular, in hop-by-hop routing, each label-switched router (LSR) independently selects the next hop for each label-switched packet. In explicit routing, an ingress LSR (i.e., an LSR where data flow originates) specifies the list of nodes through which data will flow. Explicit routing may also be strict or loose. A strict explicitly routed label-switched path follows a list of nodes using the actual addresses of each node that is to be traversed, while a loose explicitly routed label-switched path is more adaptive and allows groups of nodes, specified as an autonomous system number, to act as one of the nodes that may be traversed.
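The distinction between strict and loose explicit routes may be illustrated with a minimal sketch. It assumes a simplified model in which every entry of the explicit route is matched by exactly one traversed node, either by address or by membership in an autonomous system; the addresses, autonomous system numbers and the `satisfies` function are hypothetical.

```python
# Hypothetical mapping from node address to autonomous system number.
AS_OF = {"10.0.0.4": 65001, "10.0.0.5": 65001, "10.0.0.7": 65002}

# A strict route names every node by its address.
STRICT_ROUTE = [("node", "10.0.0.1"), ("node", "10.0.0.4"), ("node", "10.0.0.9")]

# A loose route lets any node of autonomous system 65001 serve as the middle hop.
LOOSE_ROUTE = [("node", "10.0.0.1"), ("as", 65001), ("node", "10.0.0.9")]


def satisfies(path, route, as_of):
    """Return True if each traversed node matches the corresponding route entry."""
    if len(path) != len(route):
        return False
    for node, (kind, value) in zip(path, route):
        if kind == "node" and node != value:
            return False  # strict entry: the exact address is required
        if kind == "as" and as_of.get(node) != value:
            return False  # loose entry: any node in the named group is accepted
    return True


via_4 = ["10.0.0.1", "10.0.0.4", "10.0.0.9"]
via_5 = ["10.0.0.1", "10.0.0.5", "10.0.0.9"]
```

Under these assumptions, both `via_4` and `via_5` satisfy the loose route, whereas only `via_4` satisfies the strict route.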
In an MPLS network, the path that data takes through the network changes dynamically in response to failures or repairs. For example, a failure on a first LSP may preempt service on a second LSP because the first LSP was granted a higher priority than the second LSP. A device monitoring the network may receive a plurality of event status messages from the different nodes that are affected by the failure. The event status messages are in the form of traps and include information identifying the LSP and the status of the LSP. The traps, however, do not generally include information that would indicate any relationship between the different event messages or traps.
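The preemption scenario described above may be sketched as follows. The sketch assumes a single link of fixed capacity, a numeric priority in which a lower number denotes a higher priority, and traps that carry only an LSP identifier and a status; the `Lsp` and `Trap` types, the status strings and the `reroute_onto` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lsp:
    name: str
    priority: int   # assumption: lower number means higher priority
    bandwidth: int


@dataclass(frozen=True)
class Trap:
    lsp: str        # identifies the LSP
    status: str     # hypothetical status values: "rerouted" or "down"


def reroute_onto(capacity, resident, incoming):
    """Move `incoming` onto a link, preempting lower-priority LSPs if needed.

    Returns (LSPs remaining on the link, traps emitted).
    """
    survivors = sorted(resident, key=lambda lsp: lsp.priority)
    preempted = []
    used = incoming.bandwidth + sum(lsp.bandwidth for lsp in survivors)
    # Drop the lowest-priority resident LSPs until the incoming LSP fits.
    while used > capacity and survivors and survivors[-1].priority > incoming.priority:
        victim = survivors.pop()
        used -= victim.bandwidth
        preempted.append(victim)
    if used > capacity:
        # The incoming LSP cannot be placed; the link is left unchanged.
        return list(resident), [Trap(incoming.name, "down")]
    traps = [Trap(incoming.name, "rerouted")]
    traps += [Trap(victim.name, "down") for victim in preempted]
    return survivors, traps


remaining, traps = reroute_onto(100, [Lsp("LSP-2", 5, 80)], Lsp("LSP-1", 1, 60))
```

In this example, the two resulting traps report that LSP-1 was rerouted and that LSP-2 went down, but neither trap carries any field relating it to the other or to the failure that caused both.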
Of utility then are methods and systems for correlating events or event messages in MPLS-type networks and for determining a root cause of the events or event messages.