In modern communication systems, network elements are equipped with fault management functionalities that involve the generation of an alarm upon detecting some malfunctioning. When a fault occurs in a communication entity of a network element (e.g., in a protocol layer), the services provided by the communication entity may be degraded or blocked completely. As a result, other communication entities of the network element and/or other network elements relying on such services will also exhibit fault symptoms and may start generating alarms themselves. Consequently, one single fault may propagate through a larger part of the communication system and give rise to a high number of correlated alarms.
FIG. 1 schematically illustrates the propagation of a fault through different protocol layers (L1 to L7) of an individual network element as well as through different network elements (NE1 to NE3). In FIG. 1, it is assumed that the fault occurs in L2 of NE3 and that an alarm will thus be generated by L2. Due to this fault in L2, the services provided by L2 to L3 of NE3 will be blocked, so that also L3 of NE3 will not be able to function properly (as it relies on the services of L2). Thus, L3 of NE3 will generate an alarm itself. The same mechanism will happen between L3 and L4 of NE3, and between all higher protocol layers of the protocol stack of NE3. Consequently, the fault that has occurred in L2 propagates “upwards” in the protocol stack of NE3. Such a fault propagation within a single network element will in the following be referred to as “vertical” fault propagation (of course, the fault could also vertically propagate “downwards” in the protocol stack).
A fault in an individual network element such as NE3 could also propagate to one or more network elements on a peer side of the faulty network element as illustrated in FIG. 1 for network elements NE1 and NE2. As shown in FIG. 1, the fault in L2 of NE3 will result in a malfunctioning of the corresponding L2 of the peers NE1 and NE2 of NE3, so that NE1 and NE2 will generate alarms themselves. Such a fault propagation among communicating network elements will in the following also be referred to as “horizontal” fault propagation. It should be noted that the horizontal fault propagation may in turn give rise to a vertical fault propagation as illustrated in FIG. 1 for NE1.
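The vertical and horizontal propagation described above can be illustrated by a minimal sketch. The sketch is not taken from FIG. 1 itself; it assumes a simplified model in which a faulty layer always disturbs the next higher layer of the same network element and the same layer of every peer. The function name `propagate` and the `peers` mapping are hypothetical names chosen for illustration only.

```python
from collections import deque

TOP_LAYER = 7  # protocol layers L1 to L7, as in FIG. 1


def propagate(root, peers):
    """Return the (element, layer) entities that raise an alarm.

    root  -- (element, layer) where the root fault occurs
    peers -- hypothetical mapping: element -> peer elements it disturbs
    Assumed model: a fault spreads upwards within an element (vertical)
    and to the same layer of each peer (horizontal).
    """
    alarms = []
    seen = {root}
    queue = deque([root])
    while queue:
        element, layer = queue.popleft()
        alarms.append((element, layer))
        affected = []
        if layer < TOP_LAYER:
            # vertical propagation: next higher layer of the same element
            affected.append((element, layer + 1))
        for peer in peers.get(element, ()):
            # horizontal propagation: same layer of a peer element
            affected.append((peer, layer))
        for entity in affected:
            if entity not in seen:
                seen.add(entity)
                queue.append(entity)
    return alarms


# Root fault in L2 of NE3, with NE1 and NE2 as peers of NE3.
alarms = propagate(("NE3", 2), {"NE3": ["NE1", "NE2"]})
print(len(alarms), "alarms caused by a single root fault")
```

Even in this small model, one root fault produces many alarms, and only the first entry of the returned list corresponds to the root alarm.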
The purpose of alarm correlation is to find a relation between alarms that are caused by the same fault (“root fault”) and to trace back the alarms to the alarm (“root alarm”) generated in direct response to the root fault. Since in larger communication networks hundreds of alarms may be active in parallel at any given moment, it is not an easy task to identify the one or more root alarms in a long alarm list. It should also be noted that the temporal order of the alarms typically does not correspond to the logical order in which the correlated faults have occurred. This lack of correspondence can be attributed to different triggering thresholds for alarm generation in different communication entities (in the example of FIG. 1, the L2 alarm in NE3 may be preceded by the L3 alarm in NE3) and in different network elements (in the example of FIG. 1, the L2 alarm of NE1 may be preceded by the L2 alarm of NE2).
To find a correlation among a plurality of alarms, the content of the respective alarm messages can be analyzed. Alarm message specifications are available for a large number of different communication systems. For communication systems according to the 3rd Generation Partnership Project (3GPP), the content and format of alarm messages are defined, inter alia, in Technical Specification (TS) 32.111-2 V.10.0.0 (2010-12); Fault Management; Part 2: Alarm Integration Reference Point (IRP): Information Service (IS). In section 5.3.1.2 of this TS, different alarm attributes that may be signalled in an alarm message are listed. The alarm attributes include information about the time when the alarm was raised, about a probable alarm cause and about proposed repair actions. However, the information derivable from the alarm attributes is confined to the network element reporting the alarm. As a result, it is rather difficult to find relations among alarms generated by different network elements.
WO 2006/057588 A1 discloses a technique for correlating alarms generated by different network elements that have a client-server relationship. A serving network element that locally detects a fault generates a Fault Identifier (FID) in the form of a randomly generated number. The faulty serving network element reports the resulting fault together with the FID via a first alarm message to a network management system. Additionally, the serving network element informs its client network elements of the service loss or degradation via traffic messages to which the FID is appended. Each client network element extracts the FID from the traffic message and appends it to a further alarm message that is also sent to the network management system. Since the same FID is reported to the network management system via alarm messages generated by the faulty serving network element on the one hand and the client network elements affected by the fault on the other, the network management system can correlate the resulting alarm messages.
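The FID-based correlation at the network management system can be sketched as a simple grouping of alarm messages by their FID. The following is only an illustrative sketch and not the implementation of WO 2006/057588 A1: the `Alarm` record, the 32-bit FID width and the function names `new_fid` and `correlate` are assumptions introduced here.

```python
import secrets
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    """Hypothetical alarm message as seen by the management system."""
    source: str  # network element that sent the alarm message
    fid: int     # Fault Identifier carried in the alarm message


def new_fid():
    # Randomly generated FID; the 32-bit width is an assumption.
    return secrets.randbits(32)


def correlate(alarms):
    """Group alarm messages that carry the same FID."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[alarm.fid].append(alarm)
    return dict(groups)


# The serving element generates the FID and reports it in its own alarm;
# each client copies the FID received via a traffic message into its alarm.
fid = new_fid()
received = [
    Alarm("server", fid),
    Alarm("client-1", fid),
    Alarm("client-2", fid),
]
for group_fid, group in correlate(received).items():
    print(group_fid, [alarm.source for alarm in group])
```

Because the grouping key is a random number, the management system needs no knowledge of the network topology, but two unrelated faults could in principle draw the same FID.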
One drawback of the correlation approach presented in WO 2006/057588 A1 is that it requires signalling for each fault via dedicated traffic messages between the serving network element and the served network elements to propagate the FID. Additionally, there has to be a pre-established client-server relationship so that the serving network element can determine which client network elements need to be contacted via traffic messages in case of a fault.