No optimal solution has yet been developed for alarm correlation and root cause analysis in optical transport networks that utilise SDH and DWDM. The approaches proposed so far are either best suited to other application domains, such as IP-based networks, or are generic from an architectural perspective and lack the means to introduce architectural components that support the specific behaviour of particular equipment types.
In U.S. Pat. No. 5,528,516 there is described an apparatus and method for determining the root cause (i.e., the source) of a problem in a complex system such as a computer network. The problem identification process described in this document is split into two separate activities: (1) generating efficient codes for problem identification and (2) decoding the problems at run time. A causality matrix is created which relates observable symptoms to likely problems in the system. This causality matrix is reduced into a minimal codebook by eliminating unnecessary information. Observable symptoms are monitored and problems are decoded by comparing the observable symptoms against the minimal codebook using best-fit approaches. A Hamming distance measure between symptoms and codes in the codebook is defined; the set of reference symptoms that is closest to the observed symptoms is selected, and the problem associated with this symptom set is proposed as the probable cause of the actual observed symptoms.
This approach is not very flexible when multiple simultaneous problems must be handled. In that case, symptoms from different problems may coincide and overlap, and a more elaborate algorithm than a distance measure may be needed.
This approach does not deal with simultaneous failures and, as such, would not deal with the example presented below.
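By way of illustration only, the best-fit decoding step and its difficulty with overlapping symptoms can be sketched as follows. The codebook contents, problem names and function names are hypothetical and are not taken from U.S. Pat. No. 5,528,516; the sketch assumes binary symptom vectors and a single best-fit answer.

```python
# Hypothetical minimal codebook: each problem maps to a binary code,
# one position per monitored symptom (1 = symptom expected).
CODEBOOK = {
    "link_failure": (1, 1, 0, 0),
    "card_failure": (0, 1, 1, 0),
    "power_loss": (1, 0, 1, 1),
}


def hamming(a, b):
    """Return the number of positions at which two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def decode(observed, codebook=CODEBOOK):
    """Return the problem whose code is closest to the observed symptoms.

    Ties are resolved in favour of the first codebook entry, which is an
    arbitrary choice made for this sketch.
    """
    return min(codebook, key=lambda problem: hamming(codebook[problem], observed))


# A single problem with one spurious symptom is still decoded.
single = decode((1, 1, 0, 1))

# Two simultaneous problems whose symptoms merge into one observation.
merged = tuple(
    a | b for a, b in zip(CODEBOOK["link_failure"], CODEBOOK["card_failure"])
)
overlap = decode(merged)
```

With this hypothetical codebook, the observation (1, 1, 0, 1) is decoded as link_failure despite the spurious fourth symptom. When link_failure and card_failure occur together, the merged observation (1, 1, 1, 0) lies at distance 1 from both codes, so a single best-fit answer can name at most one of the two actual problems.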
In WO 02/33980 there is described a topology-based reasoning apparatus for root cause analysis of network faults. A root cause analysis system operates in conjunction with a fault management apparatus. The system includes a topology-based reasoning system operative to topologically analyse alarms in order to identify the root cause of the alarms. The system is based on topological network information and fault propagation rules. The topology is translated to a graph onto which incoming alarms and expected alarm behaviour are coordinated. The system's operator must provide the rules.
In this approach, the root cause decision is based on three parameters: (1) the distance in the network between the suspected root cause and the point of origin of each alarm generated by it, (2) the number of alarms in the incoming group that are explained by that root cause and (3) the number of alarms out of all alarms that the system expects for that root cause. When the root cause cannot be pinpointed solely on the basis of the rules, an expert system is used.
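WO 02/33980 is not quoted here as stating how the three parameters are combined. Purely as an illustrative assumption, the sketch below multiplies three normalised terms to rank candidate root causes; the class, field and function names and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    mean_distance: float  # parameter 1: mean hop count to the alarm origins
    explained: int        # parameter 2: incoming alarms explained by the candidate
    expected: int         # parameter 3: alarms the system expects for the candidate


def score(candidate, incoming_total):
    """Combine the three parameters into one figure (assumed formula)."""
    if incoming_total <= 0 or candidate.expected <= 0:
        return 0.0
    coverage = candidate.explained / incoming_total         # share of incoming group explained
    completeness = candidate.explained / candidate.expected  # share of expected alarms seen
    proximity = 1.0 / (1.0 + candidate.mean_distance)        # nearer candidates score higher
    return coverage * completeness * proximity


def best_candidate(candidates, incoming_total):
    """Return the highest-scoring candidate root cause."""
    return max(candidates, key=lambda c: score(c, incoming_total))


# Hypothetical example: four incoming alarms and two suspected root causes.
near = Candidate("near_node", mean_distance=1.0, explained=4, expected=4)
far = Candidate("far_node", mean_distance=3.0, explained=2, expected=5)
best = best_candidate([near, far], incoming_total=4)
```

In this example near_node is selected because it explains every incoming alarm, all of its expected alarms are present, and it lies closer to the alarm origins. A different weighting of the three terms could change the ranking, which is why the combination shown is only an assumption.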
It is not straightforward to use this approach in complex networks such as SDH/DWDM networks, which have a high number of layers and vendor-specific idiosyncrasies of equipment types. This is because the operator must define the fault propagation rule set, which is a difficult task when dealing with complex networks. Only occasionally do network operators have all the necessary know-how to accomplish this.
Furthermore, such a distance criterion may not be the best choice for a network that is modelled as a layered network, such as an SDH/DWDM network, because alarms propagate transparently across lower layers.
U.S. Pat. No. 5,946,373 describes a topology-based fault analysis system for use in telecommunications networks. The system correlates alarms and infers the root cause of a problem based on the topological configuration of the network. For this purpose, the system uses truth tables, a rule-based inference engine, or a combination of both. This approach has potential problems. The rules and truth tables must be made mutually exclusive so that only one will be found to be true. The truth tables and rules must also be designed in such a way that changing one of their entries does not require changing another. Therefore, this apparatus requires that the rules and/or truth tables be ordered in a “most significant” result fashion. That is, conditions that are considered to be the most important are analysed first, leaving the less important faults for a later analysis should one be required. This is a cumbersome approach because it involves a judgement of which problems are the most important. The addition of new rules or truth tables may also lead to a reordering of the results. In effect, the operator is required to know which rules make up the system and their level of importance in order to take advantage of the system.
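The dependence on rule ordering can be illustrated with a minimal first-match evaluator. This is a sketch under assumed names only: the rules, alarm identifiers and function below are hypothetical and do not reproduce the rules or truth tables of U.S. Pat. No. 5,946,373.

```python
def first_match(rules, alarms):
    """Return the name of the first rule whose condition holds, or None.

    Rules are evaluated in list order, so the position of a rule in the
    list decides the result whenever more than one condition is satisfied.
    """
    for name, condition in rules:
        if condition(alarms):
            return name
    return None


# Hypothetical rules, ordered from "most significant" to least significant.
RULES = [
    ("fibre_cut", lambda a: {"LOS_east", "LOS_west"} <= a),
    ("port_fault", lambda a: "LOS_east" in a),
]

alarms = {"LOS_east", "LOS_west"}

# Both conditions hold for these alarms; the ordering decides the outcome.
ordered_result = first_match(RULES, alarms)
reordered_result = first_match(list(reversed(RULES)), alarms)
```

Here the same alarm set yields fibre_cut with the original ordering and port_fault when the order is reversed, so inserting or moving a rule can silently change the diagnosis unless the operator knows the relative importance of every rule.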
None of the systems described in the prior art is particularly suitable for supporting a layered transport network, like an SDH network.