As is known, the latest-generation telecommunications networks and the services offered thereby, which are increasingly based upon the Internet protocol (IP), are a combination of technologies, network apparatuses and different functions (access, transport, control, service, content server), and in particular are composed of mutually interconnected apparatuses located at different protocol levels and using layering and client-server concepts for providing the network services to the customers. In this context, ensuring continuous provision of a service to customers, and preventing any problems that might be perceived by the customers, is one of the main tasks of an operator of telecommunications networks and services.
This task involves gathering and processing alarms arriving from the network apparatuses, suppressing alarms that are meaningless for the subsequent operations or that are redundant, automatically associating alarms generated by the same network resource (e.g. an operation re-establishment alarm with a related fault alarm), correlating received alarms relating to the same fault and generated by different apparatuses in the same or in a different domain (e.g. alarms received on transport apparatuses and alarms received on networking apparatuses), identifying faults in the network (supervision) based on the gathered alarms, identifying and analyzing the causes of the faults in such a way as to arrive at the so-called “root cause” (diagnosis), assessing the impact of the faults on the supported services, and finally undertaking all the actions necessary for repairing the faults (correction). Here, “fault” means an anomalous operating condition of a network apparatus or of one of its components, due to an actual failure, whether hardware or software, or to a performance degradation thereof, that triggers alarms sent by the network apparatus or by other directly or indirectly interconnected apparatuses.
The problem of identifying the root cause based on the alarms is fundamental for creating an effective and automated fault management system. However, the capacity to aggregate alarms originated by different network apparatuses, to discriminate important information from redundant information and to identify the cause(s) of the faults is difficult to implement in an automated manner. Alarm correlation and fault identification are the most difficult to automate because they require knowledge of the network apparatuses, in particular of their hardware architecture, and of the topological relationships between the various physical and logical components of both the network and the network services offered. The number of network apparatuses, network resources, and relationships between them is so large that the management of asynchronous information, such as alarms, is extremely complex. The large number of cases that can be encountered also makes it particularly critical to maintain and update the information (codes, rules and case tables) necessary to associate the alarms with one another and to recognize the fault that generated them.
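By way of illustration only, two of the operations listed above, namely the suppression of redundant alarms and the association of an operation re-establishment alarm with the related fault alarm, can be sketched as follows. The `Alarm` record, its field names, the "fault"/"clear" labels and the `active_faults` function are hypothetical and are introduced solely for this sketch; a real system would have to handle many more alarm types and timing conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    resource: str  # hypothetical identifier of the network resource
    kind: str      # "fault" or "clear" (hypothetical labels)
    time: int      # arrival time, arbitrary units


def active_faults(alarms):
    """Return the fault alarms still open after dropping repeats and pairing clears."""
    open_faults = {}
    for alarm in sorted(alarms, key=lambda a: a.time):
        if alarm.kind == "fault":
            # A repeated fault alarm on the same resource is redundant: keep the first.
            open_faults.setdefault(alarm.resource, alarm)
        elif alarm.kind == "clear":
            # A re-establishment alarm closes the related fault alarm, if any.
            open_faults.pop(alarm.resource, None)
    return list(open_faults.values())


alarms = [
    Alarm("port-1", "fault", 1),
    Alarm("port-1", "fault", 2),  # redundant repeat
    Alarm("port-2", "fault", 3),
    Alarm("port-1", "clear", 4),  # re-establishment for port-1
]
remaining = active_faults(alarms)
```

In this sketch only the fault alarm on "port-2" remains open, since the repeated alarm on "port-1" is dropped and the later re-establishment alarm closes the first one.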
In the field of alarm correlation and root cause identification, a proposal has been made by Dilmar Malheiros Meira, “A Model For Alarm Correlation in Telecommunications Networks”, November 1997, Ph.D. thesis in Computer Science, Institute of Exact Sciences (ICEx) of the Federal University of Minas Gerais (UFMG). In particular, this thesis proposes a general model for telecommunications networks and, from this model, derives a model for alarm correlation in the network as a whole. The model is based on a principle named “recursive multifocal correlation”, according to which a telecommunications network is partitioned into several sub-networks, each constituting a correlation focus. The breakdown of the problem into smaller sub-problems facilitates its solution and allows the use, in each sub-network, of the correlation technique most suitable to its peculiarities. The multifocal correlation principle may be applied recursively in each sub-network until the network element level is reached. The concepts developed have been utilized in the implementation of a prototype, used for alarm correlation in a canonical telecommunications network. By utilizing a commercial product as a tool for the development and evaluation of Bayesian networks, the occurrence of alarms has been simulated and the functioning of the model has been verified, both concerning the identification of the possible causes for the received alarms (diagnostic inference), and the prediction of possible effects (predictive inference).
Space is also given in academic literature (see for example the article by M. Steinder and A. S. Sethi, “Probabilistic Fault Localization in Communication Systems Using Belief Networks”, IEEE/ACM Transactions on Networking, vol. 12, No. 5, October 2004) to applications of Bayesian networks, or belief networks, which address the problem of alarm correlation by identifying the causality relationship between network faults and alarms, introduce probabilistic relationships between events, and make extensive use of the concept of conditional probability.
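The role played by conditional probability in such approaches can be illustrated with a minimal sketch of diagnostic inference by Bayes' rule for a single observed alarm. The sketch is not taken from the cited article: the cause names, the prior and likelihood values, and the assumption that the candidate causes are mutually exclusive and exhaustive are all hypothetical and serve only as an example.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: P(cause | alarm) is proportional to P(alarm | cause) * P(cause)."""
    joint = {cause: priors[cause] * likelihoods[cause] for cause in priors}
    evidence = sum(joint.values())  # P(alarm), summed over all candidate causes
    return {cause: value / evidence for cause, value in joint.items()}


# Hypothetical prior probabilities of each candidate cause (they sum to 1).
priors = {"link_down": 0.2, "card_failure": 0.1, "no_fault": 0.7}
# Hypothetical probabilities of observing the alarm given each cause.
likelihoods = {"link_down": 0.9, "card_failure": 0.5, "no_fault": 0.01}

beliefs = posterior(priors, likelihoods)
most_probable = max(beliefs, key=beliefs.get)
```

With these illustrative numbers the most probable cause of the observed alarm is "link_down", even though "no_fault" has the highest prior, because the alarm is far more likely to be raised when the link is down.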
The use in diagnostic systems of a Bayesian network model having link weights updated experientially is proposed in U.S. Pat. No. 6,076,083, which discloses an algorithm for easily quantifying the strength of links in a Bayesian network, a method for reducing the amount of data needed to automatically update the probability matrices of the network on the basis of experiential knowledge, and methods and algorithms for automatically collecting knowledge from experience and automatically updating the Bayesian network with the collected knowledge. A practical exemplary embodiment provides a trouble ticket fault management system for a communications network. In the exemplary embodiment, a communications network is represented as a Bayesian network where devices and communication links are represented as nodes in the Bayesian network. Faults in the communications network are identified and recorded in the form of a trouble ticket and one or more probable causes of the fault are given based on the Bayesian network calculations. When a fault is corrected, the trouble ticket is updated with the knowledge learned from correcting the fault. The updated trouble ticket information is used to automatically update the appropriate probability matrices in the Bayesian network.
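The general idea of updating probabilities from corrected trouble tickets can be sketched as a simple counting scheme. This is not the algorithm disclosed in U.S. Pat. No. 6,076,083; the `CauseTable` class, its method names, the symptom and cause labels, and the use of a pseudo-count for smoothing are assumptions made purely for illustration.

```python
from collections import defaultdict


class CauseTable:
    """Estimates P(cause | symptom) from closed trouble tickets by counting."""

    def __init__(self, causes, pseudo_count=1.0):
        self.causes = list(causes)
        # The pseudo-count keeps unseen causes from getting probability zero.
        self.pseudo_count = pseudo_count
        self.counts = defaultdict(lambda: defaultdict(float))

    def record_ticket(self, symptom, confirmed_cause):
        # Called when a ticket is closed and the actual cause is known.
        self.counts[symptom][confirmed_cause] += 1.0

    def probabilities(self, symptom):
        row = self.counts[symptom]
        total = sum(row[c] for c in self.causes)
        total += self.pseudo_count * len(self.causes)
        return {c: (row[c] + self.pseudo_count) / total for c in self.causes}


table = CauseTable(["fiber_cut", "card_failure"])
for cause in ["fiber_cut", "fiber_cut", "fiber_cut", "card_failure"]:
    table.record_ticket("loss_of_signal", cause)

estimate = table.probabilities("loss_of_signal")
```

Each closed ticket shifts the estimate toward the causes that were actually confirmed in the field, which is the experiential element; a full Bayesian network would hold one such table per node rather than a single table.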