Telecommunication networks have increased dramatically in size and complexity in recent years. A typical network may consist of hundreds of nodes, with network equipment being supplied by numerous manufacturers, each with different traffic and bandwidth requirements. This increase in complexity presents serious problems of network management and control. One aspect of network management is fault management, and an essential component of fault management is fault identification. Failures in large communication networks are normally unavoidable, yet quick detection and identification of the cause of a failure can make a communication system more robust and its operation more reliable. However, when a fault occurs in a network, an operator is often overwhelmed with messages, making fault localization a difficult task. Too much information has the same effect as too little information, i.e., fault identification is made more complex.
Since communication networks typically consist of devices independently manufactured by different vendors, the internal implementation of these devices commonly varies, although the interface of each device with the rest of the network conforms to widely accepted standards (e.g., SNA, ISO). Thus, each network device is typically independently designed. The designer of a communication system device usually ensures that both the device and its perceived interface, i.e., the rest of the network projected into the device's observation space, are working correctly. A natural design process includes designing alarms for the various fault conditions that the device may encounter when in operation. Thus, a device designer typically provides two types of alarms: (1) alarms for faults that exist within the device; and (2) alarms for faults that appear at the interface to which the device has to conform.
A fault within a device may disrupt its operation as well as its behavior towards other devices. This may cause many network devices to emit alarms indicating problems with their interfaces. (Traditionally, a device alarm consists of a text string and possibly a unique alarm identifier.) Thus, the system administrator can become overwhelmed with alarms generated by the same basic problem. Although in the abstract it may appear that more information assists in diagnosing a problem, in practice this is often not the case. Alarm messages usually do not carry the explicit information needed to diagnose a fault. Rather, alarms typically describe in detail the faulty condition, i.e., the symptom of the fault. They do not normally describe the cause of the fault.
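By way of illustration only, the following minimal Python sketch shows how a single internal fault can produce several alarms that describe symptoms rather than the cause. The `Alarm` record, its field names, the alarm texts, the identifier `INT-001`, and the three-device topology are all hypothetical and are not taken from any particular device or standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alarm:
    device: str                     # device that emitted the alarm
    kind: str                       # "internal" or "interface" (hypothetical labels)
    text: str                       # free-form text string describing the symptom
    alarm_id: Optional[str] = None  # unique identifier, not always present


def alarms_from_fault(topology, faulty):
    """Simulate the alarms raised when `faulty` fails.

    The faulty device reports an internal alarm; each adjacent device
    reports an interface alarm that describes only what it observes.
    """
    alarms = [Alarm(faulty, "internal", "self-test failed", "INT-001")]
    for neighbour in sorted(topology[faulty]):
        alarms.append(Alarm(neighbour, "interface", "interface timeout"))
    return alarms


# Hypothetical chain of three devices: A - B - C
topology = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
alarms = alarms_from_fault(topology, "B")
for alarm in alarms:
    print(alarm.device, alarm.kind, alarm.text)
```

In this sketch one fault yields three alarms, and the two interface alarms carry no indication that device B is the common cause.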
With a multitude of alarms, it can be difficult to:
(1) localize a fault. In most cases alarms do not explicitly indicate the location of a fault. An analysis of the emitted alarms must be performed to pinpoint the problem area of the network.
(2) correlate alarms. It is difficult for a human operator or even a software program to examine the hundreds of alarms which may occur substantially simultaneously in a network and assign those alarms to one or more particular fault conditions.
The present invention seeks to address these problems. Specifically, methods and systems are provided which examine emitted alarms, and the topology of a telecommunications network, to localize the area of the network where a fault has occurred and to correlate received alarms with one or more faults within the network.
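To make the combined use of alarms and topology concrete, the following Python sketch applies a simple greedy heuristic. It is an assumption made for illustration and is not the method of the invention: it repeatedly selects the node whose neighbourhood (the node itself plus its adjacent nodes) accounts for the most unexplained alarms, and assigns those alarms to that node. The function name `localize_and_correlate` and the five-node chain topology are hypothetical.

```python
def localize_and_correlate(topology, alarming_devices):
    """Greedy sketch: map each suspected fault location to the set of
    alarming devices it would explain.

    topology: dict mapping each node to the set of its adjacent nodes.
    alarming_devices: iterable of nodes that emitted at least one alarm.
    """
    unexplained = set(alarming_devices)
    suspects = {}
    while unexplained:
        # Pick the node whose neighbourhood covers the most unexplained
        # alarms; iterating in sorted order makes tie-breaking deterministic.
        best = max(
            sorted(topology),
            key=lambda n: len(({n} | topology[n]) & unexplained),
        )
        covered = ({best} | topology[best]) & unexplained
        if not covered:
            break  # remaining alarms come from devices outside the topology
        suspects[best] = covered
        unexplained -= covered
    return suspects


# Hypothetical chain of five devices: A - B - C - D - E
topology = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
result = localize_and_correlate(topology, ["A", "B", "C"])
print(result)
```

Here alarms from devices A, B, and C are all attributed to a single suspected fault at B, since B is the only node adjacent to or identical with every alarming device. A practical system would need a richer model of how faults propagate than adjacency alone.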