The difficulty of managing a communications network increases with the network's complexity. Managing a network includes one or more of the following: retrieving historical performance data, observing that the network is currently functioning properly, and ensuring that the network will function properly in the future. To accomplish each of these functions, feedback from the network is necessary. The most widely relied-upon form of feedback is the alarm.
Alarms provide feedback that network elements, or the interactions between them, are not functioning as intended. A complex communications network, however, may produce on the order of thousands of alarms per hour or millions of alarms per day. An alarm may be referred to in the art as a message, alert, event, warning, or other data indication. Maintaining awareness of this potential barrage of alarms, and troubleshooting the sources of the alarms, has historically been a resource-intensive process that plagues network administrators.
Maintaining a communications network and determining the root cause of a problem within it depend on the knowledge and experience of the technicians monitoring alarms originating from the network. Monitoring incoming alarms becomes even more difficult when the communications network is especially large and comprises many elements, any of which may at a given time have problems that affect network service. The topology data of a network is often incomplete and spread out among several data structures. Because the topology data is incomplete, determining the root cause of a problem is time consuming and costly, and requires the technician to know which data structure to access for which part of the topology.
Furthermore, network communication architectures comprise many layers of facilities. Although there are many definitions of a facility, one definition is a portion of a network that carries traffic at a continuous bandwidth. Network architectures may also have built-in protection schemes that allow traffic to continue flowing from upstream facilities to downstream facilities and, eventually, to end users, even when one or more facilities are disabled. In other words, downstream facilities may not notice a disabled upstream facility because of the protection scheme. Network technicians monitoring the facilities, however, are aware of a problem within the communications network, because a technician may be receiving multiple alarms or alerts transmitted from various network elements within the facility. Possible causes of the disabled facility include a faulty network element or a severed transmission pathway used by the facility. The technician may have to physically inspect each alarming network element supporting the disabled facility to determine the root cause of the problem. A facility may comprise thousands of elements spread out over vast distances, so inspecting each alarming network element is time consuming and costly.
Also, each type of component in a communications network has its own protection scheme for the event that the component is alarming. In some cases, there is a one-to-one relationship between a working component and a protection component. In other cases, there is a one-to-many relationship, in which one protection component serves several working components. Generally, technicians monitoring the communications network group alarms into service affecting alarms and service impacting alarms. Service affecting alarms may be understood as those alarms that do not impact service to the end user of the communications network, but have the potential to do so. Service impacting alarms, by contrast, may be understood as those alarms that do impact service to the end user. Classifying either type of alarm requires a thorough understanding of the protection scheme for each component: the protection scheme must be evaluated before a determination can be made as to whether the alarming component is service impacting. A large long-distance telecommunications network generally includes many components and their associated protection schemes, so evaluating whether an alarm is service impacting or service affecting may become very complicated in such networks.
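The evaluation described above can be illustrated with a minimal sketch. It is not taken from any particular system: the `ProtectionGroup` structure, the `classify` function, and the rule that each healthy protection component can cover exactly one alarming working component are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

SERVICE_AFFECTING = "service affecting"   # no end-user impact yet, but potential for it
SERVICE_IMPACTING = "service impacting"   # end-user service is impacted


@dataclass
class ProtectionGroup:
    """Hypothetical model: working components sharing a pool of protection components."""
    working: List[str]
    protection: List[str]


def classify(group: ProtectionGroup, alarming: Set[str]) -> Dict[str, str]:
    """Classify each alarming component in the group.

    Assumption: one healthy protection component can take over for exactly
    one alarming working component, in the order the working components
    are listed.
    """
    # Protection components that are not themselves alarming are available.
    spares = sum(1 for p in group.protection if p not in alarming)
    result: Dict[str, str] = {}
    # An alarming protection component carries no traffic, so it only
    # reduces the margin of safety.
    for p in group.protection:
        if p in alarming:
            result[p] = SERVICE_AFFECTING
    for w in group.working:
        if w in alarming:
            if spares > 0:
                spares -= 1
                result[w] = SERVICE_AFFECTING   # traffic switched to a spare
            else:
                result[w] = SERVICE_IMPACTING   # nothing left to switch to
    return result


# One protection component shared by three working components.
group = ProtectionGroup(working=["w1", "w2", "w3"], protection=["p1"])
print(classify(group, {"w1"}))
print(classify(group, {"w1", "w2"}))
```

Even in this simplified model, the classification of a single alarm depends on the state of every other component in its protection group, which suggests why manual evaluation becomes complicated in a large network.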
As discussed above, technicians monitoring a communications network must evaluate whether an alarm is service impacting or service affecting. Service impacting alarms generally receive the highest priority. Each severe alarm from a network element creates a ticket documenting the alarm. Customers serviced by the communications network may also call a service center to report problems when their service is impacted, thus creating a customer called-in ticket. Still other tickets may be generated by entities associated with the communications network other than the monitoring technicians. All of these tickets may relate to the same network problem. In large networks, many thousands of tickets of various types may be generated. The technicians monitoring the system may clear one type of ticket relating to a problem, but doing so may not affect the other types of tickets relating to the same problem. Technicians must manually sift through the various tickets as they arrive and determine whether each ticket relates to a previously reported ticket or to a new problem. Associating the various ticket types is laborious and an inefficient use of the technicians' time, especially in large communications networks.
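To make the manual association task concrete, the following sketch groups tickets of different types that plausibly relate to the same problem. The `Ticket` fields, the use of a shared `facility` identifier as the correlation key, and the one-hour window are hypothetical choices for illustration, not a description of how any existing ticketing system works.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Ticket:
    ticket_id: str
    kind: str         # e.g. "alarm", "customer", "other"
    facility: str     # hypothetical correlation key
    opened_at: float  # seconds since some reference time


def correlate(tickets: List[Ticket], window_seconds: float = 3600.0) -> List[List[Ticket]]:
    """Group tickets that name the same facility and were opened within
    `window_seconds` of the first ticket in their group."""
    groups: List[List[Ticket]] = []
    latest: Dict[str, List[Ticket]] = {}  # facility -> most recent group
    for ticket in sorted(tickets, key=lambda t: t.opened_at):
        group = latest.get(ticket.facility)
        # Start a new group if the facility is unseen or the window has expired.
        if group is None or ticket.opened_at - group[0].opened_at > window_seconds:
            group = []
            groups.append(group)
            latest[ticket.facility] = group
        group.append(ticket)
    return groups


tickets = [
    Ticket("A1", "alarm", "F1", 0.0),
    Ticket("C1", "customer", "F1", 600.0),
    Ticket("A2", "alarm", "F2", 700.0),
    Ticket("C2", "customer", "F1", 10000.0),
]
for group in correlate(tickets):
    print([t.ticket_id for t in group])
```

In practice a customer called-in ticket rarely names a facility directly, which is part of why technicians have had to make these associations by hand.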
Still other inefficiencies plague alarm monitoring in communications networks. Generally, the technicians monitoring a communications network are grouped according to network elements. For example, one set of technicians monitors one type of network element, while other technicians monitor other types of network elements. Each technician typically receives all alarm data related to the assigned type of network element. Multiple technicians may monitor a particular type of network element, but each technician is concerned with only a subset or subsets of the incoming alarm data. Technicians must pull from a data structure the alarm data that they are interested in viewing on their user interface. This is a bandwidth-intensive process, especially when the communications network is large and includes many elements, and the resulting delays in communicating alarm data to the technician decrease the efficiency of handling alarms. The bandwidth demand also creates scalability problems. To accommodate additional technicians or user interfaces, additional data structures must be added to the communications network. These data structures must be fault tolerant and require maintenance and support, which drives the cost of adding them into the hundreds of thousands of dollars. In large communications networks that generate thousands of alarms, this is a cost-prohibitive solution.
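A rough way to see the bandwidth cost of this pull model is to compare the number of alarm records transferred with the number each technician actually views. The sketch below is illustrative only: the `Alarm` fields, the use of a `region` attribute to define a technician's subset, and the `pull_cost` function are assumptions rather than features of any described system.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Alarm:
    element_type: str
    region: str  # hypothetical attribute defining a technician's subset


def pull_cost(alarms: List[Alarm],
              technicians: List[Tuple[str, Set[str]]]) -> Tuple[int, int]:
    """Return (records transferred, records displayed).

    Each technician is described by (element_type, regions of interest).
    Assumption: every technician pulls all alarms for the assigned element
    type and filters them locally on the user interface.
    """
    transferred = 0
    displayed = 0
    for element_type, regions in technicians:
        pulled = [a for a in alarms if a.element_type == element_type]
        transferred += len(pulled)
        displayed += sum(1 for a in pulled if a.region in regions)
    return transferred, displayed


alarms = [
    Alarm("switch", "east"), Alarm("switch", "east"),
    Alarm("switch", "west"), Alarm("switch", "north"),
    Alarm("router", "east"), Alarm("router", "west"),
]
technicians = [("switch", {"east"}), ("switch", {"west"})]
print(pull_cost(alarms, technicians))
```

Under these assumptions the transferred volume grows with the number of technicians multiplied by the number of alarms for the element type, while the displayed volume grows only with the size of each technician's subset.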