The difficulty of managing a communications network is directly proportional to its complexity. As networks grow in complexity, so too does the difficulty of managing them. Managing a network includes one or more of the following: retrieving historical performance, observing that the network is currently functioning properly, and ensuring that the network will function properly in the future. To accomplish each of these functions, feedback from the network is necessary. The most widely relied-upon form of feedback is the alarm.
Alarms provide feedback that element interaction or network elements themselves are not functioning as intended. But a complex communications network may produce on the order of thousands of alarms per hour or millions of alarms per day. An alarm may be referred to in the art as a message, alert, event, warning, or other data indication. Maintaining awareness of the potential barrage of alarms, as well as troubleshooting the sources of those alarms, has historically been a resource-intensive process that plagues network administrators.
At least three aspects contribute to the difficulty of managing a communications network: vendor diversity, geographic disparity, and disparate equipment roles. The larger a network grows, the more likely it is that components will be provided by various vendors rather than by a single vendor. For example, a communications network may include Nortel switches, Cisco routers, Lucent network devices, etc.
Different vendors often indicate similar real-world happenings using different protocols, terms, phrases, or notifications. Consider two persons meeting. In western culture, a handshake is common. In eastern culture, bowing to each other is common. But in both cases, the two persons are greeting each other. If a person from a foreign country witnesses both events, then it would be beneficial for a translator to explain that each course of action corresponds to the same event: a greeting. The problem is that, without an interpreter, the foreign witness will not realize that both courses of conduct correspond to the same happening, just in a different format.
A similar problem exists when disparate vendor components are used to communicate information corresponding to similar fault states, such as a loss of signal. If a Nortel device communicates a loss-of-signal notification in a first manner, but a Cisco device communicates a similar loss-of-signal notification in a second manner, then a scheme should be implemented so that each manner of communication is mapped to the same network ailment; here, loss of signal. In both cases, a loss-of-signal alarm should be conveyed to an analyst. Such mapping is wanting in the prior art.
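By way of illustration only, the mapping described above might be sketched as a lookup table that translates each vendor-specific notification into one common fault name. The vendor labels, alarm codes, and the names FAULT_MAP, NormalizedAlarm, and normalize are hypothetical and are not drawn from any actual vendor equipment or from the system described here.

```python
from dataclasses import dataclass

# Hypothetical table: (vendor, vendor-specific code) -> common fault name.
# These codes are invented for illustration; they are not real vendor formats.
FAULT_MAP = {
    ("vendor_a", "LOS-01"): "LOSS_OF_SIGNAL",
    ("vendor_b", "sig_lost"): "LOSS_OF_SIGNAL",
    ("vendor_a", "PWR-07"): "POWER_FAILURE",
}


@dataclass(frozen=True)
class NormalizedAlarm:
    device_id: str
    fault: str


def normalize(vendor: str, code: str, device_id: str) -> NormalizedAlarm:
    """Translate a vendor-specific alarm into the common fault vocabulary."""
    # Unmapped codes are kept rather than dropped, flagged as UNKNOWN.
    fault = FAULT_MAP.get((vendor, code), "UNKNOWN")
    return NormalizedAlarm(device_id=device_id, fault=fault)


a = normalize("vendor_a", "LOS-01", "switch-1")
b = normalize("vendor_b", "sig_lost", "router-9")
print(a.fault == b.fault)  # both map to the same network ailment
```

Under these assumptions, two differently formatted notifications resolve to the same fault, so an analyst would see a single kind of loss-of-signal alarm regardless of the originating vendor.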
Geographic challenges also contribute to the complexity of a network. A carrier should be able to identify what elements are present on its network, the location of those elements, and what functionality is offered by those elements. When problems occur, a carrier would preferably be able to identify the location of faulty devices. If a carrier does not know the location of a device that is causing an alarm, then responding to that alarm will be exceedingly difficult. The present invention addresses this need.
Different network devices perform different roles. Switching and routing components help determine where and how to direct data across a communications network. A network may be composed of several hundred different types of devices that perform different types of activities. When a specific type of component fails, the functionality that the component was offering will be compromised. Describing or understanding the nature of what functionality has been compromised is also difficult but desirable.
A failed communications device can be queried or tested to help identify the nature of its problem. But the method of interrogation itself may vary across components. Moreover, no generic alarm set exists. Specific devices provide specific alarms in specific ways, which can make interpreting those alarms difficult. Unlike a physician treating a new patient, a network controller cannot simply ask all network devices a common question in a universal format, such as “what is wrong?” Rather, a troubleshooter must know which questions to ask and in what manner to retrieve troubleshooting data. For example, consider a routing device that is routing data to the wrong address. The device may be queried to determine a list of destination addresses. But such a query request would be wholly inappropriate to submit to a power supply that is providing power beyond acceptable tolerance levels.
Because of the complexities briefly described above, along with a myriad of other factors, identifying relevant alarms and addressing them is difficult. As previously mentioned, an alarm can assume many forms, from a warning to an indication of a severe problem. Historically, no distinction is made with respect to displaying the various alarms. Rather, each alarm is displayed on a user interface, which can quickly become crowded and confusing. Moreover, the only information provided is the alarms themselves. And after they are remedied, they are deleted. No sort of root-cause analysis is performed on the alarms. The arduous task of determining the respective underlying causes of each alarm has historically been relegated to a human being. Determining the various causes that gave rise to the plethora of alarms is a difficult task for a person.
Consider the situation where a first alarm gives rise to multiple subordinate alarms. Without the benefit of root-cause analysis, a technician may begin allocating resources to resolving the subordinate alarms when resolution of the primary problem would solve the propagated problems. For example, consider a device that loses power. The loss of power would trigger other alarms related to whatever functionality the device was supposed to perform. The other alarms, as well as the power-loss alarm, would all be displayed for viewing in a control room. At this point, troubleshooting begins. Without the benefit of the present invention that will be described below, an analyst is not provided direction as to how to begin addressing each alarm. Although this simplistic example appears to be a relatively easy problem for an experienced analyst to solve, a carrier can rely neither on the subjective experience of an analyst nor on such a simplistic example. Even if a carrier were to rely on such benefits, it would have to address such issues as a steep learning curve related to deciphering network alarms and the risk of losing personnel who have mastered a sense and feel for addressing primary alarms.
Still worse, some primary alarms can give rise to sympathetic alarms, which are alarms associated with otherwise properly functioning devices. In such a scenario, ultimate problem resolution can be prolonged because troubleshooting the devices that raised the sympathetic alarms will yield results consistent with those devices working properly. Consider a telephone user who cannot make outgoing calls because, unbeknownst to him, his telephone line has been inadvertently cut. A substantial amount of time could be wasted if a technician were dispatched to troubleshoot the telephone. All of the tests initiated on the telephone would yield results consistent with a properly functioning telephone. The root cause does not lie within the telephone device itself, but rather with a compromised communications line. This coarse example illustrates how resources can be wasted while attempting to resolve a child alarm that stems from a parent cause, wherein resolution of the parent cause would eliminate the child alarm.
During a network “firestorm,” an operations center may be bombarded with thousands or even hundreds of thousands of alarms. Receiving such a large number of alarms from heterogeneous resources makes resolving the problems associated with the alarms difficult. It is very difficult to identify the most important alarms. Alarms are laboriously tracked down by hand in an attempt to determine any relationships between them. An experienced network specialist may eventually be able to track down and focus in on some of the causes, but such a process is manual, labor intensive, and time consuming. The problem is compounded when a new or less experienced specialist, who would generally become overwhelmed and intimidated by the circumstances, is charged with resolving the problems associated with the alarms.
Currently, solving problems in and maintaining a telecommunications network depend on the knowledge and experience of the people monitoring the alarms. Dependency on one or more particular persons becomes a problem if any or all of those people were to quit their jobs. Another issue that carriers face is the inability to enrich the topology information that equipment providers supply for determining the location of a component within the network. Currently, no known system provides a root-cause analysis capable of deciphering a top-level event among a plurality of equipment providers.
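By way of illustration only, the parent and child relationship in the telephone example might be sketched as follows. The sketch assumes that a dependency map from each device to the device it relies upon is already known, which the text above does not state; the function name root_causes and the device names are hypothetical, and this is one possible approach rather than the technique of any particular system.

```python
from __future__ import annotations


def root_causes(alarming: set[str], depends_on: dict[str, str]) -> set[str]:
    """Return the alarming devices that have no alarming device above them."""
    roots = set()
    for device in alarming:
        top = device          # highest alarming device found so far
        current = device
        seen = {device}       # guards against cycles in the dependency map
        while current in depends_on:
            current = depends_on[current]
            if current in seen:
                break
            seen.add(current)
            if current in alarming:
                top = current
        roots.add(top)
    return roots


# Hypothetical topology: the telephone relies on the line, the line on the exchange.
depends_on = {"telephone": "line", "line": "exchange"}
print(root_causes({"telephone", "line"}, depends_on))  # {'line'}
```

In this sketch the telephone alarm is treated as a child of the line alarm, so only the line is reported as a candidate for troubleshooting.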
What is needed is a robust message-enriching system that allows alarms from disparate network-element vendors to be received and troubleshot using techniques that incorporate deduplication, thresholding, pattern recognition, root-cause analysis, and display management.
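By way of illustration only, two of the techniques named above, deduplication and thresholding, might be sketched as follows. The sketch assumes that an alarm can be represented as a (device, fault) pair and that thresholding means surfacing only alarms repeated at least a minimum number of times; both are assumptions made for this example, and the function names are hypothetical.

```python
from collections import Counter


def deduplicate(alarms):
    """Collapse repeated (device, fault) pairs into one entry with a count."""
    return Counter(alarms)


def over_threshold(counts, minimum):
    """Return, in sorted order, the alarms seen at least `minimum` times."""
    return sorted(key for key, n in counts.items() if n >= minimum)


# Hypothetical alarm stream: one device repeats the same alarm three times.
stream = [("switch-1", "LOSS_OF_SIGNAL")] * 3 + [("router-9", "POWER_FAILURE")]
counts = deduplicate(stream)
print(len(counts))                # 2 distinct alarms instead of 4 raw ones
print(over_threshold(counts, 2))  # [('switch-1', 'LOSS_OF_SIGNAL')]
```

Under these assumptions, four raw alarms reduce to two displayed entries, and only the repeated alarm crosses the threshold.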