1. Field of Invention
Embodiments of the invention relate, in general, to network management. More specifically, the embodiments of the invention relate to methods and systems for handling fault messages in a network.
2. Description of the Background Art
A network connects various network devices and allows them to communicate with each other. A Network Management System (NMS) is connected to the network devices to manage configuration, accounting, performance, security, and network faults. Faults are network events that reduce the network's performance. Examples of faults include a port going down, a link going down, and a network card becoming unavailable when it is pulled out of a network device. When a fault occurs at a network device, that network device raises fault messages, herein referred to as source fault messages, and other network devices raise fault messages, herein referred to as related fault messages. All of these fault messages are conveyed to the NMS. Many, if not most, of them are redundant and provide no meaningful information. The NMS must correlate the fault messages to determine the principal cause of the fault. Then, based on the principal cause, the NMS takes an appropriate action to protect the network from the effects of the fault. Examples of appropriate actions include updating routing tables to bypass the source network device and sending alerts to network administrators that identify the source network device.
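The distinction between source and related fault messages can be illustrated with a minimal sketch. It is not the method of any particular NMS. The `FaultMessage` record, its `reporter` and `suspect` fields, the device names, and the majority-vote heuristic in `principal_cause` are all assumptions introduced only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultMessage:
    """Hypothetical fault record; the field names are illustrative assumptions."""
    reporter: str  # device that raised the message
    suspect: str   # device the message points at
    event: str     # short description of the observed event


def principal_cause(messages):
    """Return the device named most often as the suspect, or None if empty.

    Assumption: a simple majority vote stands in for real correlation logic.
    """
    counts = Counter(m.suspect for m in messages)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# A port goes down on router-a; its neighbors also report symptoms.
messages = [
    FaultMessage("router-a", "router-a", "port down"),
    FaultMessage("router-b", "router-a", "link down"),
    FaultMessage("router-c", "router-a", "neighbor unreachable"),
    FaultMessage("router-c", "router-d", "high latency"),
]

cause = principal_cause(messages)

# Source fault messages come from the faulty device itself;
# related fault messages come from other devices pointing at it.
source = [m for m in messages if m.reporter == cause]
related = [m for m in messages if m.reporter != cause and m.suspect == cause]
```

In this sketch most of the messages describe the same underlying fault, which mirrors the redundancy noted above: the NMS receives several messages but needs only one conclusion.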
In conventional techniques, the NMS stores the multiple fault messages in a database and correlates the stored fault messages. The correlation tasks consume substantial resources, such as CPU time, memory, disk space, and administrator time. Conventional techniques for correlating the fault messages include rule-based correlation, codebook correlation, and manual correlation. Because of its complexity, the correlation process is often time-consuming, which increases network downtime. Moreover, the NMS typically discards some fault messages when the number of fault messages it receives exceeds its capacity, which further complicates the task of determining the principal cause of the fault.
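The effect of discarding fault messages at capacity can be shown with a small sketch. The `BoundedFaultStore` class, its capacity of three, the message strings, and the policy of dropping the newest arrival are assumptions chosen for illustration. The description above does not specify which messages a conventional NMS discards.

```python
from collections import deque


class BoundedFaultStore:
    """Hypothetical fixed-capacity store for incoming fault messages."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.stored = deque()
        self.discarded = 0

    def receive(self, message):
        """Store the message if there is room; otherwise count it as discarded.

        Assumption: when full, the newest arrival is the one dropped.
        """
        if len(self.stored) >= self.capacity:
            self.discarded += 1
            return False
        self.stored.append(message)
        return True


store = BoundedFaultStore(capacity=3)

# A burst in which the related fault messages happen to arrive first.
burst = [
    "related: link down (router-b)",
    "related: neighbor unreachable (router-c)",
    "related: route withdrawn (router-d)",
    "source: port down (router-a)",
]
accepted = [store.receive(m) for m in burst]

# Check whether the one message naming the actual fault survived.
source_kept = any(m.startswith("source") for m in store.stored)
```

Under these assumptions the only source fault message is lost, so later correlation has only symptoms to work with.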