1. Field of Invention
This invention relates to the field of network management. Specifically, the present invention relates to network fault management.
2. Description of the Related Art
Communications networks are used in a wide variety of military and commercial applications, such as avionics applications, medical imaging applications, etc. With the exponential growth of modern networks, network management has become a significant issue. A typical communications network includes a number of disparate devices (e.g., switches, satellites, various input devices, etc.) made by different manufacturers and communicating with different communications protocols. Each of these disparate devices may represent a potential point of failure in the network. In addition, the devices themselves include multiple components, such as processors or network interface cards (NICs); each individual device may therefore have multiple points of failure within the device itself.
Typically, network managers are used to monitor, detect, isolate, and resolve device faults. Conventionally, a network manager is implemented in software on a server placed at a location in the network. Many network devices, such as switches and network interface cards, are passive, meaning that the devices only forward messages and do not originate messages. Therefore, a typical network manager will detect a fault in such a device only during a communications session or when the network manager loses communication with a portion of the network that includes the device. As a result, user data may be lost or delayed.
The latency in fault detection is also an issue, since conventional network managers can detect faults only when a communications session is initiated or when a portion of the network is inoperable. As a result, it becomes more difficult to correlate and isolate faults, especially when several faults occur at the same time. In addition, network managers capable of monitoring network devices at the component level may not receive communication of a component fault at all if the failure of the component renders the component or device inoperable, or if the fault is in the communication path between the component and the network manager.
A second conventional technique used to manage network failures involves the use of a ‘heartbeat protocol’. The protocol is referred to as a ‘heartbeat protocol’ because it enables the network manager to send out periodic test messages to communications devices in the network. However, heartbeat protocols consume network resources: the more frequent the test messages, the greater the depletion of those resources. As the size of the network increases and more devices need to be tested, the bandwidth required for the heartbeat protocol increases and the resources available for communication decrease.
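The scaling behavior described above can be illustrated with a rough estimate. The following is a minimal sketch only, assuming each device is probed once per interval with a fixed-size test message and, optionally, returns a reply of the same size. The function name, parameters, and numeric values are hypothetical and are not taken from any particular heartbeat protocol.

```python
def heartbeat_bandwidth_bps(num_devices, message_bytes, interval_s, replies=True):
    """Estimate steady-state heartbeat traffic in bits per second.

    Assumptions (illustrative only): every device is probed once per
    interval, each test message has the same size, and a reply, if any,
    is the same size as the probe.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # One probe per device, plus one reply when replies are expected.
    messages_per_device = 2 if replies else 1
    bits_per_interval = num_devices * messages_per_device * message_bytes * 8
    return bits_per_interval / interval_s


# Hypothetical figures: 64-byte test messages with replies.
base = heartbeat_bandwidth_bps(num_devices=100, message_bytes=64, interval_s=1.0)
more_devices = heartbeat_bandwidth_bps(num_devices=1000, message_bytes=64, interval_s=1.0)
faster_probe = heartbeat_bandwidth_bps(num_devices=100, message_bytes=64, interval_s=0.5)
```

Under these assumptions the overhead grows linearly with the number of devices and inversely with the test interval, so reducing detection latency by probing more often comes directly at the expense of bandwidth available for user traffic.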
Hence, a need exists in the art for a real-time network management system that will facilitate the correlation and isolation of faults. There is a further need for a network management system capable of detecting a network fault with minimal latency and minimal drain on system resources. Lastly, there is a need for a network management technique that enables the management of disparate devices, including passive devices.