The present invention generally relates to the management of network systems, and more specifically to identifying link failures in a network.
A computer network generally includes a number of devices, including switches, routers and hubs, connected so as to allow communication among the devices. The devices within a network are often categorized into two classes: end stations such as workstations, desktop PCs, printers, servers, hosts, fax machines, and devices that primarily supply or consume information; and network devices such as gateways, switches and routers that primarily forward information between the other devices.
Network devices ordinarily operate on a continuous basis. Each device has one or more circuit boards, a microprocessor and a memory, and runs a control program. In general, networks often include several different types of data switching and routing devices. These network devices may have different physical characteristics. New devices, with characteristics that are presently unknown, are constantly being developed. In addition, the characteristics of many network devices may change over time. For example, characteristics of the network devices change when subsystems like boards, network interface modules, and other parts are added or removed from a device.
Many networks are managed, supervised and maintained by a network administrator or network manager. To properly maintain a network, the network administrator needs to have up-to-date information available about the devices in the network and how the devices are interconnected. The OSI network reference model is useful in classifying network management information. Information about what devices are in the network is generally called Layer 3 information, and information about how the devices are physically connected within the network is called Layer 2 information. In combination, this information may be used by the network administrator to understand the physical topology of the network. The topology is a mapping that indicates the type of devices that are currently included in the network and how the interfaces of these devices are physically linked.
In addition to understanding the physical topology of the network, the network administrator must have the ability to determine when a network failure has occurred. The network administrator must also have the ability to identify and isolate a particular failure that has occurred within the network, as a single failure can often affect the ability to communicate between many devices throughout the network.
For example, FIG. 1 illustrates a network 100 that includes a plurality of devices (A-G) that are configured to communicate with each other over links 102, 104, 106, 108, 110, 112. Devices A, B, C, D, E, F, G are, e.g., routers, switches, gateways, end stations, etc. If a communication failure occurs between device B and device C (a failure of link 106), the ability to communicate between device B and device C is affected, and the ability to communicate between devices (A, G) and devices (D, E, F) is affected. Thus, to be able to address and correct a particular link failure, the network administrator needs to be able to determine that a failure has occurred within the network 100, and to identify and isolate the specific link within the network that has failed.
One method for identifying and isolating a specific link failure within a network is by probing the end-points of each link (the pair of devices at the ends of each link) to obtain communication data for each of the devices. For example, referring to FIG. 1, to isolate a failure within network 100, the network administrator may attempt to contact the end-points of each link 102, 104, 106, 108, 110, 112 within network 100 to retrieve SNMP data from each of the devices A, B, C, D, E, F, G. Based on the SNMP data that is retrieved, the network administrator may attempt to isolate the specific link that has failed within the network. However, a drawback with using this technique to isolate failures is that the SNMP data may not be readily available for each of the devices within the network. For example, the network administrator may not have access to the specific community string that is required to read the SNMP data from one or more devices within the network. Thus, the network administrator may not be able to retrieve the necessary information to identify the particular link that has failed. In addition, probing each device within the network to obtain the communication information can significantly increase traffic in the network, potentially reducing throughput of the network.
Moreover, a further drawback with probing the end-points of each link is that there may be any number of reasons that a particular link has failed. Thus, the network administrator generally needs to check a variety of different parameters on each device in attempting to determine whether a particular link is down.
In addition, many devices do not have SNMP agents and therefore cannot provide the network administrator with SNMP data. For example, certain devices may be configured as proprietary devices that require the use of a specific protocol to retrieve communication information. Thus, the network administrator must be able to communicate using each of the communication protocols that are required by each of the devices. Still further, certain "dumb" devices may not provide any mechanism for retrieving management information that will allow the network administrator to determine whether a particular link has failed. Thus, the network administrator may have no mechanism for communicating with certain devices of the network.
Furthermore, certain types of failures may not be detectable based on the communication parameters of each device. For example, if a Layer 3 failure occurs between devices B and C (link 106), even if the communication parameters such as the Interface tables (IF tables) for devices B and C are available to the network administrator, generally they will not provide the information that is needed to determine that link 106 has failed. Thus, even if the network administrator is able to obtain the SNMP data for each of the devices within the network, it still may be impossible to identify the specific link that has failed.
Based on the foregoing, there is a clear need for a mechanism that can determine that a failure has occurred within a network.
There is also a clear need for a mechanism that can identify and isolate a particular link that has failed within a network.
It is also desirable to have a mechanism that can identify a link failure without generating an undue amount of traffic within the network.
The foregoing needs, and other needs and objects that will become apparent from the following description, are achieved in the present invention, which comprises, in one aspect, a method for identifying link failures in a network. In one embodiment, information that represents a physical topology of the network is retrieved. A management station that is associated with the network is identified, and a path between one or more devices of the network and the management station is determined. Each path consists of one or more links connecting one or more devices within the network. For each link, a set of beyond link devices is determined that identifies only those devices that are beyond that particular link relative to the management station. A message is sent from the management station to one or more active devices within the network. Each message requests a response message to be returned to the management station from the device to which the message was sent. A set of non-responding devices, which consists of those devices that did not respond to the message sent from the management station, is determined. The set of non-responding devices is compared to each set of beyond link devices to identify link failures within the network.
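By way of illustration only, the steps summarized above may be sketched in Python as follows. The sketch is not the claimed implementation and rests on several assumptions that do not come from the description: the topology is loop-free, paths are found by a breadth-first search from the management station, the set of non-responding devices is supplied by the caller rather than gathered by actually sending messages, and a link is reported as failed only when its set of beyond link devices exactly equals the set of non-responding devices. The function names and the example topology (a hypothetical arrangement of devices A-G with device A acting as the management station) are likewise illustrative.

```python
from collections import deque


def paths_from_station(links, station):
    """Breadth-first search from the management station.

    Returns a dict mapping each reachable device to the neighbouring
    device that precedes it on its path from the station.
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    parent = {station: None}
    queue = deque([station])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return parent


def beyond_link_sets(links, station):
    """For each link on some path, collect the devices beyond that link.

    A link is keyed as (near_device, far_device) relative to the station.
    """
    parent = paths_from_station(links, station)
    beyond = {}
    for device in parent:
        # Walk back toward the station; the device lies beyond every
        # link traversed on the way.
        node = device
        while parent[node] is not None:
            beyond.setdefault((parent[node], node), set()).add(device)
            node = parent[node]
    return beyond


def find_failed_links(beyond, non_responding):
    """Report links whose beyond-link set equals the non-responding set.

    Exact equality is an assumed comparison rule for this sketch.
    """
    silent = set(non_responding)
    return [link for link, devices in beyond.items() if devices == silent]


# Hypothetical topology; device A is assumed to be the management station.
example_links = [("A", "B"), ("G", "B"), ("B", "C"),
                 ("C", "D"), ("C", "E"), ("C", "F")]
example_beyond = beyond_link_sets(example_links, "A")

# Devices C, D, E and F fail to answer, which points at the B-C link.
print(find_failed_links(example_beyond, {"C", "D", "E", "F"}))  # [('B', 'C')]
```

In this sketch, a single pass of messages from the management station is enough to narrow the failure to one link, since no per-link probing of end-points is involved.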
The invention also encompasses a computer-readable medium, a computer data signal embodied in a carrier wave, and an apparatus configured to carry out the foregoing steps.