1. Field of the Invention
The present invention relates to networking systems. More specifically, the present invention relates to a method and apparatus for correlating network faults over time in order to perform network fault management in a networking system and to monitor the health of devices in the network.
2. Background Information
As networking systems proliferate, and as the problems involved in configuring and maintaining such networks increase, network management becomes an increasingly complex and time-consuming task. One of the primary considerations in managing networks, especially as their size increases, is fault management. Very simple types of fault management may be performed by determining when communication links and devices in the system fail. Corrective and/or diagnostic measures may then be taken, such as manual connection and reconnection of physical links, and testing and diagnosis of network devices and/or end stations.
More sophisticated techniques for fault detection and diagnosis include receiving traps or other types of alert signals from devices within the network. As faults are detected, devices can alert a centralized device, such as a computer system or other networking system management console, that such faults have occurred. These prior art techniques have suffered from some shortcomings, however. First, typical prior art fault detection and diagnostic systems include centralized consoles which receive and record fault alert signals or traps as they occur. Management tools which provide such diagnostic capability frequently rate faults received from units in the networking system according to their severity. Unless a certain number of traps of a particular type is received, according to predefined rules, no action is taken upon the traps.
A fundamental problem with these pre-existing systems is that, because functionality is concentrated in a single device in the network, network errors at various devices in the network may go undetected. Moreover, faults may occur in such a volume that actual network errors are obscured. In fact, some errors may be lost due to the large volume of faults arriving at the single device. Because a large number of faults may be generated which do not indicate any specific problem in the network (e.g., transient faults), errors indicating severe faults that actually require action may go unnoticed.
Yet another shortcoming of certain prior art systems is the inability to determine whether the detected faults are indicative of the one specific problem identified by the fault type, rather than a symptom of a different problem. Multiple faults of a specified fault type may need to be detected in order for one particular problem type to be identified. Thus, individual faults which are detected are simply "raw" error data and do not necessarily indicate an actual problem. Such faults may, given certain circumstances, indicate a specific problem, but the current art fails to adequately address the need to correlate multiple faults over time intervals in order to identify specific problems.
Another fundamental shortcoming of prior art network diagnostic techniques is that they typically rely upon a single count of the number of errors of a particular type that have occurred. This technique, known as "filtering", has fundamental shortcomings in that it does not provide for other types of measurement of faults, such as the time at which such faults occur, the number of faults within a given time period, or other more sophisticated approaches. Moreover, some prior art diagnostic systems only provide records of faults. They do not attempt, based upon other measured fault characteristics, to determine a possible reason for a fault or group of faults, and they do not offer any practical solutions to a network manager or other user.
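By way of illustration only, the difference between a simple count-based filter and a count taken within a given time period can be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from any particular prior art system: the function names `count_filter` and `windowed_filter`, the threshold of five faults, the sixty-second window, and the sample timestamps are all hypothetical.

```python
from collections import deque

def count_filter(fault_times, threshold):
    """Count-only 'filtering': fires once the total number of
    faults reaches the threshold, regardless of when they occurred."""
    return len(fault_times) >= threshold

def windowed_filter(fault_times, threshold, window):
    """Fires only if at least `threshold` faults fall within some
    span of `window` seconds (hypothetical time-aware alternative)."""
    recent = deque()  # timestamps inside the current window
    for t in sorted(fault_times):
        recent.append(t)
        # Drop timestamps that are older than the window allows.
        while t - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            return True
    return False

# Hypothetical fault timestamps, in seconds.
sporadic = [0, 3600, 7200, 10800, 14400]   # one fault per hour
burst = [100, 102, 105, 107, 109]          # five faults in nine seconds

print(count_filter(sporadic, 5), count_filter(burst, 5))
print(windowed_filter(sporadic, 5, 60), windowed_filter(burst, 5, 60))
```

Under these assumptions, the count-only filter treats the sporadic hourly faults and the nine-second burst identically, whereas the time-aware variant distinguishes them.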
Other prior art solutions to network fault management include displaying the status of network devices in a manner which allows a user to determine, at a glance, whether a given device is functioning or not. These solutions include displaying color-coded iconic representations of devices on a computer console based upon single polls or "pings" of devices in the networking system. This solution fails to take into account intermittent failures of links and/or devices which may occur only a single time (for example, at the time of the poll) or sporadically, and which do not necessarily pose any substantial threat to normal network operation. Other prior art solutions show network health in this manner using user-defined state machines which are used by the management console. Both of these solutions usually rely upon simple displays of representations of individual devices in the network rather than displays at various levels of abstraction in the system, including the port, slot, chassis and device levels. In addition, none of these prior art solutions uses network topology information to determine how a change in the health of one device causes corresponding changes in the health of each related device.
Thus, the prior art of network fault detection and network health monitoring has several shortcomings.