The invention relates to the field of network administration, and in particular to the field of software tools for assisting network management through diagnosis of network problems so that appropriate enhancements and repairs may be made.
Many voice and data networks occasionally have problems. These problems may include a noisy or failed link; an overloaded link; a damaged cable, repeater, switch, router, or channel bank causing multiple failed links; or a crashed or overloaded server. A link is a channel interconnecting two or more nodes of the network. Networks may also have software problems: a packet may have an incorrect format or be excessively delayed, or packets may arrive out of order, be corrupt, or be missing. Routing tables may become corrupt, and critical name-servers may crash. Problems may be intermittent or continuous. Intermittent problems tend to be more difficult to diagnose than continuously present problems.
Networks are often distributed over wide areas, frequently involving repeaters, wires, or fibers in remote locations, and involve a high degree of concurrency. The wide distribution of network hardware and software makes diagnosis difficult because running diagnostic tests may require extensive travel, delay, and expense.
Many companies operating large networks have installed network operations centers where network managers monitor network activity by observing reported alarms. These managers attempt to relate the alarms to common causes and dispatch appropriate repair personnel to the probable location of the fault inducing the alarms.
Y. Nygate, in Event Correlation using Rule and Object Based Techniques, Proceeding of the fourth international symposium on integrated network management, Chapman and Hall, London, 1995, pp 279 to 289, reports that typical network operations centers receive thousands of alarms each hour. This large number of alarms results because a single fault in a data or telecommunications network frequently induces the reporting of many alarms to network operators. For example, a failed repeater may cause alarms at the channel banks, nodes, or hosts at each end of the affected link through the repeater, as well as alarms from the switches, routers, and servers attempting to route data over the link. Each server may also generate alarms at the hardware level and at various higher levels; for example, a server running TCP/IP on Ethernet may generate an error at the Ethernet level, then at the IP level, then at the TCP level, and again at the application level.
Alarms from higher levels of protocol may be generated by network nodes not directly connected to a problem node. For example, alarms may arise from a TCP or application layer on node A of a network, where node A connects only to node B, which routes packets on to node C, when C's connection to node D fails, where node D was being used by an application on node A.
Alarms may be routed to a network management center through the network itself, if sufficient nodes and links remain operational to transport them, or through a separate control and monitor data network. The telephone network, for example, frequently passes control information relating to telephone calls, such as the number dialed and the originating number, over a control and monitor data network separate from the trunks over which calls are routed. It is known that many networks have potential failure modes such that some alarms will be unable to reach the network management center.
Loss of one link may cause overloading of remaining links or servers in the network, causing additional alarms for slow or denied service. The multitude of alarms can easily obscure the real cause of a fault, or mask individual faults in the profusion of alarms caused by other problems. This phenomenon increases the skill, travel, and time needed to resolve failures.
The number of alarms reported from a single fault tends to increase with the complexity of the network, which increases at a rate greater than linear with the increase in the number of nodes. For example, the complexity of the network of networks known as the Internet, and the potential for reported faults, increases exponentially with the number of servers and computers connected to it, or to the networks connected to the Internet.
Event correlation is the process of efficiently determining the occurrence of and the source of problems in a complex system or network based on observable events.
Yemini, in U.S. Pat. No. 5,661,668, used a rule-based expert system for event correlation. The approach of Yemini requires construction of a causality matrix, which relates observable symptoms to likely problems in the system. A weakness of this approach is that a tremendous amount of expertise is needed to cover all cases and, as the number of network nodes increases, and with it the number of permutations and combinations, the complexity of the causality matrix increases exponentially.
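The idea of a causality matrix can be illustrated with a minimal sketch. The sketch below is not the method of the cited patent; it only assumes that each problem is associated with a set of symptoms it is expected to cause, and ranks problems by how closely that set matches the observed symptoms. All problem and symptom names, and the function rank_problems, are hypothetical.

```python
# Hypothetical causality matrix: each problem maps to the set of
# symptoms it is expected to cause. All names are illustrative only.
CAUSALITY = {
    "link_AB_down": {"A_timeout", "B_timeout"},
    "link_BC_down": {"B_timeout", "C_timeout"},
    "server_C_crash": {"C_timeout", "C_no_heartbeat"},
}


def rank_problems(observed):
    """Rank problems by how closely their expected symptoms match `observed`.

    The score is the size of the symmetric difference between expected
    and observed symptom sets (an assumed, simple mismatch measure);
    a smaller score means a closer match.
    """
    scores = []
    for problem, expected in CAUSALITY.items():
        mismatch = len(expected ^ observed)
        scores.append((mismatch, problem))
    return [problem for _, problem in sorted(scores)]


best_guess = rank_problems({"C_timeout", "C_no_heartbeat"})[0]
```

Even in this toy form, every problem must be enumerated by hand together with its symptoms, which suggests why the matrix becomes hard to maintain as the number of nodes grows.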
A first node in a network may run one or several processes. These processes run concurrently with, and interact with, additional processes running on each node with which the first node communicates.
One of the small minority of computer languages capable of specifying and modeling multiple concurrent processes and their interactions is C. A. R. Hoare's Communicating Sequential Processes, or "CSP", as described in M. Hinchey and S. Jarvis, Concurrent Systems: Formal Development in CSP, The McGraw-Hill International Series in Software Engineering, London, 1995. The CSP language allows a programmer to formally describe the response of processes to stimuli and how those processes are interconnected. It is known that timed derivatives of CSP can be used to model network processes; Chapter 7 of Hinchey, et al., proposes using a CSP model to verify a reliable network protocol.
Timed CSP processes may be used to model network node behavior implemented as logic gates as well as processes running on a processor in a network node. For purposes of this application, a process running on a network node may be implemented as logic gates, as a process running in firmware on a microprocessor, microcontroller, or CPU chip, or as a hybrid of the two.
Other languages that provide for concurrency may be used to model network node behavior. For example, Verilog and VHDL both provide for concurrent execution of multiple modules and for communication between modules. Further, Occam incorporates an implementation of Timed CSP.
With the recent and continuing expansion of computing, data communications, and communications networks, the volume of repetitive, duplicate, and superfluous alarm messages reaching network operations centers makes understanding root causes of network problems difficult, and has potential to overwhelm system operators. It is desirable that these alarm messages be automatically correlated to make root causes of network problems more apparent to the operators than with prior event correlation tools.
Each node A in a network has a model, based upon a formal specification written in a formal specification language such as a timed derivative of CSP, of the expected behavior of each node B to which it is connected. Each node of the network also has an alarm monitor, again written in a formal specification language, which monitors at least one process running on that node. The model or models of expected behavior of each node B are compared with the actual responses of that node B by an additional alarm monitor.
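As a rough illustration of comparing a neighbor's actual responses against a model of its expected behavior, the following sketch uses Python rather than a timed derivative of CSP. It assumes a single, very simple timing rule: node B must answer each request from node A within a fixed deadline. The class name, method names, alarm strings, and the deadline rule itself are assumptions made for the example and do not come from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class ExpectedBehaviorModel:
    """Hypothetical model held by node A of a neighboring node B.

    Assumed rule: B must answer each request within `deadline` time units.
    """
    deadline: float
    pending: dict = field(default_factory=dict)  # request id -> time sent

    def on_request(self, req_id, now):
        """Record that a request was sent to the neighbor at time `now`."""
        self.pending[req_id] = now

    def on_response(self, req_id, now):
        """Return an alarm string if the response violates the model, else None."""
        sent = self.pending.pop(req_id, None)
        if sent is None:
            return f"unexpected response {req_id}"
        if now - sent > self.deadline:
            return f"late response {req_id}"
        return None

    def check_timeouts(self, now):
        """Return alarms for requests still unanswered past the deadline."""
        overdue = sorted(r for r, t in self.pending.items() if now - t > self.deadline)
        for r in overdue:
            del self.pending[r]
        return [f"no response {r}" for r in overdue]
```

Timestamps are passed in explicitly so that the sketch stays deterministic; a monitor running on a real node would presumably take them from a local clock.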
Alarms from the alarm monitors are filtered by a local alarm filter in the node, such that the node reports what it believes to be the root cause of a failure to a network management center. In this way the number of meaningless alarms generated and reported to the network management center is reduced and the accuracy of the alarms is improved.
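One possible form of local filtering is sketched below. It assumes, purely for illustration, that each alarm is tagged with the peer node and the protocol layer at which it was raised, and that the lowest-layer alarm for a given peer is the most likely root cause. The tuple format, the layer ordering, and the function filter_local_alarms are hypothetical; the specification does not prescribe this heuristic.

```python
# Assumed layer ordering, lowest first, following the
# Ethernet / IP / TCP / application example given earlier.
LAYER_ORDER = {"ethernet": 0, "ip": 1, "tcp": 2, "application": 3}


def filter_local_alarms(alarms):
    """Keep one alarm per peer: the one raised at the lowest protocol layer.

    `alarms` is a list of (peer, layer, message) tuples, a hypothetical
    format chosen for this sketch.
    """
    root = {}
    for peer, layer, message in alarms:
        current = root.get(peer)
        if current is None or LAYER_ORDER[layer] < LAYER_ORDER[current[1]]:
            root[peer] = (peer, layer, message)
    return sorted(root.values())


local_alarms = [
    ("B", "tcp", "retransmit limit"),
    ("B", "ethernet", "carrier lost"),
    ("B", "application", "request failed"),
    ("C", "ip", "unreachable"),
]
reported = filter_local_alarms(local_alarms)
```

In this example four local alarms collapse to two reports, one per affected peer.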
Alarms reported to the network management center are filtered again by an intelligent event correlation utility to reduce redundant fault information before presentation to network administration staff.
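The reduction of redundant fault information at the network management center might, in its simplest form, amount to grouping reports that name the same suspected cause. The sketch below assumes each report is a (reporting node, suspected cause) pair and merges duplicates; the report format and the function correlate_reports are hypothetical, and an actual event correlation utility would likely apply considerably more knowledge of the network.

```python
from collections import defaultdict


def correlate_reports(reports):
    """Group (reporting_node, suspected_cause) pairs by suspected cause.

    Returns a list of (cause, sorted reporting nodes), with the cause
    named by the most nodes first and ties broken alphabetically.
    """
    by_cause = defaultdict(set)
    for node, cause in reports:
        by_cause[cause].add(node)
    return sorted(
        ((cause, sorted(nodes)) for cause, nodes in by_cause.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )


incoming = [
    ("A", "link C-D down"),
    ("B", "link C-D down"),
    ("C", "link C-D down"),
    ("E", "server F overloaded"),
    ("A", "link C-D down"),
]
summary = correlate_reports(incoming)
```

Here five incoming reports are presented to staff as two entries, each listing the nodes that reported it.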