Technical Field
This disclosure relates generally to network management tools.
Background of the Related Art
Maintaining the proper operation of various types of computerized services is usually an important but difficult task. Service administrators are often called upon to react to a service failure by identifying the problem which caused the failure and then taking steps to correct the problem. To avoid wasting resources investigating the wrong problems, administrators must make accurate assessments as to the causes of failures. Because substantial time and resources are often required, administrators must also make accurate decisions as to when to allocate resources to the tasks of identifying problems and fixing them.
A number of network management tools are available to assist administrators in completing these tasks. Network management systems discover, model and maintain knowledge bases of network devices and their connectivity, and provide mechanisms to proactively monitor the network to identify network problems. IBM® Tivoli® Netcool® is a suite of applications that allows network administrators to monitor activity on networks, to log and collect network events, including network occurrences such as alerts, alarms, or other faults, and then to report them to network administrators in graphical and text-based formats. Using such tools, administrators are able to observe network events on a real-time basis and respond to them more quickly. Such systems also typically include network service monitors of various types, which measure performance of a network so that, among other things, network resources can be shifted as needed to cover outages. A system of this type may also include a configuration management tool to automate network configuration and change management tasks. This enables network operators and administrators to enhance network security by controlling access by users, devices and commands, to maintain the real-time state of the network, and to automate routine configuration management tasks.
While these tools provide significant advantages, fault management occurs after the fact, i.e., after the issue or incident has already occurred, and for the purpose of minimizing the damage already done. Indeed, root cause analysis, although sophisticated, is designed to drive recovery automation and related approval processes before corrective commands are inserted into the affected network element (e.g., a router or switch). The problem with this approach is that the corrective action itself may cause new problems. For example, a network management tool may suggest a corrective course of action, such as instructing a network engineer to open a port, where the result of that action is a broadcast packet storm that floods the network with packets and interrupts other services. When the corrective action itself causes new issues, further operational costs and network downtime often result.
There remains a need in the art to provide new techniques for network management that address these and other deficiencies in the known art.