There are at least two types of network problems: outages and poor performance. An “outage” usually refers to an unavailable resource, and “poor performance” typically refers to responsiveness of the system to the user that is not within the range set forth in an SLA (service level agreement) or other requirement. Each type of problem may have any of several kinds of causes, for example, physical problems, logical problems, or capacity problems. These causes may be further characterized by various conditions. A “physical problem” is typically some piece of hardware that is broken and is either in a failed state or failing intermittently. A “logical problem” typically refers to software or firmware not working as intended because of a flaw in design, configuration, or customization. Further, a “capacity problem” typically implies that a mathematical threshold (actual or artificially limited) in a component or across a set of components has been exceeded in such a way as to adversely affect availability or performance.
Problem management tools have been developed and have been strongest in identifying individual components, as opposed to system-wide conditions, that are contributing to problems. This should not be surprising, as recognition of a problem state caused by an individual component is usually significantly easier than recognition of one caused by a problem ranging across multiple components. The two types of tools most prevalently used for network management illustrate this point: console-based tools with iconic displays of the network, such as IBM's NetView™ and Hewlett-Packard's OpenView™, and MIB control block reading tools, such as Concord's e-Health™ and Lucent's Vitalnet™.
NetView™ and Hewlett-Packard OpenView™ were initially developed in the 1980s and remain cornerstones of network management. These tools, as a matter of routine, poll the network to discover the network's devices along with the communications links connecting the devices. Each discovered device is often represented by an icon, and each connection is usually represented by a line, on a management console display that then depicts the network. The icons and lines are frequently colored red, amber, or green (hence the name “RAG” display for the console) depending on whether the status of the device is down (red), either unknown or functional but compromised (amber), or functioning normally (green). These tools continue polling devices after discovering them, so that the RAG display may change as device reachability changes, thereby providing the operations staff with rapid, easily recognizable notification of changes in network device status.
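The status-to-color mapping described above can be illustrated with a minimal sketch. It does not reflect how NetView™ or OpenView™ are implemented; the names `DeviceStatus`, `rag_color`, and `poll_once`, and the caller-supplied `probe` function, are hypothetical and introduced only for illustration.

```python
from enum import Enum
from typing import Callable, Dict, Iterable


class DeviceStatus(Enum):
    """Hypothetical device states, following the four conditions named in the text."""
    DOWN = "down"
    UNKNOWN = "unknown"
    COMPROMISED = "compromised"  # functional but compromised
    NORMAL = "normal"


def rag_color(status: DeviceStatus) -> str:
    """Map a device status to the red/amber/green color described in the text."""
    if status is DeviceStatus.DOWN:
        return "red"
    if status in (DeviceStatus.UNKNOWN, DeviceStatus.COMPROMISED):
        return "amber"
    return "green"


def poll_once(devices: Iterable[str],
              probe: Callable[[str], DeviceStatus]) -> Dict[str, str]:
    """Run one polling cycle: probe every known device and return its display color.

    `probe` is an assumption: any function that reports a device's current status.
    """
    return {name: rag_color(probe(name)) for name in devices}


if __name__ == "__main__":
    # Simulated statuses stand in for real reachability tests.
    simulated = {
        "router-1": DeviceStatus.NORMAL,
        "switch-7": DeviceStatus.UNKNOWN,
        "router-3": DeviceStatus.DOWN,
    }
    print(poll_once(simulated, simulated.__getitem__))
```

Calling `poll_once` repeatedly corresponds to the continued polling described above: each cycle yields a fresh color per device, so a display driven by it would change as reachability changes.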
In the 1990s, as microprocessors became less costly and more powerful, it became possible to embed additional intelligence and storage into network devices. As a result, it became possible to make network devices more “self-aware” in the sense that they could recognize their own internal state (such as their internal processor and storage condition and utilization as well as their network ports' conditions and utilizations). Moreover, this newly available self-awareness data was formally organized by standards bodies in device control blocks known as MIBs (Management Information Bases).
At the same time, peer-to-peer protocols such as TCP and APPC were proliferating, which not only allowed the NetView™ and OpenView™ tools to more easily retrieve this new and additional data from the network devices, but also allowed the newly intelligent network devices to send unsolicited, important status information to the management tools for even faster problem notification to the operations staff. As the 1990s came to a close and the millennium passed, network and systems management research and development staffs continued along the path of improving systems management by enhancing the microprocessor-based self-awareness of devices, even formalizing the discipline and calling it “autonomics.”
Tools such as e-Health™ and Vitalnet™ exhibit a similar development history to that of the console tools. Like the console tools, e-Health™ and Vitalnet™ are capable of retrieving MIB data. However, unlike the console tools, which are intended to provide real-time management of the network, these tools are typically used for trend reporting. The classic use of these tools has been, and continues to be, developing “heat map” reports. These reports identify network links whose utilization exceeds some pre-set threshold value over some specified period of time. Usually the purpose of the heat map report is twofold: first, to identify utilization hot spots that may be causes of poor performance; and second, to identify links that may require a speed upgrade, especially since such upgrades often need to be ordered and planned for well in advance of actual installation.
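A heat map report of this kind can be sketched as follows. This is an illustration under stated assumptions, not the method used by e-Health™ or Vitalnet™: utilization samples are assumed to be fractions between 0 and 1 collected over the reporting period, “exceeds the threshold over the period” is interpreted here as the mean of those samples exceeding the threshold, and the function name `heat_map` and the default threshold of 0.7 are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def heat_map(samples: Dict[str, Sequence[float]],
             threshold: float = 0.7) -> List[Tuple[str, float]]:
    """Return (link, mean utilization) pairs for links above the threshold.

    `samples` maps a link name to its utilization samples for the period.
    Results are sorted with the hottest link first. Links with no samples
    are skipped, since no utilization can be computed for them.
    """
    hot = []
    for link, values in samples.items():
        if not values:
            continue
        average = mean(values)
        if average > threshold:
            hot.append((link, average))
    return sorted(hot, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Illustrative data only; real values would come from MIB counters.
    period = {
        "NYC-CHI": [0.5, 1.0],    # mean 0.75, above threshold
        "CHI-DEN": [0.25, 0.25],  # mean 0.25, below threshold
        "DEN-SFO": [1.0, 1.0],    # mean 1.0, above threshold
    }
    for link, utilization in heat_map(period):
        print(f"{link}: {utilization:.0%}")
```

Other interpretations are equally plausible, for example flagging a link when a given share of its samples exceeds the threshold; the mean is used here only because it is the simplest to state.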
One additional class of problems is “logical problems.” These include design, customization, and configuration problems. Tools to diagnose these types of problems are generally not yet developed. Those tools that are currently available and may be able to model optimized network routing, such as OpNet™, require significant amounts of time and expertise to run, and are not in general use.
Regarding the current state of the art of network management, the addition of self-diagnosing autonomics and reachability testing has improved both the success rate and speed of diagnosis of broken devices. The commonly known heat map concept, largely unchanged for a number of years, remains generally effective in recognizing and forestalling problems involving overutilization. However, these tools are not one hundred percent effective, and when problems occur which these tools fail to diagnose, resolution efforts often become chaotic and of unacceptable duration. These problems often get out of hand because, once the tools have failed to provide conclusive diagnostics for a problem, there remains no orderly procedure or method for diagnosis and resolution. The result is that “all” possible diagnostic paths are followed, which lengthens resolution time and increases the risk of compounding the problem.
In accordance with the above, the current art of network problem determination may suffer from one or more issues. For example, when there is a performance or outage problem, it may be caused by a physical problem, a logical problem, or a capacity problem. The remediation for physical and logical problems generally requires replacement or repair of the failed component, whether it is a logic board in a device or a version of software. The remediation for capacity problems generally involves adding capacity or adjusting the tuning of the system. When diagnostic tools fail to pick up the true cause of an outage or an instance of poor performance, the problem resolution effort often reverts to an across-the-board trial-and-error method of, for example, swapping cards, cleaning cables, changing software and microcode levels, adding capacity, re-tuning the system, and the like. The hope is that one of the changes (usually performed one at a time) may give positive results.
The deficiencies of this methodology are that it takes too much time, that it is risky, and that it tends to pit the hardware, software, and systems staffs against each other. More specifically, by reverting to a shotgun approach when management tools fail to lead to a proper diagnosis of an existing problem, there is a risk that the trial-and-error changes made to the system might make the situation worse. For example, when there is a hard or intermittent problem that defies diagnosis, remediation efforts may include reseating cards and swapping or cleaning cables, and each of these efforts may exacerbate the problem. Illustratively, reseating a card may result in bending connector pins, thereby worsening the problem. Similarly, cleaning cable connectors, changing microcode levels, or swapping cards risks introducing new problems into the system. Likewise, swapping cable paths to test alternate connectivity may require altering cable switching device settings, which is an error-prone procedure that can cause additional problems and worsen the situation.
The current methodology for troubleshooting typically includes, for example, examining the RAG console (e.g., NetView™ or OpenView™) for red or amber icons, which are indicative of devices that are broken or whose status is unknown, and repairing whatever is broken. It is also possible to check the heat map report of known overutilized links, and to check MIB values for overutilization along the path(s) involved in the problem. As a cure, capacity may be added or traffic reduced if a resource is overutilized. If the problem is not fixed by either of the two previous actions, then, in whatever order is approved by management, changes to the system and the metrics of the system may be made (where changes to the system include such actions as reseating, swapping, and replacing hardware and modifying software and microcode; and changes to the mathematics or metrics of the network include adding capacity or changing tuning). In the current methodology, if the use of the management console and MIB tools has failed to produce a solution, then risky and time-consuming probing and speculative system changes may be attempted, in no particular order, in hopes of a cure.
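The order of checks in this methodology can be summarized in a short sketch. It is only an illustration of the sequence described above; the function name `next_step`, the action labels, and the input shapes (a mapping of device names to RAG colors, and a collection of overutilized link names) are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def next_step(rag_colors: Dict[str, str],
              overutilized_links: Iterable[str]) -> Tuple[str, List[str]]:
    """Pick the next troubleshooting action in the order the text describes.

    1. Repair any device shown red or amber on the console.
    2. Otherwise, relieve any overutilized link (add capacity or reduce traffic).
    3. Otherwise, fall back to unordered, speculative system changes.
    """
    suspects = sorted(name for name, color in rag_colors.items()
                      if color in ("red", "amber"))
    if suspects:
        return ("repair", suspects)

    hot_links = sorted(overutilized_links)
    if hot_links:
        return ("add_capacity_or_reduce_traffic", hot_links)

    # Neither tool points at a cause: the methodology offers no ordered path.
    return ("trial_and_error", [])


if __name__ == "__main__":
    print(next_step({"router-1": "green", "router-3": "red"}, []))
    print(next_step({"router-1": "green"}, ["NYC-CHI"]))
    print(next_step({"router-1": "green"}, []))
```

The final branch returns no targets at all, which reflects the gap identified in the text: once the console and MIB tools are exhausted, nothing in the methodology indicates where to look next.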