1. Field of the Invention
The present invention pertains to system performance and health assessment. More particularly, this invention relates to autonomously assessing the health of a computing hardware or software element, or of a service, in a networked system using statistical analysis and probabilistic reasoning.
2. Description of the Related Art
To date, the management of a multi-element system (either a hardware system or a software system) has typically been performed by monitoring many variables of the system's operation over time and by noting the occurrence of abnormal events in the system. One prior art approach to detecting abnormal events employs predetermined static threshold values, as shown in FIG. 1. The threshold value used is typically based on experience and/or intuition.
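For illustration only, the static-threshold approach described above can be sketched as follows. This is a minimal sketch, not a reproduction of FIG. 1: the function name, the sample values, and the 0.90 threshold are all hypothetical and are chosen merely to show how a fixed, hand-picked limit flags abnormal events.

```python
from typing import List, Tuple


def find_threshold_violations(
    samples: List[float], threshold: float
) -> List[Tuple[int, float]]:
    """Return (index, value) for every sample that exceeds a fixed threshold.

    The threshold never adapts to the data; it is set once by an
    administrator, typically from experience or intuition.
    """
    return [(i, v) for i, v in enumerate(samples) if v > threshold]


# Hypothetical monitored variable (e.g., a utilization ratio sampled over time).
cpu_load = [0.42, 0.55, 0.93, 0.61, 0.97]

# Hypothetical static threshold chosen by hand.
violations = find_threshold_violations(cpu_load, threshold=0.90)

for index, value in violations:
    print(f"abnormal event at sample {index}: value {value}")
```

Because the limit is fixed in advance, a value that is normal for one element or one time of day may be flagged, while a genuine anomaly below the limit goes unnoticed.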
The observed and monitored information is then presented to a system administrator. The system administrator is a human being who, based on the information received, assesses the “health” of each of the elements of the system. As is known, this assessment by the system administrator is essential in trouble-shooting existing problems, or in trying to detect failures early, before they propagate to users of the system.
Improvements have been made to this prior art approach. For example, the monitored information can now be collected by agents that monitor particular “managed objects” and report their findings to a central management console (or to a hierarchically organized set of consoles). Another example is the use of a tree-based graphical user interface (GUI), in some cases with geographic mapping, to improve the presentation of the monitored information, thus making it easier for the system administrator to navigate the managed objects. Embedded graphing packages also make it easier for the system administrator to notice trends and trend changes.
However, even with these improvements, the prior art approach is still not suitable for measuring large dynamic distributed systems with large numbers of elements. A distributed system typically operates in a distributed or federated computing environment. One example of such a distributed system is the Internet.
One key reason for this unsuitability is that the prior art approach requires a human system administrator to make the assessment. This requirement has a number of disadvantages. One disadvantage is that, for an always-on system, system administrators must be on duty around the clock. In addition, as the number of elements and the complexity of a monitored system increase, the system administrators typically work under increasing stress.
Another disadvantage is that health assessment is a knowledge-intensive task. Performing the assessment accurately typically requires significant experience, since the relevant patterns are learned over time. This means that companies hiring system administrators must pay higher salaries for that experience. As more and more companies migrate to the Internet, the demand for experienced system administrators grows accordingly. Indeed, it is well known that the demand for such system administrators greatly exceeds the supply.
Prior attempts have been made to address this issue. One prior attempt employs neural network technology to automatically predict upcoming system failures. However, this prior attempt does not address the issue of assessing the health of an element or service within a distributed system. Another disadvantage of this prior attempt is that system-specific training is required before the neural network system can be deployed. This prevents the prediction system from being widely adopted or applied.