Monitoring computer system or application performance is a complex task. Many metrics (CPU utilization, queue lengths, number of threads, etc.) can contribute to an overall measure of system performance. One approach to measuring system performance has been to identify specific metrics for which explicit numeric thresholds are set and tested at specified intervals. When a metric exceeds its threshold, an alert event is signaled, indicating an error condition.
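The fixed-threshold approach described above can be sketched as follows; the metric names, sample values, and threshold values are illustrative assumptions, not taken from the original text.

```python
# Hypothetical sketch of fixed-threshold alerting: each metric sampled in
# one interval is compared against an explicit numeric threshold, and any
# exceedance is signaled as an alert event (an error condition).
def check_metrics(samples, thresholds):
    """Return alert events for every metric sample exceeding its threshold."""
    alerts = []
    for name, value in samples.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            alerts.append((name, value, limit))  # signal an error condition
    return alerts

# Example: CPU utilization (%) and run-queue length sampled at one interval.
print(check_metrics({"cpu_util": 97.0, "queue_len": 3},
                    {"cpu_util": 90.0, "queue_len": 10}))
```

Note that each metric is tested in isolation, which is precisely the weakness discussed next.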
This approach, however, has drawbacks. Individual metrics in isolation are poor indicators of overall system state. Metric values may oscillate in a range that triggers and retriggers alarms as they cross the fixed threshold. Additionally, metric values may naturally vary over a wide range, making the selection of an appropriate threshold value very difficult. Consequently, there can be many false alarms. To reduce the number of false alarms, averaging of metric values or more complex trigger-and-reset mechanisms have been suggested.
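One way to combine the two mitigations mentioned above is to average recent samples and use a separate, lower reset threshold (a hysteresis band) so that oscillation around the trigger value does not retrigger the alarm. This is a minimal sketch under assumed window and band values; the class name and parameters are hypothetical.

```python
from collections import deque

class SmoothedAlarm:
    """Moving-average alarm with a trigger/reset hysteresis band."""

    def __init__(self, trigger, reset, window=3):
        self.trigger = trigger      # averaged value that raises the alarm
        self.reset = reset          # lower value that clears it (reset < trigger)
        self.samples = deque(maxlen=window)
        self.alarmed = False

    def update(self, value):
        """Feed one sample; return True while the averaged metric is in alarm."""
        self.samples.append(value)
        avg = sum(self.samples) / len(self.samples)
        if not self.alarmed and avg > self.trigger:
            self.alarmed = True     # fire only on the smoothed average
        elif self.alarmed and avg < self.reset:
            self.alarmed = False    # clear only below the lower reset band
        return self.alarmed
```

For example, with `trigger=90` and `reset=70`, a spike to 95 raises the alarm, and subsequent samples near 60 keep it raised until the average falls below 70, avoiding rapid retriggering.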
System health has also been monitored using a set of metrics that partition the system state into three modes, representing normal, warning, and error conditions. A “traffic light” iconic display has been used, where green indicates a normal system state, yellow indicates a warning system state, and red indicates an error system state. While this type of display may be more intuitive than the binary alert approach, it adds complexity to the monitoring system because an algorithm must be derived to compute the ternary system state from the set of performance metrics or from a stream of binary alarm events. Often these algorithms are not exposed to the end users or administrators, who are therefore unable to gauge how well the green/yellow/red state classifications reflect the underlying performance metrics.
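One simple form such a classification algorithm might take is a worst-metric rule over per-metric warning and error bands. The band values and the rule itself are illustrative assumptions; the point of writing them down explicitly is that, unlike the opaque algorithms described above, an administrator could inspect them.

```python
# Hypothetical per-metric bands: metric name -> (warning threshold, error threshold).
BANDS = {
    "cpu_util": (70.0, 90.0),
    "queue_len": (5, 20),
}

def system_state(samples):
    """Classify the overall ternary state by the worst individual metric."""
    state = "green"                 # normal system state
    for name, value in samples.items():
        warn, error = BANDS[name]
        if value >= error:
            return "red"            # any metric in its error band -> red
        if value >= warn:
            state = "yellow"        # at least one metric in its warning band
    return state
```

A call such as `system_state({"cpu_util": 75.0, "queue_len": 2})` would classify the system as yellow, since CPU utilization sits in its warning band while no metric reaches its error band.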