1. Field of the Invention
The present invention generally relates to computer system monitoring. Specifically, the present invention relates to a method, system, and program product for optimizing event monitoring filter settings and metric thresholds.
2. Related Art
As computer infrastructures continue to grow in size and functionality, system monitoring is becoming an increasingly important function. For example, many organizations utilize data centers that can include any quantity (e.g., 1000) of computer systems or machines. In such data centers, each computer system will typically process a workload. Each system has hard error states, in which no processing occurs, and soft error states, in which processing occurs slowly. In general, the soft error states tend to be more transient, meaning they tend to dissipate or self-clear. With any error state, however, there is a minimum (feasible) detection and repair time that reflects the business value of a system and the workload it processes, the simplicity of the underlying problem, the disruptiveness of the repair procedure, and other factors.
To ensure adequate operation of such infrastructures, monitoring systems are continually being developed to detect error states (also referred to as “events”). One example of a currently available monitoring system is “IBM Tivoli Monitoring” (ITM) by International Business Machines Corp. of Armonk, N.Y. One issue encountered by any monitoring system is the registration of false events. Specifically, an ideal monitoring system will register only truly positive events. The registration of false events leads to wasted time and resources. To help avoid the registration of false events, monitoring systems such as ITM can utilize filters.
Unfortunately, determining optimal settings for such filters is currently an ad hoc procedure that is not driven by a principled approach. As such, there is currently great difficulty in predicting the behavior of any given filter setting. This leaves unanswered the questions surrounding false positives (e.g., too many events that require no action) and false negatives (e.g., missed events that could require action). The typical result is a conservative filter configuration, which produces an “event flood” (i.e., a constant stream of events), most of which are ignored. Current monitoring systems are thus driven by end-user complaints, which are used to sift through the flood of events to determine which events probably require action and which events may still be safely ignored.
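The tradeoff between false positives and false negatives described above can be illustrated with a minimal sketch. The filter shown here, which registers an event only after a configurable number of consecutive samples exceed a threshold, is an assumption made purely for illustration and is not taken from ITM or any other monitoring product. The function name count_events, the sample values, and the threshold of 90 are likewise hypothetical.

```python
from typing import List


def count_events(samples: List[float], threshold: float, consecutive: int) -> int:
    """Count events registered by a hypothetical filter that fires once each
    time `consecutive` successive samples exceed `threshold`."""
    events = 0
    run = 0  # length of the current run of samples above the threshold
    for value in samples:
        if value > threshold:
            run += 1
            if run == consecutive:
                events += 1  # register one event per qualifying run
        else:
            run = 0  # the condition self-cleared; start over
    return events


# Hypothetical CPU utilization samples (percent): two transient spikes
# (lasting 1 and 2 samples) that self-clear, and one sustained overload
# lasting 5 samples.
cpu = [40, 95, 42, 41, 96, 97, 43, 40, 98, 97, 99, 96, 98, 45]

for n in (1, 2, 4, 6):
    print(f"consecutive={n}: {count_events(cpu, threshold=90, consecutive=n)} event(s)")
```

On this invented data, a setting of 1 registers three events, two of which are transient spikes requiring no action (false positives), while a setting of 6 registers none and thus misses the sustained overload (a false negative). Only an intermediate setting such as 4 isolates the one event that may require action, and nothing in an ad hoc procedure indicates in advance which setting that is.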
In view of the foregoing, there exists a need for a method, system, and program product for optimizing event monitoring filter settings so that events that truly require action can be readily identified.