With the continued focus on reducing Information Technology (IT) support costs, distributed computing environments need a method for efficiently monitoring and managing IT assets so that labor costs can be reduced. However, performance monitoring and system management of distributed computing environments have become more complicated and labor intensive because of the larger number of users, geographically diverse sources of data, and other factors. The systems to be monitored by IT personnel often include complex computer networks that may include numerous mainframes, minicomputers, workstations, etc.
Traditionally, computer-implemented network management systems have concentrated on providing a set of fault isolation and test functions that allow a human operator to locate, diagnose and isolate network problems. Network problems are often expressed by the target network devices or “objects” in the form of alarms or other error messages. Alarms can generally be considered “events” reported by target network devices when abnormal conditions exist. In some networks, alarms are generated autonomously, while in others the alarms are actually responses to queries (polls); both will be referred to as alarms for purposes herein. Upon receiving the alarms from the network, the network management system displays the alarms on the operator's console (such as Tivoli® Enterprise Console). One of the operator's responsibilities is to interpret the alarm and then isolate and resolve the problem associated with the alarm in the shortest possible time. The operator uses a series of test procedures to determine the exact cause of the problem. Once the cause is found, the operator may take remedial actions and then move on to the next alarm. Alarm/event processing therefore involves labor-intensive action. When events are presented to operators at a console, operators respond to those events by manually validating the events and creating incident records, and help-desk personnel convert these records into problem tickets. The problem tickets are then dispatched to the responsible entity for remedial action.
In a specific example, a network management system (NMS) displays a detected event (alarm) at an operations console, and then the operator or end system administrator manually validates that event. Event validation is a general requirement because of the amount of “noise” or false alerts generated by an enterprise-scale NMS. When performing large-scale monitoring, “false positive” detections can be caused by transient network anomalies or reporting inconsistencies. Therefore, diagnosing faults, including validating events, by manual management is time consuming and requires intimate knowledge of the distributed system.
To some extent this noise can be reduced with monitoring threshold tuning, but such tuning is not enough to ensure higher levels of noise reduction. Manual event validation has been problematic. Between different human operators, the speed and accuracy of event validation can vary widely. In periods of peak activity, a less efficient operator can experience an event log back-up, and unneeded delays are introduced into the event processing stream. These delays result in an incremental increase in the mean time to restore a faulty system. In addition to introducing such delays, the interpretation of events and event triage data can vary between operations personnel. One operator may be more knowledgeable about an event type and perform a more exhaustive manual validation process than another. This introduces inconsistencies in how events are processed and impacts service delivery.
A number of patents and published applications exist which relate to systems management and event monitoring, including U.S. Pat. Nos. 5,159,685; 5,664,093; 5,699,502; 5,777,549; 6,230,198; 6,255,943; 6,356,885; 6,401,119; 6,446,134; and 6,477,667. These systems do not show or suggest features that would eliminate manual operator intervention for validation of the status of events (alarms and objects).
Accordingly, there is a need in the art for improvements in event monitoring for system management that eliminate the need for manual intervention. There is also a need in the art for a way to reduce the display of false positives or notifications for transient events on an operator's console. The present invention is designed to address these needs.