As computer networks, computer models, and other systems have become more complex, their reliability has become dependent upon the successful detection and management of problems in the system. Problems can include faults, performance degradation, policy violations, intrusion attempts and other exceptional operational conditions requiring handling. Problems can generate events, and these events can be monitored, detected, reported, analyzed and acted upon by humans or by programs. However, as systems have become more complex, the rate at which events occur has increased super-linearly, making problem management more difficult.
Automated management systems can help to cope with this increase in the number and complexity of events by (1) automating the collection and reporting of events, thereby reducing the load on human operators or programs; (2) using event correlation techniques to croup distinct events, thereby compressing the event stud into a form more easily managed by human operators; (3) mapping groups of events to their underlying causes, thus reducing the time between faults and repairs; and (4) automatically correcting diagnosed problems, thereby minimizing operator intervention.
Event correlation and management techniques are a particularly important method of reducing the number of symptoms in a system which need to be analyzed and accurately determining the number and identity of discrete problems which need to be rectified. Unless events are correlated, a single problem in a single subsystem could result in multiple, uncoordinated corrective actions. This can lead to wasteful resources spent on duplicate efforts and inconsistent corrective actions which result in an escalation of problem.