1. Technical Field
This invention relates to the field of event correlation and, more particularly, to a method and apparatus for efficiently determining the occurrence of and the source of problems in a complex system based on observable events. The invention has broad application to any type of complex system including computer networks, satellites, communication systems, weapons systems, complex vehicles such as spacecraft, medical diagnosis, and financial market analysis.
2. Related Information
As computer networks and other systems have become more complex, their reliability has become dependent upon the successful detection and management of problems in the system. Problems can include faults, performance degradation, intrusion attempts and other exceptional operational conditions requiring handling. Problems generate observable events, and these events can be monitored, detected, reported, analyzed and acted upon by humans or by programs. However, as systems have become more complex, the rate at which observable events occur has increased super-linearly, making problem management more difficult.
As an example, when the number of computer nodes in a network increases, the network complexity increases super-linearly with the number of nodes, with a concomitant increase in the fault rate. Compounding this problem of network complexity is fault propagation between both machines and network protocol layers; these propagated faults can generate additional events.
Automated management systems can help to cope with this increase in the number and complexity of events by (1) automating the collection and reporting of events, thereby reducing the load on human operators or programs; (2) using event correlation techniques to group distinct events, thereby compressing the event stream into a form more easily managed by human operators; (3) mapping groups of events to their underlying causes, thus reducing the time between faults and repairs; and (4) automatically correcting diagnosed problems, thereby minimizing operator intervention.
Event correlation and management techniques are a particularly important means of reducing the number of symptoms in a system which need to be analyzed and of accurately determining the number and identity of discrete problems which need to be rectified. Unless events are correlated, a single problem in a single subsystem could result in multiple, uncoordinated corrective actions. This can lead to resources wasted on duplicate efforts and to inconsistent corrective actions which result in an escalation of problems.
Conventional and previously proposed approaches to managing faults in a system have failed to fully address the increase in complexity and have failed to provide adequate performance for large systems, as outlined more particularly herein. In order to discuss these problems, it is first necessary to understand these other approaches.
Event correlation and management approaches can be generally grouped into five categories: (1) rule-based reasoning; (2) case-based reasoning; (3) reasoning with generic models; (4) probability networks; and (5) model-based reasoning. In addition, a number of different architectures have been considered to carry out event correlation and management. In order to review these approaches, the following terminology is defined:
KNOWLEDGE REPRESENTATION: The format and means for representing knowledge about the system being monitored, such as the types of network components and the network topology. Such knowledge may be stored in a hierarchical, relational, or object-oriented database.
KNOWLEDGE ACQUISITION: The methods and means for acquiring the knowledge about the system to be monitored. Ideally, knowledge is automatically obtained during system operation to minimize human resource requirements. However, in actuality much knowledge acquisition involves humans familiar with the operation and idiosyncrasies of a system.
EVENT CORRELATION: The methods and means for detecting the occurrence of exceptional events in a complex system and identifying which particular event occurred and where it occurred. The set of events which occur and can be detected in the system over a period of time will be referred to as an "event stream." It will be noted that the location of the event is not necessarily the location where it is observed, because events can propagate across related entities in a system. Although every possible reportable measurement (such as voltage level, disk error, or temperature level) could be considered to be an "event", many of these measurements do not contribute to identifying exceptional events in the system. Event correlation takes an event stream as input, detects the occurrence of exceptional events, identifies the particular events that have occurred, and reports them as output.
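The distinction between where an event originates and where it is observed can be illustrated with a minimal sketch. The entity names, the PROPAGATES_TO map, and the functions reachable and candidate_sources are hypothetical and chosen only for illustration; this is not the method of the invention, merely one simple way to reason about propagation across related entities.

```python
from collections import deque

# Hypothetical propagation map (an assumption for illustration):
# a problem at the key can propagate to each entity in the list.
PROPAGATES_TO = {
    "router-1": ["link-a", "link-b"],
    "link-a": ["host-1"],
    "link-b": ["host-2"],
    "host-1": [],
    "host-2": [],
}


def reachable(source):
    """Return every entity at which a problem originating at `source` could be observed."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in PROPAGATES_TO.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def candidate_sources(observed_at):
    """Return the entities whose propagation set covers every observation point."""
    observed = set(observed_at)
    return sorted(s for s in PROPAGATES_TO if observed <= reachable(s))


# Events observed at both hosts point to a single upstream source.
print(candidate_sources(["host-1", "host-2"]))
```

In this sketch, events observed at two different hosts are attributed to one upstream entity rather than being treated as two unrelated problems, which is the sense in which the observed location need not be the location of the event.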
Event correlation can take place in both the space and time dimensions. For example, two events whose sources are determined to be in the same protocol layer in the same network element may be related spatially. However, they may not be correlated if they occur on different days, because they would not be related temporally.
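Correlation in both dimensions might be sketched as follows. The Event fields, the correlate function, and the 60-second window are assumptions made for illustration only; the text above does not prescribe any particular window or grouping rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    element: str      # network element where the event source was located
    layer: str        # protocol layer of the source
    timestamp: float  # seconds since an arbitrary reference time


def correlate(events, window_seconds=60.0):
    """Group events that share a source (space) and occur close together (time).

    Two events join the same group only if they have the same element and
    layer AND the later one occurs within `window_seconds` of the previous
    event in that group. The window size is an illustrative assumption.
    """
    groups = []
    latest_group = {}  # (element, layer) -> most recent group for that source
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.element, ev.layer)
        group = latest_group.get(key)
        if group is not None and ev.timestamp - group[-1].timestamp <= window_seconds:
            group.append(ev)
        else:
            group = [ev]
            latest_group[key] = group
            groups.append(group)
    return groups


events = [
    Event("switch-1", "link", 0.0),
    Event("switch-1", "link", 30.0),
    Event("switch-1", "link", 86400.0),  # same source, one day later
    Event("switch-2", "link", 10.0),     # same time, different source
]
for group in correlate(events):
    print([(e.element, e.timestamp) for e in group])
```

Here the first two events are grouped because they are related both spatially and temporally, while the event one day later and the event on a different element each form a separate group.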