1. Technical Field
This invention relates to the field of event correlation and, more particularly, to a method and apparatus for efficiently determining the occurrence of and the source of problems in a complex system based on observable events. The invention has broad application to any type of complex system including computer networks, satellites, communication systems, weapons systems, complex vehicles such as spacecraft, medical diagnosis, and financial market analysis.
2. Related Information
As computer networks and other systems have become more complex, their reliability has become dependent upon the successful detection and management of problems in the system. Problems can include faults, performance degradation, intrusion attempts and other exceptional operational conditions requiring handling. Problems generate observable events, and these events can be monitored, detected, reported, analyzed and acted upon by humans or by programs. However, as systems have become more complex, the rate at which observable events occur has increased super-linearly, making problem management more difficult.
As an example, as the number of computer nodes in a network grows, the network complexity increases super-linearly, with a concomitant increase in the fault rate. Compounding this problem of network complexity is fault propagation both between machines and between network protocol layers; these propagated faults can generate additional events.
Automated management systems can help to cope with this increase in the number and complexity of events by (1) automating the collection and reporting of events, thereby reducing the load on human operators or programs; (2) using event correlation techniques to group distinct events, thereby compressing the event stream into a form more easily managed by human operators; (3) mapping groups of events to their underlying causes, thus reducing the time between faults and repairs; and (4) automatically correcting diagnosed problems, thereby minimizing operator intervention.
Event correlation and management techniques are particularly important for reducing the number of symptoms in a system that need to be analyzed and for accurately determining the number and identity of discrete problems that need to be rectified. Unless events are correlated, a single problem in a single subsystem could result in multiple, uncoordinated corrective actions. This can lead to resources wasted on duplicate efforts and to inconsistent corrective actions that escalate the problems.
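The grouping of events and the mapping of groups to underlying causes can be illustrated with a minimal sketch. It is not the method of the invention, nor any of the approaches reviewed below; it is a plain lookup table, shown only to make concrete how several symptoms collapse into one diagnosed problem. The symptom names, the problem names, the table SYMPTOM_TO_PROBLEM and the function correlate are all hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Hypothetical causality table (an assumption for illustration only):
# each observable symptom is mapped to the single problem assumed to cause it.
SYMPTOM_TO_PROBLEM = {
    "link_down": "router_failure",
    "ping_timeout": "router_failure",
    "session_reset": "router_failure",
    "disk_latency_high": "disk_degraded",
}


def correlate(events):
    """Group observed events by the problem assumed to have caused them.

    Events with no entry in the table are collected under "unknown".
    """
    groups = defaultdict(list)
    for event in events:
        problem = SYMPTOM_TO_PROBLEM.get(event, "unknown")
        groups[problem].append(event)
    return dict(groups)


# Four raw events arrive, but they reduce to two problems to act upon.
observed = ["link_down", "ping_timeout", "disk_latency_high", "session_reset"]
diagnosis = correlate(observed)
```

In this sketch, three of the four events are attributed to one assumed router failure, so a single corrective action is scheduled for them rather than three uncoordinated ones. A fixed one-to-one table of this kind cannot handle symptoms shared by several problems or lost events, which is part of what the approaches discussed below attempt to address.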
Conventional and previously proposed approaches to managing faults in a system have failed to fully address the increase in complexity and have failed to provide adequate performance for large systems, as outlined more particularly herein. In order to discuss these problems, it is first necessary to understand these other approaches.
Event correlation and management approaches can be generally grouped into five categories: (1) rule-based reasoning; (2) case-based reasoning; (3) reasoning with generic models; (4) probability networks; and (5) model-based reasoning. In addition, a number of different architectures have been considered to carry out event correlation and management. In order to review these approaches, the following terminology is defined: