High-quality event management has long been seen as the cornerstone of a healthy business and Information Technology (IT) operation environment. As every business becomes an electronic business (e-business), the demand from IT service customers has evolved from reactive management toward proactive management. A large body of academic research and many commercial products have attempted to achieve proactive management through root cause analysis (RCA). However, what RCA provides does not match well with the two primary goals of event management:
(1) Rapid detection of, and a fast response to, exceptional situations; and
(2) Precise and accurate identification of the problem scope (hosts, networks, people, etc.).
In response to these real-world operational demands, a new paradigm referred to as action-oriented analysis (AOA) has recently been proposed; see, e.g., Thoenen et al., “Event Relationship Networks: A Framework for Action Oriented Analysis for Event Management,” International Symposium on Integrated Network Management, 2001, the disclosure of which is incorporated by reference herein. The concepts of AOA are concretized in the Event Management Design (EMD) methodology, which contains four activities:
(1) Select the event sources;
(2) Take inventory of all events;
(3) Document event policy and processing decisions; and
(4) Construct Event Relationship Networks (ERNs) for correlation analysis.
Examining these activities, we can see that activity (1) is relatively straightforward for system administrators, since important event sources (e.g., Unix servers, NT servers, NetWare servers, hubs, routers, ATM switches, UPS systems, applications, web servers, and database servers) are very easy to identify. Activity (2) mostly relies on the quality and coverage of service providers' event source repertoires and the quality of their knowledge management. Activity (3) involves customizing policy specifications and making processing decisions for the particular operation environment based on its special requirements. Activity (4) involves constructing ERNs, an ERN being a graphical representation of how events are correlated.
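To make the notion of an ERN as a graph of correlated events concrete, the following is a minimal sketch only. It assumes that nodes are event types and that a directed edge runs from a "problem" event to an event that is one of its symptoms; the class name, method names, and event names are all hypothetical and are not taken from the EMD methodology or from any toolset.

```python
from collections import defaultdict


class EventRelationshipNetwork:
    """Hypothetical minimal ERN: event types as nodes, correlations as directed edges."""

    def __init__(self):
        self._nodes = set()
        self._edges = defaultdict(set)  # problem event -> set of symptom events

    def add_event(self, event_type):
        """Register an event type without correlating it to anything."""
        self._nodes.add(event_type)

    def add_correlation(self, problem, symptom):
        """Record that `symptom` is correlated with (assumed caused by) `problem`."""
        self._nodes.update((problem, symptom))
        self._edges[problem].add(symptom)

    def symptoms_of(self, event_type):
        """Return every event type reachable from `event_type` along edges."""
        seen = set()
        stack = [event_type]
        while stack:
            current = stack.pop()
            for nxt in self._edges.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def autonomous_events(self):
        """Return event types that take part in no correlation at all."""
        linked = set()
        for problem, symptoms in self._edges.items():
            if symptoms:
                linked.add(problem)
                linked.update(symptoms)
        return self._nodes - linked


# Illustrative event names only.
ern = EventRelationshipNetwork()
ern.add_correlation("router_down", "link_down")
ern.add_correlation("link_down", "server_unreachable")
ern.add_event("ups_battery_low")

print(sorted(ern.symptoms_of("router_down")))
print(sorted(ern.autonomous_events()))
```

In this toy network, a correlation engine that receives router_down, link_down, and server_unreachable could treat the latter two as symptoms of the first, while ups_battery_low would be handled on its own.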
IBM Global Services has developed a toolset that translates a set of ERNs, along with a default action template, into event correlation rules ready to be used in event correlation engines such as the Tivoli Enterprise Console. Therefore, activity (4) is the pivotal step of the EMD methodology. In proportion to its significance, our experience shows that activity (4) usually requires the most time and domain expertise.
ERN construction can be significantly sped up if service providers already hold corresponding ERNs as intellectual capital. However, roughly 11,000 types of event sources currently operate in business environments and might be brought under event management. Given this tremendous diversity, such an advantage should not be expected. Furthermore, the same type of event source may be configured very differently in different operation environments. Also, decisions about event processing policies may invalidate ERNs constructed under different policies.
These constraints indicate that revising and constructing ERNs is unavoidable in most cases. Consider a typical operation environment containing 20 event sources with 100 enterprise-significant event types per event source. Domain and device experts have to mentally identify all the autonomous events among the resulting 2,000 event types, work out the correlations among the rest, and document them in ERNs. The time and cost of constructing ERNs are significant.
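The scale of this example can be checked with a short calculation. The sketch below reproduces the count of event types and, as an added illustration, computes how many distinct pairs of event types exist; that pair count is only an upper bound on the pairwise relationships an expert might need to consider, not a figure from the methodology. All variable names are hypothetical.

```python
import math

event_sources = 20            # event sources in the example environment
event_types_per_source = 100  # enterprise-significant event types per source

# Total event types the experts must review.
total_event_types = event_sources * event_types_per_source

# Upper bound on distinct unordered pairs of event types that could be correlated.
candidate_pairs = math.comb(total_event_types, 2)

print(total_event_types)  # 2000
print(candidate_pairs)    # 1999000
```

Even if only a small fraction of these pairs is ever meaningful, the size of the search space suggests why manual ERN construction is slow.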
Besides the cost of constructing ERNs, the correctness and effectiveness of ERNs also have a great impact on the performance of event management. On the one hand, incomplete ERNs cause correlation engines to fail to correlate events that are “symptoms” of the same “problem” and to initiate more notifications or actions than necessary, thus undermining the second goal of event management. On the other hand, incorrect ERNs cause correlation engines to fail to take the proper action or notify the correct people, thus violating the first goal of event management. Worst of all, ERNs can be both incomplete and incorrect. The need for a method to validate and construct ERNs based on true and complete correlations is apparent.
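The two failure modes can be illustrated with a simple set comparison. The sketch below assumes, purely for illustration, that a reference set of true correlations is somehow available and that each correlation is a (problem, symptom) pair; the function name and event names are hypothetical, and this is not a description of any particular validation method.

```python
def audit_ern(ern_edges, reference_edges):
    """Compare an ERN's correlations against an assumed reference set.

    Returns the correlations the ERN lacks (incompleteness) and the
    correlations it asserts that the reference does not (incorrectness).
    """
    ern = set(ern_edges)
    reference = set(reference_edges)
    return {
        "missing": reference - ern,   # true correlations absent from the ERN
        "spurious": ern - reference,  # correlations in the ERN that are not true
    }


# Illustrative data only.
reference = {("router_down", "link_down"), ("link_down", "server_unreachable")}
drafted = {("router_down", "link_down"), ("ups_battery_low", "link_down")}

report = audit_ern(drafted, reference)
print(report["missing"])
print(report["spurious"])
```

Here the drafted ERN is both incomplete (it omits the link between link_down and server_unreachable) and incorrect (it ties ups_battery_low to link_down), which is the worst case described above.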