The present invention relates generally to network and systems management and, more particularly, to techniques for generating correlation rules for use in detecting and resolving availability and performance problems.
With the dramatic decline in the price of hardware and software, the cost of ownership for computing devices is increasingly dominated by network and systems management. Included here are tasks such as establishing configurations, help desk support, distributing software, and ensuring the availability and performance of vital services. The latter is particularly important since inaccessible and/or slow services decrease revenues and degrade productivity.
The first step in managing availability and performance is event management. Almost all computing devices have a capability whereby the onset of an exceptional condition results in the generation of a message so that potential problems are detected before they lead to widespread service degradation. Such exceptional conditions are referred to as "events." Examples of events include: unreachable destinations, excessive central processing unit (CPU) consumption, and duplicate Internet Protocol (IP) addresses. An event message contains multiple attributes, for example: (a) the source of the event; (b) the type of event; and (c) the time at which the event was generated.
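By way of illustration only, the three attributes listed above might be held in a small record such as the following sketch. The class name `Event`, its field names, and the sample values are assumptions introduced here and are not taken from any particular event management system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Hypothetical event message carrying the three example attributes."""
    source: str          # (a) the device or host that generated the event
    event_type: str      # (b) the kind of exceptional condition
    timestamp: datetime  # (c) the time at which the event was generated


# Illustrative values only.
example = Event(
    source="host-17",
    event_type="destination_unreachable",
    timestamp=datetime(2000, 10, 1, 12, 0, 0, tzinfo=timezone.utc),
)
```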
Event messages are sent to an "event management system (EMS)." An EMS has an "adaptor" that parses the event message and translates it into a normalized form. This normalized information is then placed into an "event database." Next, the normalized event is fed into a "correlation engine" that determines actions to be taken. This determination is typically driven by correlation rules that are kept in a "rule database." Examples of processing done by correlation rules include:
1. Elimination of duplicate messages. "Duplicate" is interpreted broadly here. For example, if multiple hosts on the same local area network generate a destination-unreachable message for the same destination, then the events contain the same information.
2. Maintenance of operational state. "State" may be as simple as which devices are up (e.g., operating) and which are down (e.g., not operating). It may be more complex as well, especially for devices that have many intermediate states or special kinds of error conditions (e.g., printers).
3. Problem detection. A problem is present if one or more components of the system are not functioning properly. For example, the controller in a load balancing system may fail in such a way that new requests are always routed to the same back-end web server, a situation that can be tolerated at low loads but can lead to service degradation at high loads. Providing early detection of such situations is important in order to ensure that problems do not lead to widespread service disruptions.
4. Problem isolation. This involves determining the components that are causing the problem. For example, distributing a new release of an application that has software errors can result in problems for all end-users connecting to servers with the updated application. Other examples of causes of problems include: device failure, exceeding some internal limit (e.g., buffer capacity), and excessive resource demands.
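The adaptor and the duplicate-elimination processing described above can be sketched as follows. This is a minimal illustration under stated assumptions: the pipe-delimited raw message format, the names `NormalizedEvent`, `adapt` and `drop_duplicates`, and the 60-second window are all hypothetical and are not prescribed by any EMS.

```python
from typing import List, NamedTuple


class NormalizedEvent(NamedTuple):
    time: float       # seconds since some reference point
    source: str       # host that reported the event
    event_type: str   # kind of exceptional condition
    target: str       # what the event is about (e.g., the unreachable destination)


def adapt(raw: str) -> NormalizedEvent:
    """Parse an assumed 'time|source|type|target' message into normalized form."""
    time_s, source, event_type, target = raw.strip().split("|")
    return NormalizedEvent(float(time_s), source, event_type, target)


def drop_duplicates(events: List[NormalizedEvent], window: float = 60.0) -> List[NormalizedEvent]:
    """Keep an event only if no event with the same type and target was seen
    within `window` seconds before it. The reporting source is ignored, so
    reports from different hosts about the same destination count as duplicates."""
    last_seen = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.time):
        key = (ev.event_type, ev.target)
        previous = last_seen.get(key)
        if previous is None or ev.time - previous > window:
            kept.append(ev)
        last_seen[key] = ev.time
    return kept


raw_messages = [
    "0|hostA|destination_unreachable|10.0.0.9",
    "5|hostB|destination_unreachable|10.0.0.9",   # same destination, different host
    "7|hostA|high_cpu|hostA",
    "200|hostC|destination_unreachable|10.0.0.9", # outside the window, kept again
]
kept = drop_duplicates([adapt(m) for m in raw_messages])
```

Keying on the event type and target rather than on the source reflects the broad reading of "duplicate" given in item 1.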
The correlation engine provides automation that is essential for delivering cost effective management of complex computing environments. Existing art provides three kinds of correlation. The first employs operational policies expressed as rules, see, e.g., K. R. Milliken et al., "YES/MVS and the Automation of Operations for Large Computer Complexes," IBM Systems Journal, vol. 25, no. 2, 1986. Rules are if-then statements in which the if-part tests the values of attributes of individual events, and the then-part specifies actions to take. An example of such a rule is: "If a hub generates an excessive number of interface-down events, then check if the software loaded on the hub is compatible with its hardware release." The industry experience has been that such rules are difficult to construct, especially if they include installation-specific information.
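As a rough sketch, the example hub rule could be encoded as an if-part and a then-part along the following lines. The `Rule` class, the helper names, and in particular the limit of three events standing in for "excessive" are assumptions made for illustration; they do not reflect the YES/MVS rule language.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple

Event = Tuple[str, str]  # (source, event_type); a simplified, assumed event form


@dataclass
class Rule:
    condition: Callable[[List[Event]], List[str]]  # if-part: sources that match
    action: Callable[[str], str]                   # then-part: action per source


def excessive_interface_down(limit: int) -> Callable[[List[Event]], List[str]]:
    """Build an if-part that matches sources with more than `limit` interface-down events."""
    def condition(events: List[Event]) -> List[str]:
        counts = Counter(src for src, kind in events if kind == "interface_down")
        return sorted(src for src, n in counts.items() if n > limit)
    return condition


def check_compatibility(source: str) -> str:
    """Then-part: here the action is only described as text."""
    return f"check software/hardware compatibility on {source}"


def apply_rule(rule: Rule, events: List[Event]) -> List[str]:
    return [rule.action(src) for src in rule.condition(events)]


# The limit of 3 is an arbitrary, installation-specific assumption.
hub_rule = Rule(excessive_interface_down(limit=3), check_compatibility)
events = ([("hub1", "interface_down")] * 4
          + [("hub2", "interface_down")] * 2
          + [("hub1", "high_cpu")])
actions = apply_rule(hub_rule, events)
```

The need to pick a value such as the limit above is one instance of the installation-specific information that makes such rules hard to write.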
Another approach has been developed by SMARTS (Systems Management Arts) based on the concept of a code book that matches a repertoire of known problems with event sequences observed during operation. This is described in U.S. Pat. No. 5,661,668 issued to Yemini et al. on Aug. 26, 1997 and entitled "Apparatus and Method for Analyzing and Correlating Events in a System Using a Causality Matrix." Here, operational policies are models of problems and symptoms. Thus, accommodating new problems requires properly modeling their symptoms and incorporating their signatures into a code book. In theory, this approach can accommodate installation-specific problems. However, doing so in practice is difficult because of the high level of sophistication required to encode installation-specific knowledge into rules.
Recently, a third approach to event correlation has been proposed by Computer Associates International called "Neugents." This approach trains a neural network to predict future occurrences of events based on factors characterizing their occurrence in historical data. Typically, events are specified based on thresholds, such as CPU utilization exceeding 90%. The policy execution system uses the neural network to determine the likelihood of one of the previously specified events occurring at some time in the future. While this technique can provide advance knowledge of the occurrence of an event, it still requires specifying the events themselves. At a minimum, such a specification requires detailing the following:
1. The variable measured (e.g., CPU utilization);
2. The directional change considered (e.g., too large); and
3. The threshold value (e.g., 90%).
The last item can be obtained automatically from examining representative historical data. Further, graphical user interfaces can provide a mechanism to input the information in items (2) and (3). However, it is often very difficult for installations to choose which variables should be measured and the directional change that constitutes an exceptional situation.
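The three-part event specification above, together with one possible way of deriving the threshold from historical data, can be sketched as follows. The names `ThresholdSpec` and `threshold_from_history` are hypothetical, and the use of the 90th (or 10th) percentile is an assumption made here; the text does not say how the threshold is obtained.

```python
import statistics
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ThresholdSpec:
    variable: str     # 1. the variable measured
    direction: str    # 2. the directional change considered: "above" or "below"
    threshold: float  # 3. the threshold value

    def is_event(self, value: float) -> bool:
        """Return True if the measured value crosses the threshold in the given direction."""
        if self.direction == "above":
            return value > self.threshold
        return value < self.threshold


def threshold_from_history(samples: Sequence[float], direction: str) -> float:
    """Assumed heuristic: take the top or bottom decile boundary of past measurements."""
    deciles = statistics.quantiles(samples, n=10, method="inclusive")  # 9 cut points
    return deciles[-1] if direction == "above" else deciles[0]


# Illustrative history: CPU utilization samples from 0% to 100%.
history = [float(x) for x in range(101)]
spec = ThresholdSpec(
    variable="cpu_utilization",
    direction="above",
    threshold=threshold_from_history(history, "above"),
)
```

Only the third field is computed here; the choice of variable and direction still has to come from someone who knows the installation, which is the difficulty noted above.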
To summarize, the above-described existing art for event management systems is of three types. The first type (e.g., as in the K. R. Milliken et al. article, 1986) requires that correlation rules be specified by experts, a process that is time-consuming and expensive. The second type (e.g., as in the Yemini et al. patent) reduces the involvement of experts but only for aspects of event management that share broad commonalities (e.g., IP connectivity). The third type (e.g., Computer Associates International's Neugent software, 1999) attempts to automate the construction of correlation rules for a broader range of management areas. However, to date, this has not been done in a manner that provides for customization by experts, especially in a way that avoids dealing with low-level details (e.g., specific threshold values, the choice of measurement values, and directional changes of interest for these variables).
Other work relating to the construction of correlation rules includes: (a) statistical process control, which provides a way to set baseline levels of continuously operating machines, e.g., D. M. Thompson et al., "Examination of the Potential Role of the Internet in Distributed SPC and Quality Systems," Quality and Reliability Engineering International, vol. 16, no. 1, 2000; (b) visual programming for rule-based systems, which overcomes some of the syntactic problems of rule construction, e.g., W. Mueller et al., "A Visual Framework for the Scripting of Parallel Agents," IEEE International Symposium on Visual Languages, Seattle, Wash., September 2000; and (c) event management design, which provides a process driven by human experts to construct correlation rules, e.g., D. Thoenen et al., "Event Relationship Networks: A Framework for Action Oriented Analysis in Event Management," IBM Research Report RC 21843, October 2000.
The present invention addresses the problem of decision support for constructing correlation rules for event management. More specifically, the invention provides techniques for systematically processing historical event data in accordance with an event cache to extract correlation rules.
In one aspect of the invention, a technique for systematically constructing one or more correlation rules for use by an event management system for managing a network with one or more computing devices comprises the following steps. First, in association with an event cache, event data representing past events associated with the network of computing devices being managed by the event management system is obtained. For example, this may involve reading the past or historical event data from an event repository into the event cache, or having the event cache simply point to the event data in the event repository. Next, a first pattern is found or detected in the obtained event data associated with the event cache. The pattern therefore includes one or more events in the obtained event data. The pattern is then classified. For example, the pattern may be classified as normal or abnormal. Then, at least one correlation rule is constructed based on the classified pattern. Lastly, in association with the event cache, the one or more events included in the pattern are replaced with a composite or cumulative event such that hierarchical patterns may be subsequently found for use in constructing further correlation rules. The composite event represents the individual events comprising the pattern. The constructed correlation rule may then be stored in a rule database for access by the event management system.
In one illustrative embodiment, the correlation rule constructing step may comprise the steps of automatically learning at least one predicate of the correlation rule from the pattern found, and then adding at least one corresponding action to the automatically learned predicate, based on the classifying step, to form the correlation rule. This automatic learning process may also utilize positive examples and negative examples of the one or more events included in a detected and classified pattern.
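One simple way to picture learning a predicate from positive and negative examples is sketched below. It uses a Find-S-style conjunction (the attribute values shared by all positive examples), which is a stand-in chosen here for illustration; the text does not name a learning algorithm. The function name, the attribute names and the action string are likewise assumptions.

```python
from typing import Callable, Dict, List, Tuple

EventAttrs = Dict[str, str]


def learn_predicate(
    positives: List[EventAttrs], negatives: List[EventAttrs]
) -> Tuple[EventAttrs, Callable[[EventAttrs], bool], bool]:
    """Return (shared attribute values, predicate, consistency flag).

    The predicate is the conjunction of attribute/value pairs common to every
    positive example. The flag is True if no negative example satisfies it.
    """
    common = dict(positives[0])
    for example in positives[1:]:
        common = {k: v for k, v in common.items() if example.get(k) == v}

    def predicate(event: EventAttrs) -> bool:
        return all(event.get(k) == v for k, v in common.items())

    consistent = not any(predicate(n) for n in negatives)
    return common, predicate, consistent


# Events from a pattern classified as abnormal (positives) and events that
# should not trigger the rule (negatives); all values are illustrative.
positives = [
    {"source_type": "hub", "event_type": "interface_down", "site": "east"},
    {"source_type": "hub", "event_type": "interface_down", "site": "west"},
]
negatives = [
    {"source_type": "router", "event_type": "interface_down", "site": "east"},
]
conditions, matches, consistent = learn_predicate(positives, negatives)

# Adding an action to the learned predicate forms the correlation rule.
rule = {"if": conditions, "then": "notify_operator"}
```

In this sketch the `site` attribute is dropped because the positive examples disagree on it, while the negative example is excluded by its `source_type`.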
Further, the step of finding a pattern in the event data of the event cache may comprise a user marking the event pattern in accordance with a data visualization of at least a portion of the event data associated with the event cache. In another embodiment, the step may comprise employing a data mining algorithm.
Advantageously, the steps of finding the pattern, classifying the pattern, constructing the rule and replacing the events in the pattern with a composite event may be repeated until all the event data associated with the event cache is considered. In this manner, the past or historical data in the event cache is systematically processed such that a more comprehensive set of correlation rules can be constructed. Such inventive techniques have several advantages. First, for example, rules are constructed for patterns that actually exist. Second, for example, situations that experts may be unaware of are discovered since patterns in historical data are revealed in a systematic way.
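The repeated find, classify, construct and replace procedure can be sketched as a loop over an event cache. This is a toy illustration under several assumptions: the cache is a time-ordered list of event-type strings, the pattern finder simply picks the most frequent adjacent pair (a stand-in for a data mining algorithm or user marking), and `classify` is a hypothetical function standing in for the classification step.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Pattern = Tuple[str, ...]


def find_pattern(cache: List[str]) -> Optional[Pattern]:
    """Return the most frequent adjacent pair of events if it occurs at least twice."""
    pairs = Counter(zip(cache, cache[1:]))
    if not pairs:
        return None
    pattern, count = pairs.most_common(1)[0]
    return pattern if count >= 2 else None


def replace_with_composite(cache: List[str], pattern: Pattern) -> List[str]:
    """Replace each occurrence of the pattern with a single composite event."""
    composite = "+".join(pattern)
    out, i = [], 0
    while i < len(cache):
        if tuple(cache[i:i + len(pattern)]) == pattern:
            out.append(composite)
            i += len(pattern)
        else:
            out.append(cache[i])
            i += 1
    return out


def build_rules(cache: List[str], classify: Callable[[Pattern], str]):
    """Repeat find/classify/construct/replace until no further pattern is found."""
    rules: List[Dict[str, object]] = []
    while (pattern := find_pattern(cache)) is not None:
        label = classify(pattern)
        action = "notify_operator" if label == "abnormal" else "suppress"
        rules.append({"if_sequence": pattern, "label": label, "then": action})
        cache = replace_with_composite(cache, pattern)
    return rules, cache


def classify(pattern: Pattern) -> str:
    # Hypothetical classification: any pattern involving high CPU is abnormal.
    return "abnormal" if any("high_cpu" in part for part in pattern) else "normal"


cache = ["link_down", "link_up", "high_cpu", "disk_full",
         "link_down", "link_up", "high_cpu", "reboot",
         "link_down", "link_up"]
rules, final_cache = build_rules(cache, classify)
```

In this run the second rule is built from a pattern whose first element is the composite event produced by the first pass, which is the hierarchical behavior that the replacement step is meant to enable.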
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.