The present invention relates generally to network and systems management and, more particularly, to detecting and resolving availability and performance problems.
With the dramatic decline in the price of hardware and software, the cost of ownership for computing devices is increasingly dominated by network and systems management. Included here are tasks such as establishing configurations, help desk support, distributing software, and ensuring the availability and performance of vital services. The latter is particularly important since inaccessible and/or slow services decrease revenues and degrade productivity.
The first step in managing availability and performance is event management. Almost all computing devices have a capability whereby the onset of an exceptional condition results in the generation of a message so that potential problems are detected before they lead to widespread service degradation. Such exceptional conditions are referred to as "events." Examples of situations in which events are generated include: unreachable destinations, excessive CPU consumption, and duplicate IP addresses. An event message contains multiple attributes, especially: (a) the source of the event, (b) type of event, and (c) the time at which the event was generated.
Event messages are sent to an "event management system (EMS)." In existing art, such systems are policy-driven, which means that external descriptions are used to specify the event patterns for which actions are taken. Thus, an EMS has separate subsystems for policy execution and policy authoring. The latter provides a means for the operations staff to construct policies. The former provides for the processing of event messages. In existing art, an EMS has repositories for policies, events, and configuration information used in event management.
Upon arrival of an event message, the policy execution system parses the message to translate it into a normalized form (e.g., by isolating fields instead of having a single text string). This normalized information is then placed into an event repository. Next, the normalized event is fed into a "correlation engine" that processes events as specified by operational policies that address considerations such as:
1. Elimination of duplicate messages. Duplicate is interpreted broadly here. For example, if multiple hosts on the same local area network generate a destination unreachable message for the same destination, then the events contain the same information.
2. Maintenance of operational state. State may be as simple as which devices are up and which are down. It may be more complex as well, especially for devices that have many intermediate states or special kinds of error conditions (e.g., printers).
3. Problem detection. A problem is present if the services cannot be delivered in accordance with a service level agreement (which may be formal or informal). This could be the result of a device failure, exceeding some internal limit (e.g., buffer capacity), or excessive resource demands.
4. Problem isolation. This involves determining the components that are causing the problem. For example, distributing a new release of an application that has software errors can result in problems for all end-users connecting to servers with the updated application.
Items (1) and (2) are, in some sense, intermediate steps to (3) and (4). Thus, we focus on the latter two.
The correlation engine provides automation that is essential for delivering cost effective management of complex computing environments. Existing art provides three kinds of correlation. The first employs operational policies expressed as rules, e.g., K. R. Milliken et al., "YES/MVS and the Automation of Operations for Large Computer Complexes," IBM Systems Journal, Vol. 25, No. 2, 1986. Rules are if-then statements in which the if-part tests the values of attributes of individual events, and the then-part specifies actions to take. An example of such a rule is "If multiple hosts on the same LAN cannot reach the same destination, then alert the operator that there is a connectivity problem from the LAN to the destination." The industry experience has been that such rules are difficult to construct, especially if they include installation-specific information.
Another approach has been developed by SMARTS, see, e.g., SMARTS, "About Code Book," http://www.smarts.com/codebook.html, 1999. SMARTS is based on the concept of a codebook that matches a repertoire of known problems with event sequences observed during operation. Here, operational policies are models of problems and symptoms. Thus, accommodating new problems requires properly modeling their symptoms and incorporating their signatures into the code book. In theory, this approach can accommodate installation-specific problems. However, doing so in practice is difficult because of the high level of sophistication required. Further, the SMARTS technology only applies to known problems.
Recently, a third approach to event correlation has been proposed by Computer Associates International, see, e.g., Computer Associates International, "Neugents. The Software that can Think," Jul. 16, 1999, http://www.cai.com/neugents. This approach trains a neural network to predict future occurrences of events based on the frequency of their occurrence in historical data. Typically, events are specified based on thresholds such as, for example, CPU utilization exceeding 90%. The policy execution system uses the neural network to determine the likelihood of one of the previously specified events occurring at some time in the future. While this technique can provide advance knowledge of the occurrence of an event, it still requires specifying the events themselves. At a minimum, such a specification requires detailing the following:
1. The variable measured (e.g., CPU utilization);
2. The directional change considered (e.g., too large); and
3. The threshold value (e.g., 90%).
The last item can be obtained automatically from examining representative historical data. Further, graphical user interfaces can provide a means to input the information in items (2) and (3). However, it is often very difficult for installations to choose which variables should be measured and the directional change that constitutes an exceptional situation.
To summarize, existing art uses a micro approach to event correlation. That is, existing correlation engines analyze individual events and their interrelationships. While such an approach has value, it has severe limitations as well. Foremost, existing art requires an expert to develop the operational policies that drive the analysis. As a result, it is difficult for installations to define and maintain customized operational policies.
The present invention provides systems and methods to simplify and customize the automation of event management. The invention is based on at least the following observation: big problems generate lots of events. This observation suggests a macro approach to event correlation that focuses on the rate at which events are generated rather than their detailed interrelationships.
To illustrate our approach, consider a connectivity problem that occurs between hosts on subnet 82.13.16 and the host 93.16.12.54. Existing art would detect such problems by having rules that examine the event type ("destination unreachable") and identify that the hosts generating this message are on the same subnet. In contrast, the present invention detects such problems based on the rate at which messages are generated by hosts on the subnet. An event rate threshold is obtained from historical data. If the rate exceeds this threshold, then an alarm is raised. This leads to the rule: "If event rates on a LAN exceed the LAN-specific threshold, raise an alarm."
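The rule in the preceding paragraph can be illustrated with a minimal Python sketch. It is not taken from the invention itself: the function names, the treatment of the first three octets as the subnet, the fixed observation window, and the threshold values are all assumptions made for illustration.

```python
from collections import Counter


def subnet_of(ip):
    """Return the first three octets, e.g. '82.13.16.7' -> '82.13.16' (assumed subnet rule)."""
    return ip.rsplit(".", 1)[0]


def lans_over_threshold(events, window_seconds, thresholds):
    """Return (subnet, rate) pairs whose event rate exceeds the LAN-specific threshold.

    events: iterable of (source_ip, timestamp) pairs observed in one window.
    thresholds: events-per-second limit per subnet (hypothetical values).
    """
    counts = Counter(subnet_of(source) for source, _ in events)
    alarms = []
    for subnet, count in counts.items():
        rate = count / window_seconds
        limit = thresholds.get(subnet)
        if limit is not None and rate > limit:
            alarms.append((subnet, rate))
    return sorted(alarms)


# Hypothetical data: 30 hosts on subnet 82.13.16 each report two events in a
# 10-second window, while a host on another subnet reports a single event.
events = [("82.13.16.%d" % host, t) for host in range(1, 31) for t in (0, 5)]
events.append(("10.0.0.1", 3))
thresholds = {"82.13.16": 2.0, "10.0.0": 2.0}
alarms = lans_over_threshold(events, window_seconds=10, thresholds=thresholds)
```

Note that the sketch never inspects the event type; only the volume of events per subnet is compared with the threshold for that subnet.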
Once a problem is detected, event rates provide a way to diagnose the problem. This is achieved by exploiting the structure of the attributes of events. Consider the example in the preceding paragraph. Once an excessive event rate is detected, we want to know its cause. This can be achieved by further classifying events based on their attributes, such as event type, the kind of host (e.g., file server, domain name server), and time of day. In the example, we find that the increased event rates can be attributed to events with the type "destination unreachable." This information is obtained through automation that looks for common characteristics among events based on hierarchies of event attributes. Examples of such hierarchies include: time hierarchy, comprising hours, minutes, and seconds; and configuration hierarchy, comprising campus, subnet, and host. The present invention provides systems and methods for merging individual attribute hierarchies into a single event hierarchy. Given this merged hierarchy, techniques such as those described in U.S. Pat. No. 5,996,090 to Joseph L. Hellerstein entitled "Method and Apparatus for Quantitative Diagnosis of Performance Problems Using External Representations," the disclosure of which is incorporated by reference herein, can be applied to obtain a quantitative diagnosis for the cause of large event rates.
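As a simplified illustration of classifying events by their attributes, the following Python sketch reports, for each attribute, the value that accounts for the largest share of the events in an excessive-rate window. It is a flat, single-level breakdown only. It does not implement the merged event hierarchy or the quantitative diagnosis technique of U.S. Pat. No. 5,996,090, and the function name, attribute names, and data are hypothetical.

```python
from collections import Counter


def dominant_values(events, attributes):
    """For each attribute, return (most common value, share of all events)."""
    total = len(events)
    result = {}
    for attribute in attributes:
        value, count = Counter(e[attribute] for e in events).most_common(1)[0]
        result[attribute] = (value, count / total)
    return result


# Hypothetical events collected while the event rate was excessive.
events = (
    [{"type": "destination unreachable", "host_kind": "file server"}] * 5
    + [{"type": "destination unreachable", "host_kind": "domain name server"}] * 3
    + [{"type": "excessive CPU", "host_kind": "file server"}] * 2
)
breakdown = dominant_values(events, ["type", "host_kind"])
```

In this data, "destination unreachable" accounts for eight of the ten events, which is the kind of common characteristic the diagnosis step looks for.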
Event rate analysis uses threshold tests on event rates. As such, constructing event rate policies requires specifying: (a) the set of events to consider, (b) the directional change of interest, and (c) threshold values for event rates. Based on the premise of the analysis of the invention, the directional change of interest for (b) is larger event rates (although the invention is sufficiently flexible to accommodate other kinds of threshold violations as well). Item (c) can be obtained from representative historical data once (b) is specified. Thus, to specify detection policies using event rates only requires describing the set of events that are to be counted in the rates.
We use the term "event group" to describe a collection of events that are used to compute an event rate. Thus, in the correlation component described herein there is an event grouping component that is responsible for identifying the group or groups to which an event belongs. We use the term "event group descriptor" to indicate a way of specifying the events that are members of an event group. The present invention employs event group descriptors akin to a where-clause in a Structured Query Language (SQL) query. Examples of event group descriptors include: events from the same subnet, events with type "destination unreachable," and events generated within a 15 second interval.
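One way to picture an event group descriptor is as a predicate over the attributes of a normalized event, much as a where-clause is a predicate over the columns of a row. The Python sketch below assumes events are dictionaries with hypothetical "source", "type", and "time" fields; the descriptor names and the representation as lambda functions are illustrative assumptions, not the notation of the invention.

```python
# Each descriptor is a predicate over a normalized event (assumed representation).
descriptors = {
    "subnet-82.13.16": lambda e: e["source"].startswith("82.13.16."),
    "dest-unreachable": lambda e: e["type"] == "destination unreachable",
    "window-0-15": lambda e: 0 <= e["time"] < 15,  # a 15 second interval
}


def groups_for(event, descriptors):
    """Return the names of all event groups to which the event belongs."""
    return sorted(name for name, matches in descriptors.items() if matches(event))


event = {"source": "82.13.16.7", "type": "destination unreachable", "time": 20}
memberships = groups_for(event, descriptors)
```

An event may belong to several groups at once; here the example event matches the subnet and event type descriptors but falls outside the time interval.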
By employing event group descriptors, we can automate the construction of the if-part of event rate policies. This is sufficient in many cases since the then-part often just consists of sending a message to the operator. The general form for the policies we consider is:
If rate of event-group-1 violates threshold-1 and . . . rate of event-group-N violates threshold-N, then . . .
Note that this is equivalent to generating new events for each threshold violated and then employing a traditional rule-based system that tests for each of these events. Further, note that "violates a threshold" is intended to be interpreted broadly to mean that the event rate violates a predetermined event rate criterion. For example, the event rate may be too large, or too small when compared to a threshold value, or the event rate may fall outside an interval or lie within an interval when compared to a threshold range.
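The general policy form and the broad reading of a threshold violation can be sketched in Python as follows. The criterion names ("above", "below", "outside", "inside"), the tuple layout of a condition, and the sample rates are assumptions introduced for illustration.

```python
def violates(rate, kind, bound):
    """Test one event rate against one criterion.

    kind is 'above' or 'below' (bound is a number), or
    'outside' or 'inside' (bound is a (low, high) range).
    """
    if kind == "above":
        return rate > bound
    if kind == "below":
        return rate < bound
    low, high = bound
    if kind == "outside":
        return rate < low or rate > high
    if kind == "inside":
        return low <= rate <= high
    raise ValueError("unknown criterion: %s" % kind)


def policy_fires(rates, conditions):
    """If-part of a policy: every (group, kind, bound) condition must be violated."""
    return all(violates(rates[group], kind, bound) for group, kind, bound in conditions)


# Hypothetical rates (events per second) for two event groups.
rates = {"lan-a": 12.0, "heartbeats": 0.1}
conditions = [("lan-a", "above", 5.0), ("heartbeats", "below", 0.5)]
fires = policy_fires(rates, conditions)
```

The then-part is left out of the sketch; in the simplest case it would send a message to the operator when `fires` is true.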
Thus, the burden that the present invention imposes on the operations staff is to specify the event group descriptors. Although this is easier to do than the requirements of existing art, we provide techniques that further simplify this task. Our observation is that the appropriate way to form groups of events is, in part, determined by information about the computer installation. For example, it is natural to group events based on the segment, LAN, and campus from which they emanate. This information forms a hierarchy for grouping hosts and hence events. Many installations have such information in a configuration database. Thus, it is straightforward to construct an engine that processes this data into a hierarchy of attribute values. Further, there are many such information sources, including host inventory (e.g., choice of OS (operating system), OS release level, OS patch level) and event type. In addition, note that given these hierarchies, it is straightforward to isolate the cause of excessive event rates using techniques such as those in the above-referenced U.S. Pat. No. 5,996,090.
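To illustrate how configuration data might be processed into a hierarchy of attribute values, the following Python sketch folds flat configuration records into a nested dictionary. The record layout, the field names, and the choice of campus, LAN, and host as levels are hypothetical; an actual configuration database would have its own schema.

```python
def build_hierarchy(rows, levels):
    """Fold flat configuration records into a nested dictionary, one level per key."""
    root = {}
    for row in rows:
        node = root
        for level in levels:
            node = node.setdefault(row[level], {})
    return root


# Hypothetical rows as they might be read from a configuration database.
rows = [
    {"campus": "north", "lan": "lan-1", "host": "h1"},
    {"campus": "north", "lan": "lan-1", "host": "h2"},
    {"campus": "north", "lan": "lan-2", "host": "h3"},
]
hierarchy = build_hierarchy(rows, levels=("campus", "lan", "host"))
```

Each inner node of the resulting tree corresponds to a candidate event group (all events from a campus, from a LAN, or from a single host).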
The present invention comprises two interrelated systems. The first is a correlation engine that executes event rate policies. The second is an authoring system whereby event rate policies are specified. These policies may have the following components:
(1) Event group descriptors, which specify the conditions for membership in an event group;
(2) Event group thresholds, which quantify what constitutes an excessive event rate for an event group;
(3) Event group hierarchies, which provide a generalization-specialization hierarchy for event groups;
(4) Event group actions, which detail the tasks to execute when the if-part of an event rate policy is satisfied.
The last component is well known art and so is not addressed in detail.
The correlation engine that executes policies that use event rates may have the following elements:
(1) grouping engine, which determines the groups to which events belong;
(2) rate detector, which determines if the rate of events for an event-group exceeds its threshold;
(3) rate diagnoser, which uses event group hierarchies to isolate the cause of excessive event rates.
The method for the correlation engine of our invention may have two parts. The first concerns the arrival of a new event, which includes the steps: (a) identifying the event groups to which an event belongs; and (b) incrementing counts for the identified groups. The second is a task that is executed periodically to check event rates and to perform diagnosis for those event groups that have excessive rates.
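The two-part method can be sketched in Python as below. The class name, the representation of descriptors as predicates, and the fixed checking period are assumptions made for illustration; the diagnosis step is omitted, and the periodic task simply returns the event groups with excessive rates.

```python
from collections import Counter


class CorrelationEngine:
    """Minimal sketch: count events per group, then check rates periodically."""

    def __init__(self, descriptors, thresholds, period_seconds):
        self.descriptors = descriptors        # group name -> predicate over an event
        self.thresholds = thresholds          # group name -> events-per-second limit
        self.period_seconds = period_seconds  # interval between periodic checks
        self.counts = Counter()

    def on_event(self, event):
        # (a) identify the groups to which the event belongs; (b) increment their counts.
        for name, matches in self.descriptors.items():
            if matches(event):
                self.counts[name] += 1

    def periodic_check(self):
        # Compute each group's rate over the period and report the excessive ones.
        excessive = {}
        for name, limit in self.thresholds.items():
            rate = self.counts[name] / self.period_seconds
            if rate > limit:
                excessive[name] = rate
        self.counts.clear()  # start a fresh counting period
        return excessive


engine = CorrelationEngine(
    descriptors={
        "subnet-82.13.16": lambda e: e["source"].startswith("82.13.16."),
        "all": lambda e: True,
    },
    thresholds={"subnet-82.13.16": 1.0, "all": 10.0},
    period_seconds=10,
)
for i in range(50):
    engine.on_event({"source": "82.13.16.%d" % (i % 30 + 1)})
for i in range(5):
    engine.on_event({"source": "10.0.0.%d" % (i + 1)})
excessive = engine.periodic_check()
```

Keeping only counters per group makes the per-event work small; the more expensive rate comparison and any diagnosis run once per period rather than once per event.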
The authoring system in the present invention may have components for:
(1) an administrative interface that aids in constructing event group descriptors, selecting thresholds, and specifying event group hierarchies;
(2) an event group and hierarchy builder that provides a way to automate the construction of group descriptors and group hierarchies; and
(3) a threshold constructor that provides automation for estimating thresholds.
The methods for the authoring system involve end-user interactions that combine automated construction of event rate policies, event group hierarchies, and event rate thresholds with manual updates to adjust what the automation produces.
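A minimal Python sketch of automated threshold estimation combined with manual adjustment follows. The estimation rule used here (mean plus three sample standard deviations of historical rates) is an assumption chosen for illustration and is not stated in this description; the function names, group names, and data are likewise hypothetical.

```python
from statistics import mean, stdev


def estimate_threshold(historical_rates, k=3.0):
    """Assumed rule: mean plus k sample standard deviations of past rates."""
    return mean(historical_rates) + k * stdev(historical_rates)


def build_thresholds(history_by_group, overrides=None, k=3.0):
    """Estimate a threshold per event group, then apply any manual overrides."""
    thresholds = {
        group: estimate_threshold(rates, k)
        for group, rates in history_by_group.items()
    }
    thresholds.update(overrides or {})
    return thresholds


# Hypothetical historical event rates (events per minute) for two event groups.
history = {
    "lan-a": [4, 6, 5, 5, 4, 6, 5, 5],
    "lan-b": [1, 2, 1, 2, 1, 2, 1, 2],
}
thresholds = build_thresholds(history, overrides={"lan-b": 10.0})
```

The override for the second group stands in for an administrator adjusting what the automation produced.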
Event rates have been employed in various ways. In U.S. Pat. No. 4,325,122 to Parks et al., an application to wind prospecting and an apparatus that efficiently integrates event counts is described. In U.S. Pat. No. 5,761,411 to Teague et al., ways to predict disk failures based on disk errors (or events) are described. U.S. Pat. No. 5,402,412 to Duffie et al. describes means for monitoring events so that they do not exceed a pre-specified rate for each user. However, none of this art concerns itself with correlation engines for event management. Nor does any of this art address the execution or authoring of event rate policies.
More specifically, event rates have been used within network and systems management. For example, Jia Jiao et al., "Minimizing the Monitoring Cost in Network Management," Integrated Network Management VI, IFIP, pp. 155-170, 1999, describes a scheme whereby polling rates are adjusted based on the rate at which events are received. M. Iguchi and S. Goto, "Detecting Malicious Activities Through Port Profiling," IEICE Trans. Inf. Syst., Vol. E82-D, No. 4, pp. 784-92, April 1999, disclose a way to detect malicious users using event rates. However, in neither case are the event rates used in operational policies. And, in neither art is there an authoring system through which administrators construct installation-specific policies aided by automation that exploits operational information such as topology and inventory.
There are at least two areas in which the present invention provides benefits. The first relates to customized event management. In existing art, providing installation customization requires specifying the events of interest (e.g., "unreachable destination," "ping timeout") and their relationships (e.g., the unreachable host does not respond to a ping). Such an approach requires considerable expertise on the part of the operations staff, a requirement that is hard to satisfy given the dearth of experts. The present invention greatly reduces the expertise required to specify operational policies for problem detection and diagnosis based on the use of event rates. The inventive systems and methods for execution of event rate policies only require specifying event groups of interest (e.g., hosts that are on the same LAN). Further, with the inventive systems and methods for authoring event rate policies, event groups can be specified automatically based on primary information sources such as topology and inventory information.
Another benefit of the invention is that problem detection and isolation can be done for situations that are not known a priori. Existing art focuses on specific problems, such as IP (Internet Protocol) connectivity and configuration errors. This is done by looking for event sequences that are signatures of these problem types. In contrast, the present invention provides systems and methods to address problems without prior knowledge of their characteristics if they are manifested by a change in event rate. Our experience with production systems has shown that problems as diverse as router configuration errors, invalid hub programs, and security intrusions can all be detected through changes in event rates.
We note in passing that the present invention may be a complement to existing art in addition to a replacement for it. Clearly, it is desirable to use prior knowledge of problems when this knowledge exists (and is fairly static). The invention extends the capability of event management automation to increase customization and to address the detection and isolation of unknown problems.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.