1. Technical Field
The present invention is directed to managing a large distributed computer enterprise environment and, more particularly, to correlating system and network events in a system having distributed monitors that use events to convey status changes in monitored objects.
2. Description of the Related Art
Companies now desire to place all of their computing resources on the company network. To this end, it is known to connect computers in a large, geographically-dispersed network environment and to manage such an environment in a distributed manner. One such management framework comprises a server that manages a number of nodes, each of which has a local object database that stores object data specific to the local node. Each managed node typically includes a management framework, comprising a number of management routines, that is capable of a relatively large number (e.g., hundreds) of simultaneous network connections to remote machines. As the number of managed nodes increases, the system maintenance problems also increase, as do the odds of a machine failure or other fault.
The problem is exacerbated in a typical enterprise as the node number rises. Of these nodes, only a small percentage are file servers, name servers, database servers, or anything but end-of-wire or "endpoint" machines. The majority of the network machines are simple personal computers ("PCs") or workstations that see little management activity during a normal day.
Thus, as the size of the distributed computing environment increases, it becomes more difficult to centrally monitor system and network events that convey status changes in various monitored objects (e.g., nodes, systems, computers, subsystems, devices and the like). In the prior art, it is known to distribute event monitor devices across machines that are being centrally managed. Such event monitors, however, typically use a full-fledged inference engine to match event data to given conditions sought to be monitored. An "inference engine" is a software engine within an expert system that draws conclusions from rules and situational facts. Implementation of the event monitor in this fashion requires significant local system resources (e.g., a large database), which is undesirable. Indeed, as noted above, it is a design goal to use only a lightweight management framework within the endpoint machines being managed.
Thus, there remains a need to provide more efficient event correlation techniques within a distributed computer environment wherein distributed monitors use events to convey status changes in monitored objects within the environment. The present invention solves this problem.
It is thus a primary object of this invention to provide a software component that may be statically or dynamically deployed into a distributed computing environment and then executed within a given execution context to examine and correlate one or more given event streams.
A more particular object of this invention is to deploy a Java-based software agent into a large distributed computing environment, which agent is then dropped into a local runtime environment to correlate a set of event streams.
A more general object of this invention is to correlate events that convey status changes in monitored objects within a distributed computing environment.
It is a further more general object of this invention to correlate events by implementing a set of simple or "low-level" correlation rules, each of which may be useful in recognizing a given pattern of one or more events indicative of a given condition sought to be monitored and/or controlled.
A still further objective of this invention is to facilitate event correlation by optimizing a relatively small set of state machines, each of which implements a given type of correlation rule. The given set of state machines comprises a fast "correlator" that inspects events in an event stream and takes some action (or perhaps remains inactive) depending on the inspection.
Another general objective of this invention is to provide resource monitoring across a distributed computer environment.
These and other objects of the invention are provided in a method of event correlation that is preferably implemented within a distributed environment having a management server and a set of managed machines. Individual managed machines may have diverse operating system environments. A given subset of the managed machines include a distributed management infrastructure. In particular, each managed machine in the given subset includes a runtime environment, which is a platform-level service that can load and execute software agents. One or more software agents are deployable within the distributed environment to facilitate management and other control tasks. A particular software agent comprises a tool for examining given event streams, each of which may be evaluated using a simple rule. The runtime environment at a particular node preferably includes a runtime engine, and a distributed monitor (DM) for carrying out monitoring tasks.
The present invention implements event correlation at a local node using the preferably Java-based software component that is deployed as an agent, for example, on demand. The preferred event correlation method operates at a particular managed machine as follows. First, a discrete set of correlation rules is established. One preferred implementation of a correlation rule is a software-based state machine implemented by a software component deployed to the managed machine. Each correlation rule is adapted to recognize a given pattern of one or more events indicative of a given condition. Thus, a set of correlation rules comprises a set of efficiently-coupled state machines, each of which is optimized for a particular, low-level logical function. Then, as events are received and/or generated at the machine, the events are examined by the state machines comprising the correlator to search for the defined event patterns. If a given event pattern is recognized (usually across two or more state machines that have a given relationship), a given condition sought to be monitored has occurred, and the event correlator may then be used (by itself or in association with another utility or routine) to take a given action. That action, for example, may be issuing a control signal to control the software agent to perform some task, to deploy another software agent, or to effect some other action within or without the managed machine.
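By way of illustration only, the method just described may be sketched in Java as follows. The sketch assumes a hypothetical Event type carrying a source name and a status string, a hypothetical two-state MatchMachine standing in for one correlation rule, and a hypothetical Correlator that takes its action only when every coupled machine has recognized its event. None of these names is taken from the invention itself, and the "all machines matched" relationship is merely one possible coupling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical event type (an assumption): a source name plus a status string.
final class Event {
    final String source;
    final String status;

    Event(String source, String status) {
        this.source = source;
        this.status = status;
    }
}

// One correlation rule as a two-state machine: it stays unmatched until an
// event satisfies its criteria, then remains matched until reset.
final class MatchMachine {
    private final Predicate<Event> criteria;
    private boolean matched = false;

    MatchMachine(Predicate<Event> criteria) {
        this.criteria = criteria;
    }

    void accept(Event e) {
        if (!matched && criteria.test(e)) {
            matched = true;
        }
    }

    boolean isMatched() {
        return matched;
    }

    void reset() {
        matched = false;
    }
}

// Hypothetical correlator: feeds each event to every machine and runs the
// action once all machines are in the matched state, then resets them.
final class Correlator {
    private final List<MatchMachine> machines = new ArrayList<>();
    private final Runnable action;

    Correlator(Runnable action) {
        this.action = action;
    }

    Correlator add(MatchMachine m) {
        machines.add(m);
        return this;
    }

    void onEvent(Event e) {
        for (MatchMachine m : machines) {
            m.accept(e);
        }
        boolean all = !machines.isEmpty();
        for (MatchMachine m : machines) {
            all &= m.isMatched();
        }
        if (all) {
            action.run();
            for (MatchMachine m : machines) {
                m.reset();
            }
        }
    }
}
```

In this sketch, a correlator built with one machine matching a "full" status from a "disk" source and another matching a "down" status from a "db" source would run its action only after both events have been observed, in either order, and would ignore unrelated events in between.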
The particular types of correlation rules implemented by the state machines may be quite varied. In the preferred embodiment, the types of rules are typically limited for ease of use and portability. Thus, a representative set of correlation rules may include a matching rule triggered by an event that satisfies given search criteria defined in the matching rule. Another type is a duplicate rule triggered by a given event associated with a given condition. Where the duplicate rule is used, the given action includes ignoring the given event for a specified time period after occurrence of the given condition. Another rule type is a pass through rule triggered by a given event sequence. A reset rule is the opposite of a pass through rule and is thus triggered by non-occurrence of a given event sequence. Yet another type of rule is a threshold rule triggered by a specified number of similar events in the event stream. Typically, some defined subset of these rule types is used to derive a particular event correlator for a given managed machine. Thus, for example, the event correlator may be programmed to generate an output (e.g., another event) if the event stream from a first source satisfies a first rule and the event stream from a second source satisfies a second rule. The event correlator is preferably implemented as part of the runtime monitor and thus used to facilitate event monitoring, correlation and control.
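Two of the rule types named above, the threshold rule and the duplicate rule, may be sketched in Java as follows, again purely by way of example. The class names ThresholdRule and DuplicateRule, the use of a plain status string to identify "similar" events, and the caller-supplied millisecond timestamp are all assumptions made for this illustration. The decision to restart the count after the threshold is reached is likewise an assumption.

```java
// Hypothetical threshold rule: triggers when a specified number of similar
// events (here, events with the same status string) has been seen.
final class ThresholdRule {
    private final String status;
    private final int limit;
    private int count = 0;

    ThresholdRule(String status, int limit) {
        this.status = status;
        this.limit = limit;
    }

    // Returns true on the event that reaches the threshold, then starts
    // counting again (an assumption of this sketch).
    boolean accept(String eventStatus) {
        if (!status.equals(eventStatus)) {
            return false;
        }
        count++;
        if (count < limit) {
            return false;
        }
        count = 0;
        return true;
    }
}

// Hypothetical duplicate rule: forwards the first occurrence of an event and
// ignores repeats of it for a specified time period afterward.
final class DuplicateRule {
    private final String status;
    private final long windowMillis;
    private long suppressUntil = Long.MIN_VALUE;

    DuplicateRule(String status, long windowMillis) {
        this.status = status;
        this.windowMillis = windowMillis;
    }

    // Returns true if the event should be forwarded, false if it is a
    // duplicate arriving inside the suppression window.
    boolean accept(String eventStatus, long nowMillis) {
        if (!status.equals(eventStatus)) {
            return true; // events of other kinds are not affected by this rule
        }
        if (nowMillis < suppressUntil) {
            return false;
        }
        suppressUntil = nowMillis + windowMillis;
        return true;
    }
}
```

Passing the current time as an argument, rather than reading a system clock inside the rule, keeps each state machine deterministic and inexpensive to evaluate, which is in keeping with the lightweight character of the correlator.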
At any given node, the defined set of efficiently coupled state machines evaluates patterns of events or traps. As the set of state machines only evaluates certain types of events according to a limited set of rules, the mechanism is very fast and consumes few system resources. As noted above, once an appropriate match is found, i.e., the correct set of events or faults, some given action may be taken.
The foregoing has outlined some of the more pertinent objects of the present invention. These objects should be construed to be merely illustrative of some of the more prominent features and applications of the invention. Many other beneficial results can be attained by applying the disclosed invention in a different manner or modifying the invention as will be described. Accordingly, other objects and a fuller understanding of the invention may be had by referring to the following Detailed Description of the preferred embodiment.