1. Technical Field
The present invention is directed to managing a large distributed computer enterprise environment and, more particularly, to correlating system and network events in a system having distributed monitors that use events to convey status changes in monitored objects.
2. Description of the Related Art
Companies now desire to place all of their computing resources on the company network. To this end, it is known to connect computers in a large, geographically-dispersed network environment and to manage such an environment in a distributed manner. One such management framework comprises a server that manages a number of nodes, each of which has a local object database that stores object data specific to the local node. Each managed node typically includes a management framework, comprising a number of management routines, that is capable of a relatively large number (e.g., hundreds) of simultaneous network connections to remote machines. As the number of managed nodes increases, the system maintenance problems also increase, as do the odds of a machine failure or other fault.
The problem is exacerbated in a typical enterprise as the number of nodes rises. Of these nodes, only a small percentage are file servers, name servers, database servers, or anything but end-of-wire or "endpoint" machines. The majority of the network machines are simple personal computers ("PCs") or workstations that see little management activity during a normal day.
Thus, as the size of the distributed computing environment increases, it becomes more difficult to centrally monitor system and network events that convey status changes in various monitored objects (e.g., nodes, systems, computers, subsystems, devices and the like). In the prior art, it is known to distribute event monitor devices across machines that are being centrally managed. Such event monitors, however, typically use a full-fledged inference engine to match event data to given conditions sought to be monitored. An "inference engine" is a software engine within an expert system that draws conclusions from rules and situational facts. Implementation of the event monitor in this fashion requires significant local system resources (e.g., a large database), which is undesirable. Indeed, as noted above, it is a design goal to use only a lightweight management framework within the endpoint machines being managed.
Prior art techniques have several other significant disadvantages. One problem is lack of scalability. As the number of connected nodes increases, it has not been possible for an administrator to easily add monitoring capabilities to an appropriate subset of the endpoints with minimal effort. Even when the monitoring application can be configured, it may not operate appropriately under peak conditions. Another significant problem is that local monitors do not have sufficient built-in response capability. In large distributed systems, it is often insufficient to note merely that a monitored value of a particular resource is out of tolerance. Whenever possible, a local attempt to correct the situation must be made. Known systems do not have adequate local response capability. Moreover, some errors have no local remedy and, in those cases, the response must have a corresponding remote action that can be triggered by the client error.
The prior art has not adequately addressed these and other problems. Thus, there remains a need to provide more efficient monitoring techniques within a distributed computer environment wherein distributed monitors use events to convey status changes in monitored objects within the environment.
It is thus a primary object of this invention to provide distributed monitoring of resources within a distributed computing environment.
It is another primary object of this invention to implement a distributed monitor runtime environment at given nodes in a large distributed computer network to facilitate the task of resource monitoring.
It is still another important object of the present invention to provide a robust event-driven control mechanism for correcting out-of-tolerance conditions identified with respect to resources being monitored in a local network system.
It is yet another object of the present invention to facilitate addition of monitoring capabilities to new endpoint machines in a large computer network as the network is scaled.
A more general object of this invention is to provide resource monitoring across a distributed computer environment.
These and other objects of the invention are provided in a method of monitoring implemented within a distributed environment having a management server and a set of managed machines. A given subset of the managed machines includes a distributed management infrastructure. In particular, each managed machine in the given subset includes a runtime environment, which is a platform-level service that can load and execute software agents. One or more software agents are deployable within the distributed environment to facilitate management and other control tasks. The runtime environment at a particular node includes a runtime engine and a distributed monitor (DM) for carrying out monitoring tasks.
A representative monitoring operation involves making a measurement, comparing the measured value against threshold(s), and performing a response for out-of-tolerance conditions. According to the present invention, a monitoring agent may be triggered to run via a timer or upon satisfaction of a given correlation condition. An event correlator is used to determine whether the given correlation condition has been met.
In accordance with one aspect of the invention, there is described a method of monitoring in a distributed computer network having a management server servicing a set of managed computers. The method begins by deploying a management infrastructure across a given subset of the managed computers, the management infrastructure comprising a runtime environment installed at a given managed computer. At the given managed computer, the routine executes a monitoring agent in the runtime environment to determine whether a given threshold has been exceeded. Then, a given action is taken if the given threshold has been exceeded. The monitoring agent is executed upon receipt of an external event or as a result of an internal timer. Execution of the monitoring agent involves taking a measurement, comparing the measurement against the given threshold, and then taking some corrective action if possible.
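By way of illustration only, the measure, compare and correct sequence described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the class name MonitoringAgent, its fields, the trigger labels "timer" and "event", the status strings, and the simulated disk measurement are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class MonitoringAgent:
    """Hypothetical agent: take a measurement, compare it, respond."""
    name: str
    measure: Callable[[], float]                        # takes the measurement
    threshold: float                                    # the given threshold
    correct: Optional[Callable[[float], bool]] = None   # optional local remedy

    def run(self, trigger: str) -> Dict[str, object]:
        """Run once. `trigger` is 'timer' or 'event' (illustrative labels)."""
        value = self.measure()
        if value <= self.threshold:
            status = "ok"
        elif self.correct is not None and self.correct(value):
            # Threshold exceeded and the local corrective action succeeded.
            status = "corrected"
        else:
            # No local remedy; a real system might trigger a remote action.
            status = "escalate"
        return {"agent": self.name, "trigger": trigger,
                "value": value, "status": status}


# Simulated resource state, assumed purely for the example.
disk = {"used_pct": 95.0}


def measure_disk() -> float:
    return disk["used_pct"]


def purge_temp_files(value: float) -> bool:
    # Pretend the cleanup freed space; returns True on success.
    disk["used_pct"] = 60.0
    return True


agent = MonitoringAgent("disk", measure_disk, 90.0, purge_temp_files)
first = agent.run("timer")    # out of tolerance, corrected locally
second = agent.run("event")   # back within tolerance
no_remedy = MonitoringAgent("cpu", lambda: 99.0, 80.0).run("timer")
```

In this sketch the first run exceeds the threshold and applies the local correction, the second run finds the value within tolerance, and an agent without a corrective action reports that escalation is needed.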
Another aspect of the present invention is a method of monitoring in a distributed computer network having a set of managed computers, wherein a management infrastructure is deployed across a given subset of the managed computers and comprises a runtime environment installed at a given managed computer. The method begins by establishing an event class registration list at a given managed computer. Upon receipt of an event having an event class associated therewith, the routine then examines the registration list to determine whether a given monitoring task has expressed interest in the event class. If so, the event is processed through a correlator. Then, a given action is taken (e.g., executing the given monitoring task) if a condition expressed in a correlation rule associated with the monitoring task has been met. The given monitoring task may include a response function to attempt to correct the condition that triggered the task.
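The registration-list lookup described above may be illustrated by the following minimal sketch. It assumes, purely for the example, that an event is a dictionary with a "class" key, that a correlation rule is a predicate over a single event, and that a monitoring task is a callable; the names EventRouter, register and dispatch are hypothetical and do not appear in the disclosure.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Tuple

Event = Dict[str, object]
Rule = Callable[[Event], bool]   # correlation rule (assumed: a simple predicate)
Task = Callable[[Event], str]    # monitoring task (assumed: returns a status)


class EventRouter:
    """Hypothetical event class registration list with rule checking."""

    def __init__(self) -> None:
        # Maps an event class to the (rule, task) pairs interested in it.
        self._registrations: DefaultDict[str, List[Tuple[Rule, Task]]] = (
            defaultdict(list)
        )

    def register(self, event_class: str, rule: Rule, task: Task) -> None:
        """Record that `task` has expressed interest in `event_class`."""
        self._registrations[event_class].append((rule, task))

    def dispatch(self, event: Event) -> List[str]:
        """Run each interested task whose correlation rule is met."""
        results = []
        interested = self._registrations.get(str(event["class"]), [])
        for rule, task in interested:
            if rule(event):
                results.append(task(event))
        return results


router = EventRouter()
router.register(
    "disk_usage",
    rule=lambda e: float(e["used_pct"]) > 90.0,
    task=lambda e: "cleanup started on " + str(e["host"]),
)

hit = router.dispatch({"class": "disk_usage", "host": "ep1", "used_pct": 97.0})
miss = router.dispatch({"class": "disk_usage", "host": "ep1", "used_pct": 40.0})
ignored = router.dispatch({"class": "link_down", "host": "ep1"})
```

Events whose class has no registered task are dropped before any rule is evaluated, which keeps the per-event cost low on an endpoint machine.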
Another aspect of this invention is a monitor system for use in a managed machine connected in a distributed computer network. The monitor system comprises a runtime engine, and an event correlator/router executable in the runtime engine and responsive to an event stream to determine whether a set of one or more events satisfying a given correlation condition has been received. At least one monitor task is also executable in the runtime engine upon satisfaction of the given correlation condition to effect monitoring of a managed local resource. The monitor task may also implement a correction task using the runtime engine or other local resources.
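A correlation condition over a set of events may be illustrated, under assumptions, by the sketch below. The disclosure does not specify the form of the condition; the example assumes one that is satisfied when a given number of events arrive within a sliding time window. The names WindowCorrelator and feed are hypothetical, and timestamps are supplied by the caller so that the example is deterministic.

```python
from collections import deque
from typing import Deque


class WindowCorrelator:
    """Hypothetical condition: `count` events within `window` seconds."""

    def __init__(self, count: int, window: float) -> None:
        self.count = count
        self.window = window
        self._times: Deque[float] = deque()

    def feed(self, timestamp: float) -> bool:
        """Add one event; return True when the condition is satisfied."""
        self._times.append(timestamp)
        # Discard events that have fallen out of the sliding window.
        while self._times and timestamp - self._times[0] > self.window:
            self._times.popleft()
        if len(self._times) >= self.count:
            # Reset so the same events do not trigger the task again.
            self._times.clear()
            return True
        return False


def monitor_task() -> str:
    # Stand-in for a monitor task that checks a managed local resource.
    return "monitor task executed"


correlator = WindowCorrelator(count=3, window=10.0)
log = []
for t in [0.0, 4.0, 20.0, 22.0, 25.0]:
    if correlator.feed(t):
        log.append((t, monitor_task()))
```

Here the events at 0.0 and 4.0 seconds expire before a third arrives, so the monitor task runs only once, when the events at 20.0, 22.0 and 25.0 seconds fall inside a single window.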
The foregoing has outlined some of the more pertinent objects of the present invention. These objects should be construed to be merely illustrative of some of the more prominent features and applications of the invention. Many other beneficial results can be attained by applying the disclosed invention in a different manner or modifying the invention as will be described. Accordingly, other objects and a fuller understanding of the invention may be had by referring to the following Detailed Description of the preferred embodiment.