The present invention generally relates to computer systems, and more particularly to monitoring the state of various resources in a computer system.
Network management is the process of controlling a complex data network to maximize its efficiency and productivity. Network management can include, for example, the process of locating problems, or faults, on a data network. It can also include measuring the performance of network hardware, software, and media. The utilization of network resources by individual users and groups of users can be tracked to better ensure that users have sufficient resources.
As networks become larger and more complex, the need has grown for centralized and simplified network management. At present, network administrators often manually check, or run scripts to evaluate, the state (i.e., health) of resources on network machines, e.g., to inspect free disk space to see if disk space is getting low, or CPU utilization to see if loads need to be distributed to other machines. The information is maintained in various databases and flat tables that are separate, unstructured, and difficult to evaluate. There exists a need for a more sophisticated and organized way to model, access and evaluate machine data in order to monitor the health of resources on a network.
The present invention provides a hierarchical, object-oriented definition of the health of resources on a machine or network. To this end, a schema of objects is defined to represent machine and/or network health. One class of objects defines data groups (DataGroups) that can be nested to represent machine resources. DataGroups act essentially as folders to contain DataCollectors. For example, one DataGroup object may represent machine software, with DataGroups under it representing processes and services. Another DataGroup may represent machine hardware, with other DataGroups, hierarchically below the machine hardware DataGroup, representing one or more disk drives, processors, memory and so forth.
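By way of illustration only, the nesting of DataGroups described above might be sketched as follows. The class layout, the `add_group` and `path` helpers, and the group names are assumptions made for this sketch and are not part of the schema itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class DataGroup:
    """Hypothetical folder-like node that holds nested groups and collectors."""
    name: str
    parent: Optional["DataGroup"] = field(default=None, repr=False)
    children: List["DataGroup"] = field(default_factory=list)
    collectors: List[str] = field(default_factory=list)  # collector names only

    def add_group(self, name: str) -> "DataGroup":
        """Create a child DataGroup hierarchically below this one."""
        child = DataGroup(name, parent=self)
        self.children.append(child)
        return child

    def path(self) -> str:
        """Return the chain of group names from the root down to this group."""
        parts = []
        node: Optional[DataGroup] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))


machine = DataGroup("Machine")
software = machine.add_group("Software")
processes = software.add_group("Processes")
hardware = machine.add_group("Hardware")
disks = hardware.add_group("Disk Drives")
disks.collectors.append("LogicalDisk")
```

In this sketch, `disks.path()` yields `Machine/Hardware/Disk Drives`, mirroring the hardware branch of the hierarchy.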
In addition to DataGroup objects, three other types of objects are defined in the schema, namely DataCollectors, Thresholds and Actions. DataCollectors are objects that, when instantiated, collect data from a DataGroup resource (e.g., via Microsoft® Corporation's Windows Management Instrumentation (WMI)) by polling or event notification. For example, there can be a DataCollector for LogicalDisk information that collects the information about a logical disk that is to be monitored. DataCollector properties include a collection interval, i.e., how often to request the data; scheduling, e.g., the days and hours that the DataCollector is active; and path information, i.e., where to obtain the data and how to obtain it (via polling or event-based notification). DataCollectors may specify WMI queries in their respective properties, or specify WMI polled methods.
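The DataCollector properties described above (collection interval, schedule, and path) might be modeled as in the following minimal sketch. The field names, the `collect` callable standing in for an actual WMI request, and the example query text are all assumptions for illustration; no real WMI call is made.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, FrozenSet, Optional


@dataclass
class DataCollector:
    """Hypothetical polled collector with an interval and an active schedule."""
    name: str
    path: str                                # where to obtain the data
    collect: Callable[[], float]             # stand-in for the real data source
    interval_seconds: int = 60               # how often to request the data
    active_days: FrozenSet[int] = frozenset(range(7))  # Monday is 0
    active_hours: range = range(0, 24)       # hours of the day when active
    last_run: Optional[datetime] = None

    def is_due(self, now: datetime) -> bool:
        """True if the schedule allows collection and the interval has elapsed."""
        if now.weekday() not in self.active_days:
            return False
        if now.hour not in self.active_hours:
            return False
        if self.last_run is None:
            return True
        return (now - self.last_run).total_seconds() >= self.interval_seconds

    def poll(self, now: datetime) -> Optional[float]:
        """Collect one sample if due; otherwise return None."""
        if not self.is_due(now):
            return None
        self.last_run = now
        return self.collect()


# Illustrative query text only; the sample value is made up.
free_space = DataCollector(
    name="LogicalDisk",
    path="SELECT FreeSpace FROM Win32_LogicalDisk WHERE DeviceID='C:'",
    collect=lambda: 42.0,
    interval_seconds=60,
    active_days=frozenset({0, 1, 2, 3, 4}),  # weekdays only
    active_hours=range(8, 18),               # 08:00 up to 17:59
)
```

An event-based collector would replace `poll` with a callback registered with the data source; only the polled case is sketched here.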
Threshold class objects as defined in the schema are associated with the DataCollector objects, and each essentially provides the threshold or thresholds (rules) against which a DataCollector's collected data is evaluated. Threshold properties specify how to use the collected data, e.g., its current value, average value, or a difference between collections. Another property specifies the test condition to apply, such as less than, greater than, equal, not equal, contains or does not contain. Other Threshold properties may be used to specify a duration, i.e., how long a value should remain in violation of a Threshold before it is considered an actual violation.
Dynamic multi-instance thresholding is also provided, wherein a DataCollector dynamically discovers resources (e.g., running processes), collects data from those resources, and applies Thresholds to the collected data. As a result, prior knowledge of resources (e.g., the identity of the processes running at any given time) is not required to monitor those resources.
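A Threshold of the kind described above might be evaluated as in the following sketch. The property names are hypothetical, and the duration is modeled here as a count of consecutive violating collections, which is one assumed interpretation; a time-based duration would work similarly.

```python
import operator
from dataclasses import dataclass, field
from typing import Any, List

# Test conditions named in the text, mapped to Python comparisons.
CONDITIONS = {
    "<": operator.lt,
    ">": operator.gt,
    "==": operator.eq,
    "!=": operator.ne,
    "contains": lambda value, limit: limit in value,
    "not contains": lambda value, limit: limit not in value,
}


@dataclass
class Threshold:
    """Hypothetical rule applied to each sample a DataCollector returns."""
    use: str            # "current", "average", or "difference"
    condition: str      # a key of CONDITIONS
    limit: Any
    duration: int = 1   # consecutive violating collections required (assumed)
    samples: List[Any] = field(default_factory=list)
    streak: int = 0

    def _value(self) -> Any:
        """Derive the value to test from the collected samples."""
        if self.use == "current":
            return self.samples[-1]
        if self.use == "average":
            return sum(self.samples) / len(self.samples)
        if self.use == "difference":
            if len(self.samples) < 2:
                return None  # no previous collection to compare against
            return self.samples[-1] - self.samples[-2]
        raise ValueError(f"unknown use: {self.use}")

    def evaluate(self, sample: Any) -> bool:
        """Record a sample; return True once the violation has lasted long enough."""
        self.samples.append(sample)
        value = self._value()
        violated = value is not None and CONDITIONS[self.condition](value, self.limit)
        self.streak = self.streak + 1 if violated else 0
        return self.streak >= self.duration


# Free disk space (percent) must stay below 10 for two collections in a row.
low_disk = Threshold(use="current", condition="<", limit=10, duration=2)
results = [low_disk.evaluate(s) for s in (8, 7, 20)]
```

Here the first low reading alone does not count as a violation; only the second consecutive low reading does, and the streak resets once the value recovers.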
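The dynamic discovery described above might look like the following sketch, in which the set of monitored instances is not known in advance. The `discover_processes` function and its canned snapshots are placeholders for a real enumeration of running processes, and the process names and values are made up.

```python
from typing import Callable, Dict, List


def find_violators(discover: Callable[[], Dict[str, float]], limit: float) -> List[str]:
    """Discover the current instances and return those whose value exceeds limit."""
    readings = discover()  # instance name -> collected value
    return sorted(name for name, value in readings.items() if value > limit)


# Two successive collections; the set of running processes changes between them.
snapshots = iter([
    {"backup.exe": 91.0, "editor.exe": 3.5},
    {"indexer.exe": 55.0, "editor.exe": 4.0},
])


def discover_processes() -> Dict[str, float]:
    """Placeholder discovery step returning CPU utilization per process."""
    return next(snapshots)


first_pass = find_violators(discover_processes, limit=50.0)
second_pass = find_violators(discover_processes, limit=50.0)
```

The same rule is applied to whatever instances exist at collection time, so `indexer.exe` is evaluated in the second pass even though it was never configured by name.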
Action objects specify what happens when a Threshold is violated, such as to send an e-mail, page someone, run a script, execute a command, or send a message alert to a console for an operator to see. Actions may be throttled, for example, to limit the paging of someone to only once per hour. Action objects may be attached anywhere in the hierarchy, and threshold violation events roll up the hierarchy to trigger the actions. Thus, for example, a first action can be set so as to page a first administrator on detection of low disk space, while a second action may be set at the system hardware level (hierarchically above the disk level) so as to e-mail a second administrator of the problem, whereby a single violation of the specified Threshold causes an event that triggers both Actions.
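The roll-up and throttling behavior described above might be sketched as follows. The `Node` and `Action` classes, the use of plain seconds for timestamps, and the logged strings are assumptions for illustration; actual paging or e-mail delivery is replaced by appending to a log.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Action:
    """Hypothetical action with an optional minimum interval between firings."""
    description: str
    min_interval_seconds: float = 0.0  # 0 means no throttling
    last_fired: Optional[float] = None
    log: List[str] = field(default_factory=list)

    def fire(self, event: str, now: float) -> bool:
        """Perform the action unless it is throttled; return whether it fired."""
        if self.last_fired is not None and now - self.last_fired < self.min_interval_seconds:
            return False
        self.last_fired = now
        self.log.append(f"{self.description}: {event}")
        return True


@dataclass(eq=False)
class Node:
    """Hypothetical hierarchy level to which Actions can be attached."""
    name: str
    parent: Optional["Node"] = field(default=None, repr=False)
    actions: List[Action] = field(default_factory=list)

    def raise_violation(self, event: str, now: float) -> List[str]:
        """Roll the event up from this node to the root, firing attached actions."""
        fired = []
        node: Optional[Node] = self
        while node is not None:
            for action in node.actions:
                if action.fire(event, now):
                    fired.append(action.description)
            node = node.parent
        return fired


hardware = Node("Hardware")
disk = Node("Disk", parent=hardware)
disk.actions.append(Action("page first administrator", min_interval_seconds=3600))
hardware.actions.append(Action("e-mail second administrator"))

first = disk.raise_violation("low disk space", now=0.0)
second = disk.raise_violation("low disk space", now=600.0)
```

In this sketch the first violation triggers both Actions, while a repeat ten minutes later triggers only the e-mail because the page is throttled to once per hour.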