1. Technical Field
The present invention relates in general to distributed systems, and in particular to a method and system for determining conditions of interest on distributed systems. Still more particularly, the present invention relates to a method and system for automated determination of conditions of interest on distributed systems.
2. Description of the Related Art
A distributed system is a type of network with a number of separate individual components. In a distributed system, the components are typically input/output locations that are connected to each other or to a common central server. A local area network (LAN) is an example of a distributed system in which computer terminals are connected to a server, which is typically a data processing system. Those skilled in the art understand that many other forms of distributed systems exist, which may be similar to or different from the traditional LAN.
Detection of conditions of interest within a distributed system is one of the primary tasks of a system monitoring/management application. The architecture of most of these applications utilizes data collection components at various nodes that gather information. These components package the information as events, then send them to regional monitoring components. The monitoring components are programmed with criteria, based on the event information, that detect conditions of interest in the system. Often, when a condition of interest is detected, an appropriate set of actions is performed. For example, the condition of interest may include detecting a hack attack on the system. The events which indicate a hack attack are logged and the system or system administrator undertakes some reactive measure. Thus, since the monitoring components are waiting for events, the monitoring components are reactive in nature.
A distributed system comprises a set of networked nodes. On many of these nodes, data collection components gather information of interest about the system. Some of this information is used to diagnose system problems. Often, system-wide problem diagnosis depends upon information generated from multiple nodes in the system. In this case, the information must be accessible in a central location for analysis, since comparison of data must occur at the same node. Once collected, the information is sent to this central location for system-wide analysis. Here, a set of criteria that specifies conditions of interest is evaluated against the information received from the nodes in the system. If a match occurs, a condition of interest has been detected.
Conditions of interest can be time-dependent. For example, if a data item had a certain value yesterday that met part of the criteria for a condition of interest, but has a new value today that does not, yesterday's data is irrelevant and is not considered for today's condition detection. A time window is used for eliminating this prior information. So, for time-dependent criteria, a time window is specified as part of the criteria.
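The effect of such a time window can be illustrated with a minimal Python sketch. The names `Reading`, `within_window`, and `cpu_load`, as well as the one-hour window, are hypothetical and chosen only for illustration; they do not come from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One collected data item (hypothetical structure for illustration)."""
    name: str
    value: float
    timestamp: float  # seconds since an arbitrary starting point


def within_window(readings, now, window_seconds):
    """Keep only readings that are at most window_seconds older than now."""
    return [r for r in readings if 0 <= now - r.timestamp <= window_seconds]


DAY = 24 * 60 * 60

readings = [
    Reading("cpu_load", 0.95, timestamp=0.0),         # yesterday's value
    Reading("cpu_load", 0.40, timestamp=DAY + 60.0),  # today's value
]

# With a one-hour window, yesterday's reading is eliminated and only
# today's reading remains available for condition detection.
recent = within_window(readings, now=DAY + 120.0, window_seconds=3600)
```

Only the readings that survive the window would then be tested against the rest of the criteria.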
For clarity, the central location is referred to as the condition detection component throughout this disclosure. Also, the information transport mechanism is assumed to be via events sent from the data collection components to the condition detection component. This assumption is made to facilitate integration between the new condition detection model and current or prior event monitoring system architectures, since most use events as the information transport mechanism.
FIG. 1 depicts a basic distributed condition detection system with three nodes, node1 107, node2 109, and node3 101. Node1 107 and node2 109 contain data collection components. On node1 107, there are two instances of object type A 111 and 113 and one instance of object type B 115. On node2 109, there is one instance each of object type A 111, object type C 117, and object type D 119. An object type may be any hardware resource, such as a central processing unit (CPU) or a router. Object types may also be software applications modelled as an object or user. Node3 101 contains the condition detection component 105. As can be seen, all data collection components send their information to the condition detection component 105. Here, a set of condition criteria 103 is evaluated against the data received from the collection components to detect conditions of interest.
The conventional approach to system-wide condition detection is centered around event processing. Data is generated at distributed nodes, then formatted as events and sent to the condition detection component as previously described. The received event is now referred to as the event under analysis. When the event is received by the condition detection component, it is processed by a sequence of event filters. Each event filter is based on the event type. That is, a filter written for an event of type A (i.e., A1, A2, ..., AN) will fire when the event under analysis is of type A. A given filter may also fire on multiple event types. Each event filter may have an associated attribute filter. If the event passes the event filter, the attribute filter is tested. If the attributes in the event under analysis match the criteria in the attribute filter, the rule fires. When a rule fires, appropriate actions are performed. An example of a rule is an equation such as A greater than B, while an event is a change of an attribute in the object. An example of an attribute is CPU threshold data in the object. After the event under analysis is processed by all event filters, it is added to an event cache. This cache stores the time/history for the information gathered from the collection components. FIG. 2 depicts the conventional condition detection approach.
The basic makeup and components of FIG. 2 are similar to and described within FIG. 1; however, in FIG. 2, event filters 108 of the condition detection component 106 are illustrated. Whenever a new event is received by the condition detection component 106, a condition of interest may have occurred within the system. The first step for determining if a condition of interest has occurred is to process the received event through the event/attribute filters 108. If a rule fires, some part of the complex condition has been detected. The remaining part of the condition must be evaluated. This is accomplished by searching the event cache for events that satisfy event type and attribute value criteria. If multiple event types are involved in the condition, multiple searches need to be performed.
As an example of the current art, assume a condition of interest as follows:
(A.val greater than 5 and C.val greater than 4) and (A.node equal C.node)
This condition says that a condition of interest has occurred when events of type A and C are received from the same node, event A's attribute value is greater than 5, and event C's attribute value is greater than 4. For this condition to be detected the following criteria must be specified.
Thus, to detect the subject condition for related events, two criteria must be specified, one for each event type received. Then, each action must perform a cache search for its related event. Inspection shows that the original, concisely stated problem must be transformed to the event processing paradigm in order to detect the condition of interest. Hence, some degree of understanding of the underlying condition detection architecture is required to formulate the desired condition. Furthermore, this conventional solution is much more difficult to understand than the original problem statement. The intent of the condition in the original problem is easily understood by simply reading the statement. However, an analysis of the conventional specification is required to understand the meaning of the conventional representation of the condition. The above example is relatively simple in nature. When more types are added to the condition, a permutation of event filters can occur (depending on the condition). This may cause the condition formulation to become rather lengthy, and thus more difficult to understand.
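The shape of the conventional approach described above can be sketched in Python as follows. This is an illustrative assumption only and is not the criteria specification of any actual event monitoring system: the class `ConventionalDetector`, its method names, and the `Event` fields are hypothetical. The sketch shows one rule per event type, each with an event filter, an attribute filter, and an action that searches the event cache for the related event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event record: type, originating node, attribute value."""
    type: str
    node: str
    val: int


class ConventionalDetector:
    """Illustrative event-filter and cache-search detector (assumed design)."""

    def __init__(self):
        self.cache = []     # event cache of previously processed events
        self.detected = []  # (A event, C event) pairs that met the condition

    def _search_cache(self, event_type, node, min_val):
        # Cache search on event type and attribute value criteria.
        return [e for e in self.cache
                if e.type == event_type and e.node == node and e.val > min_val]

    def _rule_for_a(self, event):
        # Event filter: type A. Attribute filter: val greater than 5.
        if event.type == "A" and event.val > 5:
            # Action: search the cache for the related C event.
            for c in self._search_cache("C", event.node, 4):
                self.detected.append((event, c))

    def _rule_for_c(self, event):
        # Event filter: type C. Attribute filter: val greater than 4.
        if event.type == "C" and event.val > 4:
            # Action: search the cache for the related A event.
            for a in self._search_cache("A", event.node, 5):
                self.detected.append((a, event))

    def receive(self, event):
        # The event under analysis passes through every rule, then is cached.
        for rule in (self._rule_for_a, self._rule_for_c):
            rule(event)
        self.cache.append(event)


detector = ConventionalDetector()
detector.receive(Event("A", "n1", 7))  # nothing cached yet, no match
detector.receive(Event("C", "n2", 9))  # different node, no match
detector.receive(Event("C", "n1", 6))  # pairs with the cached A event on n1
```

Even in this small sketch, the single condition is spread across two rules and two cache searches, which is the loss of readability noted above.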
Another problem encountered by many distributed systems occurs when the system comprises computers from different manufacturers. This complicates the task of getting computers to work together efficiently. Computers in these "multi-vendor" distributed systems are usually difficult to operate together because they do not use common data formats or common security mechanisms. The lack of a common network naming scheme also limits the degree to which computers can share information.
The present invention recognizes that it would be advantageous to have an automated method and system for efficiently detecting conditions of interest in distributed systems. It would also be advantageous if such a method and system did not require knowledge of the underlying condition detection architecture.
It is therefore one object of the present invention to provide an improved method and system for distributed systems.
It is another object of the present invention to provide a method and system for determining conditions of interest in distributed systems.
It is yet another object of the present invention to provide an efficient, automated method and system for determining conditions of interest in distributed systems.
The foregoing objects are achieved as is now described. A method is disclosed for detecting conditions of interest in a distributed system. First, a condition of interest is formulated as a Boolean statement and stored at a condition detection component of the distributed system. The condition detection component then selectively receives events related to the condition of interest from a data collection component. Finally, the received events are combined and evaluated against the Boolean statement to determine if the condition is met. In one embodiment, the condition also includes a WITHIN element which filters out events received outside of a pre-specified time period.
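As a rough illustration of the summarized method, the following Python sketch stores a condition as a single Boolean predicate together with an optional time window that stands in for the WITHIN element. It is a minimal sketch under stated assumptions, not the disclosed implementation: the names `Condition`, `ConditionDetector`, `within`, and the `Event` fields are hypothetical, and events are assumed to arrive in timestamp order.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical event record with a timestamp in seconds."""
    type: str
    node: str
    val: int
    time: float


@dataclass
class Condition:
    """A condition of interest stated as one Boolean predicate."""
    types: tuple                    # event types the predicate refers to
    predicate: Callable[..., bool]  # the Boolean statement
    within: Optional[float] = None  # stand-in for the WITHIN element


class ConditionDetector:
    """Illustrative condition detection component (assumed design)."""

    def __init__(self, condition):
        self.condition = condition
        self.events = {t: [] for t in condition.types}

    def receive(self, event):
        """Store a related event; return combinations meeting the condition."""
        if event.type not in self.events:
            return []  # selectively ignore events unrelated to the condition
        if self.condition.within is not None:
            # Drop stored events that fall outside the time period.
            cutoff = event.time - self.condition.within
            for t in self.events:
                self.events[t] = [e for e in self.events[t] if e.time >= cutoff]
        self.events[event.type].append(event)
        # Combine the new event with stored events of the other types.
        pools = [[event] if t == event.type else self.events[t]
                 for t in self.condition.types]
        return [c for c in product(*pools) if self.condition.predicate(*c)]


condition = Condition(
    types=("A", "C"),
    predicate=lambda a, c: a.val > 5 and c.val > 4 and a.node == c.node,
    within=60.0,
)
detector = ConditionDetector(condition)
first = detector.receive(Event("A", "n1", 7, time=0.0))      # no C event yet
late = detector.receive(Event("C", "n1", 6, time=100.0))     # A is too old
matches = detector.receive(Event("A", "n1", 8, time=110.0))  # pairs with C
```

In this sketch the condition is written once, in nearly the same form as its plain statement, and the author of the condition does not write per-type rules or cache searches.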
The above as well as additional objects, features, and advantages of the present invention will become apparent in the following detailed written description.