As information systems become ubiquitous, and companies and organisations in all sectors become heavily dependent on their computing resources, the requirement for availability of the hardware and software components of an IT network, and of the services based on it (hereinafter all three are generally referred to as “objects”), is increasing, while the complexity of IT networks is growing.
Monitoring systems are available which enable the availability and performance of objects within an IT network to be monitored and managed.
For example, Hewlett-Packard offers such a product under the name “OpenView VantagePoint”. A personal computer, server, network interconnecting device or any system with a CPU is called a node. The nodes of an IT network monitored by such a monitoring system are called monitored nodes. On a monitored node, a program or process runs as a background job which monitors the occurrence of certain events (e.g. error messages) at the node and generates event-related messages according to rules which can be defined by a user. Such a program or process is called an “agent”. An agent is not limited to passive monitoring, e.g. by collecting error messages. Rather, it can carry out active monitoring of hardware and processes. For example, an agent can periodically (e.g. every five minutes) send requests to a process (e.g. an Oracle process) to find out whether the process is still active. A response saying that the process is no longer active (or the absence of a response) may also constitute an “event”. The messages generated by the agents are collected by a monitoring server, which stores and processes them and routes the processing results to a monitoring console, by means of which an IT administrator or operator can view the status and/or performance of the IT objects.
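The active monitoring described above may be illustrated by the following minimal sketch. It is not taken from any particular product; the names `Message`, `Agent`, `poll_once` and `run_polls` are hypothetical, and the request sent to the monitored process is represented by a caller-supplied `probe` function, which is assumed to return `True` while the process is active and to raise `TimeoutError` when no response arrives.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Message:
    node: str  # monitored node on which the event occurred
    obj: str   # monitored object, e.g. a process name
    text: str  # event-related message text


class Agent:
    """Hypothetical agent that actively checks one process on one node."""

    def __init__(self, node: str, obj: str, probe: Callable[[], bool]):
        self.node = node
        self.obj = obj
        self.probe = probe  # assumed to return True while the process is active

    def poll_once(self) -> Optional[Message]:
        """Send one request; return a message only if an event occurred."""
        try:
            active = self.probe()
        except TimeoutError:
            # The absence of a response also constitutes an event.
            return Message(self.node, self.obj, "no response")
        if not active:
            return Message(self.node, self.obj, "process not active")
        return None


def run_polls(agent: Agent, polls: int) -> List[Message]:
    """Run a fixed number of polls and collect the generated messages.

    A real agent would wait between polls (e.g. five minutes); the wait
    is omitted here so that the sketch runs immediately.
    """
    messages = []
    for _ in range(polls):
        msg = agent.poll_once()
        if msg is not None:
            messages.append(msg)
    return messages
```

In this sketch a message is generated only when the probe reports an inactive process or fails to respond, so that a healthy process produces no traffic towards the monitoring server.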
A monitoring system of that kind increases the availability of the IT objects under consideration since it enables a fault or failure of a component of the monitored network to be quickly detected so that a repair action or the like can immediately be started.
In such a known monitoring system, the operator is often overwhelmed by the large number of messages displayed at the monitoring console. Some of these messages are related to others, for example, when an application goes down and starts up again. Another example of related messages is the case in which a network router goes down, so that all nodes beyond the router can no longer be reached. As a consequence, the agent associated with the router will not only send messages indicating that the router is down, but will also send a large number of messages indicating that the nodes beyond the router are not available. Another fraction of the messages sent to the monitoring console consists of messages which report similar or identical events, for example three messages reporting that a user has switched to user root three times. Yet another fraction reports how a problem situation evolves with time, for example, how the amount of free disk space decreases or increases on a monitored node.
In some monitoring systems, messages can be correlated in order to suppress redundant messages, such as messages reporting identical events or the evolution of a problem situation. This is achieved by a kind of superordinate correlation analysis which is carried out by the monitoring server. However, the definition of this correlation analysis, which has to be provided by the user, is rather complicated. Further, the correlation analysis requires considerable computing and database access resources. Moreover, a correlation of messages is not always possible, since the available messages do not always contain the full information required for a correlation check.
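By way of illustration only, the simplest form of such a suppression, namely the dropping of messages which report identical events, may be sketched as follows. This is not the correlation analysis of any known monitoring system; the function name `suppress_duplicates`, the tuple layout of a message and the 300-second window are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

# (timestamp in seconds, node, message text); this layout is an assumption.
Msg = Tuple[float, str, str]


def suppress_duplicates(messages: Iterable[Msg], window: float = 300.0) -> List[Msg]:
    """Keep a message only if no identical message from the same node
    was seen within the preceding `window` seconds."""
    last_seen: Dict[Tuple[str, str], float] = {}
    kept: List[Msg] = []
    for ts, node, text in messages:
        key = (node, text)
        prev = last_seen.get(key)
        if prev is None or ts - prev > window:
            kept.append((ts, node, text))
        # Remember every occurrence, so that a steady stream of
        # identical messages stays suppressed.
        last_seen[key] = ts
    return kept
```

Such a check requires that each message carries enough information (here the node and the message text) to be recognised as a repetition. Related but non-identical messages, such as those caused by a router failure, are not covered by this sketch.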
Therefore, a monitoring system is desirable in which identical or related events can be more easily suppressed.