Applications for monitoring data processing systems play a key role, especially in managing large systems with distributed architectures. Such monitoring applications may be used to detect any critical conditions that occur in the systems. Information gathered by a monitoring application can then be used for enforcing appropriate corrective actions in an attempt to remedy unfavorable situations, or for off-line analysis.
The process of monitoring a system is typically based on the periodic measurement of predefined state parameters such as processing power usage. The monitoring application detects a critical condition when the state parameter reaches a predefined threshold value.
Some monitoring applications known in the art allow different levels of critical conditions to be defined, each with a corresponding threshold value. For example, a warning condition may be detected when processing power usage is higher than 60%, and a dangerous condition when it exceeds 80%. One drawback of this approach is that the monitoring application may report a very large number of critical conditions, since a critical condition is detected as soon as a state parameter reaches the corresponding threshold value. As a result, a system administrator may be swamped with notifications caused by transient problems such as spikes in processing power usage.
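By way of illustration only, the multi-level threshold scheme described above may be sketched as follows. The names `LEVELS`, `classify` and `detect` are hypothetical and are not taken from any particular monitoring application; the 60% and 80% values are those of the example above, and a strict comparison is assumed.

```python
from typing import List, Optional, Tuple

# Hypothetical threshold table (level name, threshold in percent),
# ordered from most severe to least severe.
LEVELS = [("dangerous", 80.0), ("warning", 60.0)]


def classify(usage: float) -> Optional[str]:
    """Return the most severe level whose threshold the sample exceeds, or None."""
    for name, threshold in LEVELS:
        if usage > threshold:  # assumption: strictly higher than the threshold
            return name
    return None


def detect(samples: List[float]) -> List[Tuple[int, str]]:
    """Report one critical condition for every sample that crosses a threshold."""
    conditions = []
    for index, usage in enumerate(samples):
        level = classify(usage)
        if level is not None:
            conditions.append((index, level))
    return conditions


# Periodic measurements of processing power usage, in percent.
print(detect([50, 65, 85, 55, 90]))
```

Because every measurement above a threshold yields its own notification, a single short spike is reported just like a sustained overload, which is the drawback noted above.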
Other monitoring applications take the persistence of critical conditions into account. In this case, it is possible to define how long a condition must last before it is classified as critical. For this purpose, the monitoring application may define a minimum number of times that a state parameter must reach its threshold value before the occurrences are consolidated into the corresponding critical condition. The occurrences of a potentially troublesome event must be consecutive, or at most separated by a maximum number of allowable missing occurrences, or “holes”. However, this approach requires the definition of a single critical condition for each state parameter, without the possibility of having different levels of detail.
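A minimal sketch of such persistence-based detection is given below, for illustration only. The class `PersistenceDetector` and its parameters are hypothetical. The sketch assumes that holes are counted cumulatively within the current run of occurrences and that the run is discarded once the allowed number of holes is exceeded; a real monitoring application may count holes differently.

```python
class PersistenceDetector:
    """Consolidate repeated threshold crossings into one critical condition."""

    def __init__(self, threshold: float, min_occurrences: int, max_holes: int):
        self.threshold = threshold
        self.min_occurrences = min_occurrences
        self.max_holes = max_holes
        self.hits = 0   # occurrences seen in the current run
        self.holes = 0  # missing occurrences seen in the current run

    def update(self, value: float) -> bool:
        """Process one periodic measurement; return True while the condition is critical."""
        if value >= self.threshold:
            self.hits += 1
        elif self.hits > 0:
            self.holes += 1
            if self.holes > self.max_holes:
                # Too many holes (assumed cumulative): discard the current run.
                self.hits = 0
                self.holes = 0
        return self.hits >= self.min_occurrences


# Three occurrences are required, with at most one hole between them.
detector = PersistenceDetector(threshold=80.0, min_occurrences=3, max_holes=1)
print([detector.update(v) for v in [85, 70, 90, 88, 60, 50, 95]])
```

With these settings an isolated spike never produces a notification, but only one critical condition can be expressed for the state parameter, in line with the limitation noted above.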
In any case, monitoring applications known in the art provide only static information about the health of a system. In other words, the administrator is notified simply of the occurrence of a problem, without receiving any information about the actual dynamics of the system.
Therefore, the information provided by the monitoring application can be used only to restore proper operation after a critical condition has been detected, rather than to prevent the occurrence of problems in the future.