Monitoring applications play a key role in the management of data processing systems. For example, those applications are used to detect any critical condition in the system, so that appropriate corrective actions can be taken in an attempt to remedy the situation. Typically, the essential information relating to each detected critical condition is logged; the information is then available for off-line analysis through data warehousing techniques.
For this purpose, predefined state parameters of the system (such as processing power consumption, memory space usage, bandwidth occupation, and the like) are measured periodically. The information so obtained is then interpreted according to a decision tree. The decision tree includes intermediate nodes, each defining a test based on the state parameters; the branches descending from an intermediate node correspond to the possible outcomes of its test. Each leaf node identifies the condition of the system (correct or critical). Typically, the tests are based on comparisons between one or more state parameters and corresponding threshold values. The threshold values are defined statically by an administrator of the system; for example, a slow response time of the system can be inferred when the processing power consumption exceeds 70% and at the same time the memory space usage exceeds 60%.
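The threshold-based decision tree described above can be sketched as follows. This is only a minimal illustration, not the implementation of any particular monitoring application: the class names `Leaf` and `Decision`, the function `classify`, and the parameter keys `"cpu"` and `"memory"` are assumptions introduced here, while the 70% and 60% threshold values are the ones given in the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

# A measurement of the state parameters, as percentages (hypothetical keys).
State = Dict[str, float]


@dataclass
class Leaf:
    """Leaf node: identifies the condition of the system."""
    condition: str  # "correct" or "critical"


@dataclass
class Decision:
    """Intermediate node: a test with one branch per possible outcome."""
    predicate: Callable[[State], bool]
    if_true: "Node"
    if_false: "Node"


Node = Union[Leaf, Decision]


def classify(node: Node, state: State) -> str:
    """Walk the tree from the root until a leaf node is reached."""
    while isinstance(node, Decision):
        node = node.if_true if node.predicate(state) else node.if_false
    return node.condition


# Static threshold values, as an administrator would define them.
CPU_THRESHOLD = 70.0     # processing power consumption, %
MEMORY_THRESHOLD = 60.0  # memory space usage, %

tree = Decision(
    predicate=lambda s: s["cpu"] > CPU_THRESHOLD,
    if_true=Decision(
        predicate=lambda s: s["memory"] > MEMORY_THRESHOLD,
        if_true=Leaf("critical"),
        if_false=Leaf("correct"),
    ),
    if_false=Leaf("correct"),
)

print(classify(tree, {"cpu": 85.0, "memory": 72.0}))
print(classify(tree, {"cpu": 85.0, "memory": 40.0}))
```

In this sketch the critical condition is reported only when both tests succeed; the first measurement exceeds both thresholds, whereas the second exceeds only the processing power threshold.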
A drawback of the solution described above is that the definition of the threshold values is strongly dependent on the characteristics of the system to be monitored. Therefore, this process requires a deep knowledge of the system; in any case, the results always depend on the skill of the administrator. Moreover, the threshold values cannot be defined in general terms for every system. For example, a processing power consumption lower than 70% can be acceptable in most practical situations; however, a far lower threshold value (for example, 50%) could be necessary in critical applications. Likewise, the behavior of the system usually changes at run-time, so that the threshold values set at the beginning may no longer be valid later on. As a consequence, the threshold values must be selected according to the worst case, thereby increasing the number of (alleged) critical conditions that are detected.
In any case, the available solutions can only be used to restore the correct operation of the system. Indeed, the decision tree detects a critical condition only when it has already occurred and the system can no longer work properly. Therefore, those solutions are completely ineffective in preventing the occurrence of problems in the system.
A possible solution would be to lower the threshold values; in this manner, it is possible to reduce the risk of experiencing a malfunction in the system (since the critical conditions are detected in advance). However, this approach has a deleterious effect on the operation of the system; indeed, the use of lower threshold values involves a dramatic increase in the number of (alleged) critical conditions that are detected.
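The effect of lowering a threshold value can be illustrated with a short sketch. The measurements below are invented purely for illustration and do not come from any real system; the function name `count_detections` is likewise an assumption. Only the processing power threshold is lowered, from 70% to the 50% value mentioned earlier for critical applications, while the memory threshold stays at 60%.

```python
from typing import List, Tuple

# One periodic measurement: (processing power %, memory usage %).
Sample = Tuple[float, float]


def count_detections(samples: List[Sample],
                     cpu_threshold: float,
                     memory_threshold: float) -> int:
    """Count the samples in which both thresholds are exceeded."""
    return sum(
        1
        for cpu, memory in samples
        if cpu > cpu_threshold and memory > memory_threshold
    )


# Invented sample values, for illustration only.
samples: List[Sample] = [
    (45, 50), (55, 65), (62, 58), (72, 66),
    (52, 61), (68, 70), (40, 30), (58, 63),
]

baseline = count_detections(samples, cpu_threshold=70, memory_threshold=60)
lowered = count_detections(samples, cpu_threshold=50, memory_threshold=60)

print(baseline)  # 1
print(lowered)   # 5
```

On the same measurements, the lower threshold flags five samples instead of one; the sketch cannot tell whether the additional detections correspond to real problems or to merely alleged critical conditions.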
An additional drawback is that the corrective actions taken in response to the detection of the critical conditions can be ineffective. Particularly, in many situations it is not possible to ascertain whether the critical condition detected by the monitoring application actually requires any corrective action. A typical example is that of a transient phenomenon, wherein the system automatically recovers its correct operation; in this case, it would be preferable to take no corrective action (since any intervention on the system could worsen the situation).