Applications for monitoring data processing systems play a key role in their management. For example, those applications are used to detect any critical condition in the system (so that appropriate corrective actions can be taken in an attempt to remedy the situation). Typically, the essential information relating to the critical conditions being detected is logged; the information is then available for off-line analysis through data warehousing techniques.
For this purpose, selected performance parameters of the system (such as processing power consumption, memory space usage, bandwidth occupation, and the like) are measured periodically. The information so obtained is then interpreted (for example, according to a decision tree) so as to identify any critical condition of the system. For example, the occurrence of a slow response time of the system can be inferred when both the processing power consumption and the memory space usage exceed corresponding threshold values. The monitoring applications known in the art are configured with predefined corrective actions, which are launched in response to the detection of corresponding critical conditions.
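The known approach described above can be illustrated with a minimal sketch. It is not taken from any particular monitoring application: the threshold values, the condition name `degraded_response_time`, the action name `restart_system` and all function names are hypothetical, chosen only to show a single-rule decision on two measured parameters followed by the lookup of a predefined corrective action.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Sample:
    """One periodic measurement of the monitored system (illustrative fields)."""
    cpu_percent: float     # processing power consumption
    memory_percent: float  # memory space usage


# Hypothetical threshold values; a real deployment would configure these.
CPU_THRESHOLD = 80.0
MEMORY_THRESHOLD = 90.0

# Hypothetical mapping from a detected critical condition to its
# predefined corrective action.
CORRECTIVE_ACTIONS = {"degraded_response_time": "restart_system"}


def detect_condition(sample: Sample) -> Optional[str]:
    """Infer a critical condition only when BOTH parameters exceed their thresholds."""
    if sample.cpu_percent > CPU_THRESHOLD and sample.memory_percent > MEMORY_THRESHOLD:
        return "degraded_response_time"
    return None


def monitor(samples: Iterable[Sample]) -> List[str]:
    """Return the corrective actions that would be launched for a series of samples."""
    launched = []
    for sample in samples:
        condition = detect_condition(sample)
        if condition is not None:
            launched.append(CORRECTIVE_ACTIONS[condition])
    return launched


if __name__ == "__main__":
    history = [Sample(40.0, 55.0), Sample(92.0, 60.0), Sample(95.0, 97.0)]
    print(monitor(history))
```

In this sketch an action is selected only for the last sample, in which both thresholds are exceeded; nothing is done while the parameters are merely approaching their thresholds.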
A drawback of the solutions described above is that they can only be used to restore the correct operation of the system. Indeed, the corrective actions are executed only when a problem has become severe and the system cannot continue working properly. Therefore, those solutions are completely ineffective in preventing the occurrence of problems in the system.
Moreover, the corrective actions typically try to reset the system to the condition it was in before the occurrence of the problem. However, this strategy is often ineffective in eliminating the problem on a long-term basis (so the same problem is likely to appear again in the future).
In any case, the corrective actions must be quite aggressive to be effective in solving the problems; for example, the corrective actions can involve restarting the system, deleting temporary files or eliminating jobs from a queue. Therefore, the corrective actions typically have detrimental side effects. For example, the application of the corrective actions can cause an abrupt decrease in the performance of the system (and hence of any application running thereon). Moreover, most corrective actions have a potentially high impact on the business relating to operation of the system; for example, the corrective actions can cause a service interruption or a loss of valuable data. Therefore, those corrective actions must be used very carefully; as a consequence, most system administrators are reluctant to enable the above-mentioned functionality of the monitoring applications.