1. Technical Field
The present invention relates to computer system and network management, and more particularly to systems and methods for determining the importance of alerts in computing systems for problem determination.
2. Description of the Related Art
The complexity of large computing systems has raised unprecedented challenges for system management. Rule-based systems are widely deployed in practice for operational system management. However, alerts from different rules usually differ in problem-reporting accuracy because their thresholds are often set manually based on operators' experience and intuition. Meanwhile, due to system dependencies, a single problem may trigger many alerts at the same time in a large system, and a critical question is which alert should be analyzed first in the subsequent problem determination process.
In current rule-based systems, this question is handled in one of two ways. In one solution, each rule works in its own isolated local context, and operators have to check alerts one by one. They may use some limited domain knowledge to decide the importance of alerts; for example, an alert from a DNS server is treated as more important than an alert from a printer. Such an approach is neither scalable nor practical for large, highly complex systems.
In a second solution, event correlation mechanisms are used to correlate a set of alerts with a specific problem, i.e., to define the signature of each known problem as a set of alerts. This approach has to assume prior knowledge of the various problems and their signatures. However, many problems in large and complex IT systems are neither anticipated nor well understood. Due to system dynamics and uncertainties, even the same problem may manifest itself in very different ways. Therefore, it is difficult to precisely define problem signatures in complex and dynamic systems.
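By way of illustration only, the signature-based correlation described above can be sketched as a lookup from known problems to sets of alerts. The sketch is a minimal one under stated assumptions: the signature table, the alert names, and the function `match_signatures` are all hypothetical and are not taken from any particular event correlation product.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical signature table: each known problem is defined by the
# set of alerts expected to fire together when that problem occurs.
SIGNATURES: Dict[str, FrozenSet[str]] = {
    "dns_outage": frozenset({"dns_timeout", "web_5xx", "mail_queue_high"}),
    "disk_full": frozenset({"disk_usage_high", "db_write_fail"}),
}


def match_signatures(active_alerts: Set[str]) -> List[str]:
    """Return known problems whose complete signature appears in the active alerts."""
    return sorted(
        name for name, signature in SIGNATURES.items()
        if signature <= active_alerts  # subset test: every signature alert is active
    )


# A problem that manifests exactly as anticipated is recognized.
full_match = match_signatures(
    {"dns_timeout", "web_5xx", "mail_queue_high", "cpu_high"}
)

# The same problem manifesting with one alert missing is not recognized.
partial_match = match_signatures({"dns_timeout", "web_5xx"})

# An unanticipated problem has no signature at all.
unknown_match = match_signatures({"fan_failure"})
```

In this sketch the second and third calls return an empty list, which reflects the limitation noted above: a problem is identified only if it was anticipated and manifests in the way its signature prescribes.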