The present invention relates generally to self-healing systems. More specifically, the invention relates to a method of ignoring redundant symptoms in self-healing systems which use a correlation engine and an analysis engine to diagnose and resolve fault conditions within the system.
Autonomic computing systems, also known as self-healing systems, are computing systems which incorporate knowledge about how to diagnose and repair faults occurring within themselves. In many such systems, this knowledge may consist of rules which define a variety of observable phenomena within the system. For each such observable phenomenon which is deemed harmful, the rules may further prescribe a specific corrective action. Autonomic systems may also incorporate non-rule-based techniques, including probabilistic algorithms and artificial intelligence.
Self-healing ability is beneficial for many reasons. It is frequently impractical for humans to manually monitor a system at all hours. Even when the resources exist to do so, human involvement inherently incurs the risk of human error. Moreover, some systems are so complex that no single human being understands every part of the system sufficiently to be able to correct all possible fault conditions. By contrast, self-healing methods may detect and correct faults across arbitrarily large systems.
Many autonomic computing systems known in the art incorporate two primary components: a correlation engine and an analysis engine. The correlation engine receives a stream of events, each representing an observable phenomenon which recently occurred within the system. For example, the failure of an attempt to connect to a database may constitute an event. A single event may independently represent a fault condition requiring corrective action. More frequently, however, a fault condition is detected based on multiple related events. Thus, the correlation engine may apply rules to correlate events into sets of events which, as a group, may represent a fault condition. Such sets are known in the art as “correlated event sets.” It is noted that a correlated event set need not represent a fault condition with certainty. For example, the correlated event set may represent a substantial probability of a fault, with further analysis required to positively determine whether or not a fault has occurred. Thus, when a correlated event set is determined, the correlation engine forwards it to the analysis engine.
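By way of illustration only, the correlation step described above may be sketched as follows. The sketch assumes one simple, hypothetical rule: events of a given kind reported by the same source are grouped when they fall within a fixed time window, and a group is reported as a correlated event set only if it contains a minimum number of events. The names `Event`, `correlate`, `window` and `min_count` are assumptions introduced for this example and are not taken from any particular system; a practical correlation engine may apply many rules of differing form.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; the field names are assumptions."""
    kind: str         # e.g. "db_connect_failed"
    source: str       # component that reported the event
    timestamp: float  # seconds since an arbitrary reference point


def correlate(events: List[Event], kind: str,
              window: float, min_count: int) -> List[List[Event]]:
    """Group events of one kind, per source, into correlated event sets.

    A set is a run of events in which every event lies within `window`
    seconds of the first event of the run. Runs with fewer than
    `min_count` events are discarded as insufficient evidence.
    """
    # Bucket the matching events by source, in chronological order.
    by_source: Dict[str, List[Event]] = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.kind == kind:
            by_source.setdefault(event.source, []).append(event)

    correlated_sets: List[List[Event]] = []
    for source_events in by_source.values():
        current: List[Event] = []
        for event in source_events:
            # Close the current run once the window has been exceeded.
            if current and event.timestamp - current[0].timestamp > window:
                if len(current) >= min_count:
                    correlated_sets.append(current)
                current = []
            current.append(event)
        if len(current) >= min_count:
            correlated_sets.append(current)
    return correlated_sets


# Example stream: three failures from one source in quick succession,
# plus unrelated or isolated events that should not form a set.
stream = [
    Event("db_connect_failed", "app1", 0.0),
    Event("db_connect_failed", "app1", 1.0),
    Event("disk_warning", "app1", 1.5),
    Event("db_connect_failed", "app1", 2.0),
    Event("db_connect_failed", "app2", 0.5),
    Event("db_connect_failed", "app1", 100.0),
]
sets_found = correlate(stream, "db_connect_failed", window=10.0, min_count=3)
```

In this example only the three closely spaced failures reported by the source "app1" form a correlated event set. The isolated failures are not forwarded, consistent with the observation that a fault condition is more frequently detected based on multiple related events.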
The analysis engine determines which corrective action, if any, should be taken in response to the correlated event set provided by the correlation engine. For example, in response to a correlated event set specifying that a database connection failed, the analysis engine may determine that the database crashed. It may, as a result, restart the database server.
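The database example above may be sketched, purely for illustration, as a function which receives the kinds of events in a correlated event set and returns the name of a corrective action, or nothing when no action is warranted. The function name `analyze`, the event kind `"db_connect_failed"`, the action name `"restart_database_server"` and the `database_is_alive` probe are all hypothetical and are assumed only for this example. The probe stands in for whatever further analysis a given system uses to confirm that a fault has actually occurred.

```python
from typing import Callable, List, Optional


def analyze(event_kinds: List[str],
            database_is_alive: Callable[[], bool]) -> Optional[str]:
    """Return the name of a corrective action, or None if none is needed.

    `event_kinds` lists the kinds of events in one correlated event set.
    `database_is_alive` is a hypothetical probe supplied by the caller.
    """
    if "db_connect_failed" in event_kinds:
        # The event set only suggests a fault; confirm it before acting.
        if not database_is_alive():
            return "restart_database_server"
    return None


# The same correlated event set leads to different outcomes depending on
# what the further analysis finds.
action_when_down = analyze(["db_connect_failed"], lambda: False)
action_when_up = analyze(["db_connect_failed"], lambda: True)
```

Here the corrective action is returned as a name rather than executed, which keeps the diagnostic decision separate from the mechanism that carries it out. That separation is a design choice of this sketch rather than a requirement of an analysis engine.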
In modular self-healing systems, the correlation engine and analysis engine are typically independent of each other. Typically, neither engine has any knowledge of the workings of the other. Furthermore, communication between the two engines is typically limited.