1. Field of the Invention
This application relates generally to event correlation in complex systems using codebook correlation techniques to relate events, e.g., problems, to observable events such as symptoms and, more particularly, to determining root problems or other events using codebook correlation when some observable events are indeterminable or otherwise unknown.
2. Description of Related Art
Codebook correlation is a technique used for identifying the root cause of a problem or other event in a system. Examples of codebook correlation techniques are described in U.S. Pat. Nos. 5,528,516; 5,661,668; 6,249,755; and 6,868,367, all issued to Yechiam Yemini et al. (herein referred to as the “Yemini et al. patents”), which are incorporated herein by reference in their entirety.
Codebook correlation can be applied to virtually any system generating events. Such systems can include, but are not limited to, enterprise management systems, engineering systems, communications systems, networked information technology (IT) systems, distributed systems, application services, application servers, utility computing systems, autonomic systems, grid computing systems, satellites, business process systems, utility systems, electric power grids, biological systems, medical systems, weather systems, financial market systems, weapons systems, complex vehicles such as spacecraft, medical diagnosis, and financial market analysis.
Briefly, codebook correlation relates events using a data structure such as a mapping, e.g., a table or graph relating particular events represented in a given column with other events represented in a given row. Deterministic or probabilistic approaches can be used. If using a deterministic approach, the intersection of each row and column can be designated, e.g., as “1” if the event causes the other event and “0” otherwise. Alternatively, if using a probabilistic approach, the intersection of each row ‘E2’ and column ‘E1’ can be designated ‘p’, where p is the probability that E1 causes E2.
In one application, codebook correlation can be used to relate particular events, such as problems or other exceptional events, to observable events such as symptoms. In addition, any event, exceptional or not, can be related to symptoms. In this case, the codebook table can have each row “S” corresponding to a symptom, and each column “P” corresponding to a problem. Using a deterministic approach, the intersection of each row and column can be designated, e.g., as “1” if the problem causes the symptom and “0” otherwise. Alternatively, probabilities can be used where the intersection of each row S and column P can be designated p, where p is the probability that P causes S.
Each column thus created specifies the “signature” of a problem, i.e., it identifies a set of symptoms that a problem causes. An observer of the symptoms of a working system can use the codebook columns to quickly identify the respective problem. Several extensions and variations of these correlation techniques are described in the Yemini et al. patents, and are incorporated by reference herein.
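By way of illustration only, a deterministic codebook and an exact signature lookup might be sketched as follows. The symptom names, problem names, table entries, and the function exact_matches are hypothetical and are not taken from the Yemini et al. patents.

```python
# Hypothetical symptoms (codebook rows); the names are illustrative only.
SYMPTOMS = ["S1", "S2", "S3", "S4"]

# Hypothetical deterministic codebook. Each problem (codebook column) maps to
# its signature: one entry per symptom, in SYMPTOMS order, where 1 means the
# problem causes the symptom and 0 means it does not.
CODEBOOK = {
    "P1": (1, 1, 0, 0),
    "P2": (0, 1, 1, 0),
    "P3": (1, 0, 0, 1),
}


def exact_matches(observed, codebook=CODEBOOK):
    """Return the problems whose signature equals the observed symptom vector."""
    observed = tuple(observed)
    return [problem for problem, sig in codebook.items() if sig == observed]


# Symptoms S2 and S3 are observed; S1 and S4 are not.
print(exact_matches((0, 1, 1, 0)))
```

In this sketch, a problem can be identified only when the observed vector equals one of the columns exactly; an observation that equals no column yields an empty list.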
In some cases, the observed symptoms do not exactly match any of the codebook problem (or other event) signatures. In such cases, the distance between the observed symptoms and each problem signature can be determined to find a sufficiently close match. There are several possible ways of determining the distance. One method in the case of deterministic codebooks (with ‘1s’ and ‘0s’ in each column) is to count the number of mismatches between the signature and the observed symptoms. This number of mismatches defines a so-called Hamming distance between the observed symptoms and the signature. Where the columns of the codebook contain probabilities, the task involves finding the most probable combination of problems that generates symptoms closest to the observed ones. A particular definition of distance between observed symptoms and signatures can be used. For example, David Alan Ohsie, “Modeled Abductive Inference for Event Management and Correlation” (1998), Ph.D. Dissertation, Columbia University (hereinafter the “Ohsie thesis”), Section 5.2, provides a definition of such distance and several algorithms (some employing heuristics) that can determine how close observed symptoms and signatures are according to the defined distance.
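By way of illustration only, the mismatch count for a deterministic codebook might be sketched as follows. The codebook contents, the function names, and the max_distance threshold used to decide what counts as “sufficiently close” are assumptions made for this example; the sketch does not cover the probabilistic case or the algorithms of the Ohsie thesis.

```python
# Hypothetical deterministic codebook: problem -> signature over three symptoms.
CODEBOOK = {
    "P1": (1, 1, 0, 0),
    "P2": (0, 1, 1, 0),
    "P3": (1, 0, 0, 1),
}


def hamming_distance(signature, observed):
    """Count the positions at which the signature and the observation differ."""
    if len(signature) != len(observed):
        raise ValueError("signature and observation must have the same length")
    return sum(1 for s, o in zip(signature, observed) if s != o)


def closest_problems(observed, codebook, max_distance):
    """Return (problem, distance) pairs within max_distance, nearest first."""
    ranked = sorted(
        (hamming_distance(sig, observed), problem)
        for problem, sig in codebook.items()
    )
    return [(problem, dist) for dist, problem in ranked if dist <= max_distance]


# The observation matches no signature exactly; P1 and P2 each differ from it
# in one position, and P3 differs in three.
print(closest_problems((1, 1, 1, 0), CODEBOOK, max_distance=1))
```

Note that two signatures can be equally distant from the same observation, in which case the distance alone does not single out one problem.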
The symptoms generated by a system are usually detected by a subsystem generically referred to herein as instrumentation. Such instrumentation detects and relays events to a management system, and can include, e.g., hardware components and software components (such as agents) associated with elements of the system.
Sometimes the scheme used by the instrumentation to detect symptoms does not poll all possible events in the system. In this case, when a symptom is not reported, there is no easy way to know whether the symptom did not happen or whether the instrumentation failed to detect it. The symptom can thus be said to be in an “unknown” state, i.e., whether the symptom happened or not cannot be assessed. This can lead to an incorrect diagnosis of the root cause problem. Symptoms that are not in an “unknown” state are said to be in a “known” state.
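By way of illustration only, the following sketch shows how a single symptom in the “unknown” state can change the outcome of a Hamming-distance match. It merely enumerates the possible true values of the unknown symptom to expose the ambiguity; it is not presented as a technique for resolving it. The codebook, the use of None to mark an unknown symptom, and the function names are assumptions made for this example.

```python
from itertools import product

UNKNOWN = None  # hypothetical marker for a symptom in the "unknown" state

# Hypothetical codebook in which the two signatures differ only in the second symptom.
CODEBOOK = {
    "P1": (1, 1, 0),
    "P2": (1, 0, 0),
}


def hamming_distance(signature, observed):
    """Count the positions at which the signature and the observation differ."""
    return sum(1 for s, o in zip(signature, observed) if s != o)


def best_matches(observed, codebook):
    """Return the problems at minimum Hamming distance from a fully known observation."""
    distances = {p: hamming_distance(sig, observed) for p, sig in codebook.items()}
    smallest = min(distances.values())
    return sorted(p for p, d in distances.items() if d == smallest)


def diagnoses_over_completions(observed, codebook):
    """Map every possible filling-in of the unknown symptoms to its best match."""
    unknown_idx = [i for i, v in enumerate(observed) if v is UNKNOWN]
    results = {}
    for fill in product((0, 1), repeat=len(unknown_idx)):
        completed = list(observed)
        for i, value in zip(unknown_idx, fill):
            completed[i] = value
        results[tuple(completed)] = best_matches(completed, codebook)
    return results


# The second symptom was not polled, so its state is unknown.
for completed, problems in diagnoses_over_completions((1, UNKNOWN, 0), CODEBOOK).items():
    print(completed, problems)
```

In this example, treating the unpolled symptom as “0” selects P2, while the same observation with the symptom actually present selects P1, which illustrates how an unknown symptom can lead to an incorrect diagnosis.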