The invention relates generally to the field of event detection and fault diagnosis for computer systems and, more particularly but not by way of limitation, to techniques (devices and methods) for defining and using fault models for the monitoring, diagnosis and recovery of error conditions in an enterprise computing system.
Contemporary corporate computer networks comprise a plurality of different computer platforms and software applications interconnected through a number of different paths and various hardware devices such as routers, gateways and switches. Illustrative computer platforms include workstations, dedicated file, application and mail servers and mainframe computer systems. Illustrative software applications include accounting, payroll, order entry, inventory, shipping and database applications. The collection of such entities (hardware and software) is often referred to as an “enterprise.”
As enterprises have become larger and more complex, their reliability has become ever more dependent upon the successful detection and management of problems that arise during their operation. Problems can include hardware and software failures, hardware and software configuration mismatches and performance degradation due to limited resources, external attacks and/or loss of redundancy. Operational problems generate observable events, and these events can be monitored, detected, reported, analyzed and acted upon by humans or by programs. It has been observed that as an enterprise grows (i.e., incorporates more monitored components, both hardware and software), the rate at which observable events occur increases dramatically. (Some studies indicate event generation rates increase exponentially with enterprise size.) Quickly and decisively identifying the cause of any given problem can be further complicated by the large number of sympathetic events that may be generated as a result of one or more underlying problems. In the field of enterprise monitoring and management, the large number of sympathetic events generated as a result of one, or a few, underlying root cause failures is often referred to as an “alert storm.” For example, a router failure may generate a “router down” event and a large number of “lost connectivity” events for components that communicate through the failed router. In this scenario, the router failure is the fundamental or “root cause” of the problem and the lost connectivity events are “sympathetic” events. Studies have estimated that up to 80% of a network's down-time is spent analyzing event data to identify the underlying problem(s). This down-time represents enormous operational losses for organizations that rely on their enterprises to deliver products and services.
One prior art approach to enterprise diagnosis relies on user specified rules of the form: IF (CONDITION-A) AND/OR/NOT (CONDITION-B) . . . (CONDITION-N) THEN (CONDITION-Z). Known as a “rules-based” approach, these techniques monitor the enterprise to determine the state of all tested conditions (e.g., conditions A, B and N). When all of a rule's conditions are true, that rule's conclusion state is asserted as true (e.g., condition-Z). While this approach has the advantage of being easy to understand, it is virtually impossible to implement in any comprehensive manner for large enterprises. (The number of possible error states (i.e., combinations of conditions A, B . . . N) giving rise to a fault (e.g., a conclusion) grows exponentially with the number of monitored components.) In addition, rules-based approaches are typically tightly coupled to the underlying enterprise architecture such that any change in the architecture (e.g., the addition of monitored components) requires changes (e.g., the addition of rules) to the underlying rule-set. Further, in a dynamic environment where monitored components are added and/or removed on a weekly or daily basis, rules-based approaches become nearly impossible to implement in a controlled and reliable manner because of the overhead associated with creating and/or modifying rules for each component added and removing one or more rules for each component removed.
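By way of illustration only, the rules-based approach described above may be sketched as follows. The names Rule, evaluate_rules and possible_error_states are hypothetical and chosen for this example; for brevity, the sketch assumes that a rule's conditions are combined with AND only and that each condition has exactly two states.

```python
from typing import Dict, List, NamedTuple, Set


class Rule(NamedTuple):
    """Hypothetical rule: IF all conditions are true THEN the conclusion is true."""
    conditions: List[str]
    conclusion: str


def evaluate_rules(rules: List[Rule], state: Dict[str, bool]) -> Set[str]:
    """Return the conclusion of every rule whose conditions are all true.

    A condition that is absent from the monitored state is treated as false.
    """
    return {
        rule.conclusion
        for rule in rules
        if all(state.get(name, False) for name in rule.conditions)
    }


def possible_error_states(num_conditions: int) -> int:
    """Count the combinations of true/false values over the given conditions."""
    return 2 ** num_conditions


# IF (CONDITION-A) AND (CONDITION-B) THEN (CONDITION-Z)
rules = [Rule(["condition_a", "condition_b"], "condition_z")]
state = {"condition_a": True, "condition_b": True}
asserted = evaluate_rules(rules, state)
```

Under these assumptions, possible_error_states doubles with each added two-state condition, which is one way to see why a comprehensive rule-set is impractical for a large enterprise.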
Another prior art approach to enterprise diagnosis is known as “pattern matching.” This approach also uses rules but, unlike the rules-based analysis introduced above, allows two-way reasoning through a rule-set. For example, if a rule of the form IF (CONDITION-A) AND (CONDITION-B) THEN (CONDITION-C) exists and both condition-A and condition-C are known to be true (e.g., through monitoring or measurement), these systems can infer that condition-B must also be true. This, in turn, may allow the satisfaction of additional rules. While more powerful and flexible than standard rules-based systems, pattern matching systems, like rules-based systems, must have all (or a sufficient number) of their error states defined a priori and are difficult to maintain in a dynamic environment.
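The two-way reasoning described above may be sketched, under simplifying assumptions, as follows. The function name infer and the tuple representation of a rule are hypothetical. The sketch assumes AND-only rules and applies the backward step only when a rule's conclusion is known to be true and exactly one of its conditions remains unknown.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical rule representation: (conditions, conclusion), AND-only.
RuleT = Tuple[Tuple[str, ...], str]


def infer(rules: List[RuleT], known_true: Iterable[str]) -> Set[str]:
    """Reason forward and backward through the rule-set until nothing changes."""
    facts = set(known_true)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            missing = [c for c in conditions if c not in facts]
            if not missing and conclusion not in facts:
                # Forward step: all conditions hold, so assert the conclusion.
                facts.add(conclusion)
                changed = True
            elif conclusion in facts and len(missing) == 1:
                # Backward step: the conclusion holds and one condition is
                # unknown, so infer that the remaining condition holds.
                facts.add(missing[0])
                changed = True
    return facts


rules = [
    (("condition_a", "condition_b"), "condition_c"),
    (("condition_b",), "condition_d"),
]
facts = infer(rules, {"condition_a", "condition_c"})
```

In this example condition_b is inferred from the first rule, which in turn satisfies the second rule and asserts condition_d.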
Yet another prior art approach to enterprise diagnosis uses signed directed graphs (digraphs) to propagate enterprise variable status between different modeled nodes, where each node in such a model (a fault model) represents a condition of a modeled component in the enterprise. For example, a node in an enterprise fault model may represent that port-1 in router-A has failed or that Database Management Application-Z has less than a specified amount of working buffer memory available. Many digraph implementations use generic, class-level models of the monitored components in a classic object-oriented approach and, for this reason, are often referred to as Model Based Reasoning (MBR) techniques. Unlike rules-based systems, MBR systems provide a scalable method to identify and propagate detected error states to one or more “event correlation” engines. Event correlation engines, in turn, provide the computational support to analyze the myriad of event indications and to determine, based on preprogrammed logic, a likely root cause of the monitored events. Heretofore, the difficulty of correlating error states in a highly dynamic environment, and of accounting for the possibility of more than one simultaneous error or fault, has prevented MBR systems from performing up to their theoretical limits. Accordingly, it would be beneficial to provide improved event correlation and analysis techniques for MBR systems.
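For illustration only, a digraph fault model and a deliberately simple correlation step may be sketched as follows. This is not the technique of any particular MBR system: the function name candidate_root_causes, the node names and the unsigned adjacency-list representation are all assumptions made for the example. The sketch treats an observed condition as a candidate root cause when none of its direct causes in the graph has also been observed.

```python
from typing import Dict, List, Set


def candidate_root_causes(
    causes: Dict[str, List[str]], observed: Set[str]
) -> Set[str]:
    """Return observed nodes that have no observed direct cause.

    `causes` maps each cause node to the effect nodes it can produce.
    """
    # Invert the edges so each effect maps to its set of direct causes.
    parents: Dict[str, Set[str]] = {}
    for cause, effects in causes.items():
        for effect in effects:
            parents.setdefault(effect, set()).add(cause)
    return {
        node for node in observed
        if not (parents.get(node, set()) & observed)
    }


# Hypothetical fault model for the router failure example.
fault_model = {
    "router_a_down": [
        "host_1_lost_connectivity",
        "host_2_lost_connectivity",
    ],
}
observed_events = {
    "router_a_down",
    "host_1_lost_connectivity",
    "host_2_lost_connectivity",
}
root_causes = candidate_root_causes(fault_model, observed_events)
```

Here the two lost connectivity events are classified as sympathetic because their direct cause, the router failure, was also observed. A sketch this simple does not handle signed edges, unobserved intermediate nodes or multiple simultaneous faults.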