A critical element of operations and management is managing performance problems, such as addressing long response times in client-server systems and low throughputs for nightly database updates. Such considerations require mechanisms for detecting, diagnosing, and resolving performance problems. Detection uses one or more measurement variables to sense when a problem occurs, such as using on-line change-point detection algorithms to sense changes in client-server response times. Diagnosis isolates problems to specific components so that appropriate actions can be identified, such as attributing large client-server response times to excessive LAN utilizations. Resolution selects and implements actions that eliminate the problem, such as increasing LAN capacity or reducing LAN traffic.
Diagnosis is done for a target system, which may be an individual computer, a network of computers, or a combination of the two. Diagnosis requires extracting measurement data. Some of these data are collected during the time when a performance problem is present. These are referred to as problem data. Additional data may be used as well to obtain reference values for measurement variables. These are called reference data. Reference data may be measurements of the target system when no performance problem is present, values summarized from such measurements (e.g., distribution quantiles), or values obtained from manufacturer specifications (e.g., disk access times).
The present invention addresses quantitative performance diagnosis (QPD). A quantitative performance diagnosis consists of a set of explanations and a quantification of their importance. Ideally, this quantification takes the form of fractional contributions to the performance problem. For example, a QPD might attribute 5% of the performance problem to the explanation that the 30% increase in web server traffic accounts for 90% of the increase in LAN utilization, which in turn accounts for 20% of the increase in client-server response times.
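As a rough illustration of the arithmetic in the example above, the 5% figure is consistent with multiplying the three percentages quoted (30%, 90%, and 20%). That reading is inferred from the numbers rather than stated in the text, and the function name below is hypothetical.

```python
def chain_contribution(fractions):
    """Multiply the fractions along one chain of explanations.

    Assumption for illustration: the fractions compose multiplicatively.
    """
    result = 1.0
    for fraction in fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("each fraction must lie between 0 and 1")
        result *= fraction
    return result


# Percentages quoted in the example: web server traffic increase (30%),
# its share of the LAN utilization increase (90%), and the LAN's share
# of the response-time increase (20%).
example_chain = [0.30, 0.90, 0.20]
share = chain_contribution(example_chain)
print(f"{share:.1%}")  # 5.4%, which rounds to the 5% quoted above
```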
Two benefits accrue from employing quantitative rather than qualitative diagnosis. First, the importance of factors affecting performance can be ranked. With ranking, analysts and administrators can quickly focus on the most important problem causes. Second, quantitative information is often needed to specify not only the actions, but also the amount of the actions (e.g., by how much must the LAN capacity be increased, or by how much LAN traffic must be reduced) required to resolve the performance problems.
Vendors of diagnostic systems can increase their code reuse (and hence their profits) by employing an architecture that enables their diagnostic system to adapt for the diagnosis of many target systems. Examples of target systems include: operating systems (e.g., Windows 95, AIX, and MVS), database management systems (e.g., DB/2, Sybase, Informix), collaborative middleware (e.g., Lotus Notes and Netscape web browsers), and communications software (e.g., TCP/IP, APPC, SNA). Ideally, the diagnostic system is structured so that only a small, isolated fraction of code need be modified to handle new target systems.
Better still is an architecture that allows end-users of the diagnostic system to represent special characteristics of their environment, such as installation-specific measurement data that can be employed in QPD. Meeting this requirement is more demanding since the diagnostic system must be structured so that representations of the target system are externalized in a way that permits end-user modification without compromising the effectiveness of the vendor's software.
Numerous applications have been developed in the broad area of interpreting data (e.g., U.S. Pat. No. 5,598,511 of Petrinjak et al.) and more specifically for diagnosing performance problems in computer and communications systems. The most common approach employs hand-crafted if-then rules (e.g., U.S. Pat. No. 5,636,344 of Lewis). Systems using customized if-then rules produce qualitative diagnoses, such as "large response times are due to excessive LAN utilization." This approach requires that knowledge of the target system be embedded in the diagnostic system itself.
When the diagnosis engine embeds the choice of diagnostic technique, the design makes the diagnostic system more difficult to adapt to new target systems, and the design limits the extent to which end-users can customize the diagnostic system.
Ease of adaptation and end-user customization is improved by employing an external representation of the target system. Some applications have attempted an external architecture by externalizing the if-then rules employed. For example, U.S. Pat. No. 5,261,086 of Shiramizu et al. provides know-how and declarative rules; U.S. Pat. No. 5,412,802 of Fujinami et al. employs cases, rules, and heuristics; and U.S. Pat. No. 5,428,619 of Schwartz et al. has components for network topology, cases, and a diagnosis tree. Related work in fault diagnosis considers external models of the system, such as failure modes, qualitative models of system behavior, and fault trees. It is to be noted that none of the aforementioned systems employs external representations that are sufficient to provide quantitative performance diagnosis.
Analytic techniques are used for fault detection and diagnosis in technical processes wherein system behavior is expressed in terms of a set of equations. The approach entails: (a) computing the difference between expected and observed values of variables (referred to as residuals) and (b) employing techniques to assess the significance of the residuals. While this analytic approach employs an external representation of the target system, its application to performance diagnosis has a fundamental deficiency, in that the methodology does not quantify contributions to performance degradation. Instead, the prior art analytic approach assesses the probability of a binary event such as the failure of a component.
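A minimal sketch of the two-step residual approach described above, and of why its output is a yes/no flag rather than a quantified contribution. The variable names, the sample numbers, and the three-standard-deviation test are all assumptions made for illustration; the text does not specify a particular significance technique.

```python
import statistics


def compute_residuals(observed, expected):
    """Step (a): residual = observed value minus the value the model predicts."""
    return {name: observed[name] - expected[name] for name in observed}


def is_significant(residual, reference_residuals, threshold=3.0):
    """Step (b): flag a residual larger than `threshold` standard deviations
    of the residuals seen when no problem was present (assumed test)."""
    sigma = statistics.stdev(reference_residuals)
    return abs(residual) > threshold * sigma


# Hypothetical values: the model equations predict 40% LAN utilization,
# but 65% is observed during the problem period.
expected = {"lan_utilization": 0.40}
observed = {"lan_utilization": 0.65}
reference_residuals = [-0.02, 0.01, 0.03, -0.01, 0.00]

res = compute_residuals(observed, expected)
flag = is_significant(res["lan_utilization"], reference_residuals)
print(res, flag)
```

The result is a binary decision per variable. Nothing in it says what fraction of a response-time increase the flagged variable accounts for, which is the deficiency noted above.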
Some have approached QPD by using custom-tailored analytic models. However, doing so results in the same kind of system dependencies as with hand-crafted rules. Another approach is to externalize quantitative relationships present in the target system. To date, QPD algorithms taking this approach have employed representations of the target system that consist of tree-structured algebraic relationships between measurement variables.
There are several drawbacks to the existing systems that employ tree-structured algebraic relationships. Firstly, the specifics of the art limit its applicability for QPD (e.g., a system developed for application in explaining financial models will be limited to the domain of financial planning). Secondly, tree-structured representations of target systems are quite restrictive. In particular, multi-parent dependencies arise when shared services are present, thereby necessitating a directed acyclic graph (DAG) that cannot be represented as a tree. Shared services are a key element of modern computer and communications systems. Examples include having multiple clients share a server process and having multiple server processes share host resources such as communications buffers. Extending tree-structured representations to ones that employ directed acyclic graphs is not straightforward for QPD. In particular, care must be taken to avoid double-counting the contributions of measurement variables (since there may be multiple paths from the detection variable to leaves in the graph).
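The multi-path issue can be sketched with a small hypothetical graph in which two server processes share one host resource. The graph, the edge fractions, and the accumulation scheme below are assumptions chosen for illustration, not a method described in the text. Each edge carries the fraction of the parent's change attributed to the child, and each leaf is credited once per path with only that path's share.

```python
from collections import defaultdict

# Hypothetical DAG: edges[parent] = {child: fraction of the parent's change
# attributed to that child}. "host_buffers" has two parents (shared service).
edges = {
    "response_time": {"server_a": 0.6, "server_b": 0.4},
    "server_a": {"host_buffers": 0.5, "cpu_a": 0.5},
    "server_b": {"host_buffers": 1.0},
}


def leaf_contributions(root):
    """Accumulate, per leaf, the product of edge fractions over every path."""
    totals = defaultdict(float)

    def walk(node, weight):
        children = edges.get(node)
        if not children:  # a leaf: credit only this path's share
            totals[node] += weight
            return
        for child, fraction in children.items():
            walk(child, weight * fraction)

    walk(root, 1.0)
    return dict(totals)


contributions = leaf_contributions("response_time")
for name, value in sorted(contributions.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.0%}")
```

Because each path contributes only its own share, the leaf totals add up to the whole problem. A scheme that instead credited the shared leaf with its full effect along every path would count that effect twice.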
Finally, a third drawback to prior art diagnosis systems relates to quantifying the effects of performance problems on measurement variables. Such quantification is an essential part of QPD. Several diagnostic techniques have been proposed:
(a) finding the variable with the largest value;
(b) finding the variable whose value changed the most; or
(c) finding the variable that has the largest absolute value for its cross correlation with another variable.
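The three techniques listed above can be sketched as follows. The measurement names and sample values are hypothetical, and using the mean over the problem period as a variable's "value" is an assumption made for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical problem data (samples taken while the problem is present).
problem = {
    "cpu_util": [0.50, 0.52, 0.51, 0.53],
    "lan_util": [0.40, 0.55, 0.70, 0.85],
    "disk_util": [0.90, 0.89, 0.91, 0.90],
}
# Hypothetical reference data (mean values when no problem is present).
reference = {"cpu_util": 0.50, "lan_util": 0.30, "disk_util": 0.88}
# Hypothetical detection variable sampled at the same times.
response_time = [1.0, 1.6, 2.1, 2.8]

# (a) the variable with the largest value
largest_value = max(problem, key=lambda v: mean(problem[v]))
# (b) the variable whose value changed the most relative to reference data
largest_change = max(problem, key=lambda v: abs(mean(problem[v]) - reference[v]))
# (c) the variable with the largest absolute cross correlation with another
#     variable (here, the detection variable)
largest_corr = max(problem, key=lambda v: abs(pearson(problem[v], response_time)))

print(largest_value, largest_change, largest_corr)
```

With these sample values, technique (a) selects a different variable than (b) and (c), so the choice of technique can change the diagnosis.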
Weighted graphs are widely used in computer systems, such as for use in determining routing distances. Further, many have discussed the construction of algorithms employing graphical representations of systems and weighted graphs. In addition, existing algorithms for qualitative performance diagnosis employ graph representations. However, to those skilled in the art, it is not obvious how to transform quantitative performance diagnosis into an algorithm involving the navigation of a weighted graph.
What is needed, therefore, is a diagnosis engine that uses external representations of diagnostic techniques to control the quantification of performance problems on measurement variables.