A critical element of computer operations and data management is managing performance problems, such as resolving long response times in client-server systems and low throughputs for nightly database updates. Such considerations require mechanisms for detecting, diagnosing, and resolving performance problems. The present invention addresses diagnosis of computer performance problems.
Often, diagnosis proceeds in two phases. The first is problem isolation in which the problem is localized, such as to a component (e.g., router), user, and/or time. The second phase is root-cause analysis in which the underlying cause of the problem is determined, such as identifying a business application that uses high-overhead operating system services.
Ideally, both phases of diagnosis are automated. Unfortunately, it is difficult in practice to automate root-cause analysis in that such analysis requires a detailed knowledge of the system being analyzed. The requisite knowledge is often difficult to acquire, represent, and maintain.
In contrast, problem isolation can be approached in a more general way. Consider a computer installation in which users are experiencing long response times. A commonly used approach to problem isolation is to structure measurements based on abstraction hierarchies. Each hierarchy consists of multiple levels, and within levels there are abstraction instances. Problem isolation consists of repeatedly: (a) selecting levels within abstraction hierarchies that best characterize the problem and then (b) focusing on those abstraction instances that most evidence the problem within the levels selected.
To illustrate the concept of abstraction hierarchies and to demonstrate problem isolation based on abstraction hierarchies, a running example is introduced. The example considers response time problems for which there are three abstraction hierarchies: time, configuration element, and workload. Within the time hierarchy, there are levels for shift, hour, and minute. Within shift, the abstraction instances are 1, 2, and 3. Within hour are abstraction instances of hours for a shift. For example, shift 1 has the hours 8, 9, 10, 11, 12, 1, 2, 3, and 4. Within minute are the minute values within an hour for a shift. Similarly, the configuration element hierarchy has the levels subnet and host; the workload hierarchy has levels division, department, user, and transaction.
To demonstrate problem isolation using abstraction hierarchies, the following scenario is considered for the running example, with reference to FIG. 2:
Step 1: The analyst computes average response times for the highest level in each abstraction hierarchy. That is, the analyst computes average response times for each: (a) shift (for the time hierarchy), (b) subnet (for the configuration element hierarchy), and (c) division (for the user hierarchy).
Step 2: The analyst makes a judgment as to which abstraction hierarchy best isolates the performance problem. In the running example, the analyst selects the hierarchy with the largest range of response time values, which is the configuration element hierarchy.
Step 3: The analyst makes a judgment as to which instances in the abstraction hierarchy selected in Step 2 best localize the problem. In the running example, the analyst selects the instance with the largest value, which is 9.2.15.
Step 4: The analyst repeats the foregoing until problem isolation has been completed.

configuration element: subnet=9.2.15
time: shift=1, hour=8
user: division=22, department=MVXD, user=ABC

shift=1, hour=8, minute=1
shift=1, hour=8, minute=2
shift=1, hour=8, minute=3
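Steps 1 through 3 of the scenario can be sketched in Python as follows. This is only an illustrative sketch: the measurement rows, the `TOP_LEVELS` mapping, and the function names are hypothetical, and the selection rules (largest range of averages, then largest average) are simply the judgments the analyst makes in the running example.

```python
from collections import defaultdict

# Hypothetical measurements: (shift, subnet, division, response_time_seconds).
ROWS = [
    (1, "9.2.15", 22, 9.0),
    (1, "9.2.15", 23, 7.0),
    (2, "9.2.16", 22, 1.0),
    (2, "9.2.17", 23, 2.0),
    (3, "9.2.16", 22, 1.5),
    (3, "9.2.17", 23, 1.5),
]

# Column index of the highest level in each abstraction hierarchy (assumed layout).
TOP_LEVELS = {"time": 0, "configuration element": 1, "workload": 2}


def average_by_instance(rows, column):
    """Step 1: average response time for each abstraction instance in a column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    return {instance: sum(v) / len(v) for instance, v in groups.items()}


def value_range(averages):
    """Spread between the largest and smallest average in one hierarchy."""
    return max(averages.values()) - min(averages.values())


def isolate_once(rows):
    """Steps 2 and 3: pick the hierarchy with the widest range, then its worst instance."""
    averages = {h: average_by_instance(rows, c) for h, c in TOP_LEVELS.items()}
    hierarchy = max(averages, key=lambda h: value_range(averages[h]))
    instance = max(averages[hierarchy], key=averages[hierarchy].get)
    return hierarchy, instance


print(isolate_once(ROWS))
```

With the hypothetical rows above, the configuration element hierarchy has the widest range of averages, and subnet 9.2.15 has the largest average within it. Step 4 would repeat the same selection on the rows restricted to the chosen instance.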
In support of the foregoing scenario, measurement data are often placed into a relational database, which allows analysts to use general purpose reporting and analysis tools. A relational database structures data into tables. The columns or attributes of a table specify information that is present in each row of the table. In the running example, a single table with a column for each level in an abstraction hierarchy is provided, specifically: shift, hour, minute, subnet, host, division, department, user, and transaction.
Although the relational model offers advantages in analysis, it does not directly support abstraction hierarchies. This drawback motivates the structuring of data into a multidimensional database (MDDB), sometimes referred to as On-Line Analytical Processing (OLAP). An MDDB provides support for abstraction hierarchies and for navigations between levels in abstraction hierarchies. MDDBs have been implemented in many ways (e.g., R. Agrawal, A. Gupta, and S. Sarawagi, "Modeling Multidimensional Databases," International Conference on Data Engineering, 1997, pp. 232-242; R. Kimball, The Data Warehouse Toolkit, John Wiley and Sons, 1996; and SAS, "SAS/EIS" Software, http://www.sas.com/software/components/eis.html, 1998).
For expository convenience, an MDDB is viewed as a layer on top of a relational database (as in SAS, 1998). An MDDB cube (or slice) specifies a subset of the data within a relational table. Subcubes are structured so as to facilitate the aggregations needed to support abstraction hierarchies. Specifically, a subset of the attributes of the underlying relational table is partitioned into dimensions. Dimensions correspond to the abstraction hierarchies. Attributes within a dimension are arranged hierarchically, which corresponds to levels in the abstraction hierarchies.
A distinction is made between two kinds of dimensions. The first kind is the category dimension. A category dimension consists of category attributes that qualify the nature of what is measured (e.g., system, time, user). Category attributes are typically strings, time values, or integers. The second kind is the metric dimension. A metric dimension is composed of metric attributes that provide measures of interest (e.g., response times, waits, throughputs). An MDDB schema has one metric dimension and zero or more category dimensions. In the running example, the category dimensions are configuration element, time, and workload. The metric dimension consists only of response time, although there could be other attributes as well, such as wait times and service times (which are components of response times).
A cube is described by an MDDB tuple, which consists of a coordinate vector for each dimension in an MDDB schema. For the metric dimension, the coordinate vector only specifies the level considered (e.g., wait times). For a category dimension, the coordinate vector specifies the abstraction instances used for levels in the dimension hierarchy. For example, a coordinate vector for the time dimension in the running example is: shift=1, hour=8. Note that: (a) not all levels need have an abstraction instance specified, and (b) an abstraction instance is specified at level N in the hierarchy only if an abstraction instance is also specified at level N-1. An MDDB tuple contains all the coordinate vectors for each dimension.
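The rule in note (b), that an abstraction instance may be specified at a level only if one is also specified at the level above it, can be checked mechanically. The following is a minimal sketch; representing a coordinate vector as a dictionary and the name `is_valid_coordinate_vector` are assumptions made for illustration.

```python
# Levels of the time dimension in the running example, highest level first.
TIME_LEVELS = ("shift", "hour", "minute")


def is_valid_coordinate_vector(levels, coords):
    """Return True if the specified levels form a prefix of the hierarchy.

    levels: tuple of level names, highest level first.
    coords: dict mapping a level name to its abstraction instance.
    """
    if any(level not in levels for level in coords):
        return False  # refers to a level that is not in this dimension
    specified = [level in coords for level in levels]
    # Valid only if every specified level precedes every unspecified one.
    return specified == sorted(specified, reverse=True)


print(is_valid_coordinate_vector(TIME_LEVELS, {"shift": 1, "hour": 8}))
print(is_valid_coordinate_vector(TIME_LEVELS, {"hour": 8}))
```

The first call describes a valid vector (shift and hour specified, minute left open); the second is rejected because hour is specified without shift. An empty dictionary is accepted, reflecting note (a) that not all levels need have an instance specified.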
The value of a cube is obtained by querying the underlying relational data and computing an aggregate value of the metric attribute (e.g., average response time). The rows obtained in the query are determined by the category coordinates, and the aggregation function and values used are determined by the metric coordinate. Details on how to construct such a query are contained in the aforementioned Kimball reference.
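One way such a query might be constructed is sketched below using Python's standard `sqlite3` module. The table name `measurements`, the column name `response_time`, the sample rows, and the function names are all hypothetical; the sketch only illustrates that the category coordinates become the row filter and the metric coordinate selects the aggregated column.

```python
import sqlite3

CATEGORY_COLUMNS = ("shift", "hour", "minute", "subnet", "host",
                    "division", "department", "user", "transaction")
METRIC_COLUMNS = ("response_time",)
AGGREGATES = {"AVG", "MIN", "MAX", "SUM", "COUNT"}


def cube_value(conn, coordinates, metric="response_time", aggregate="AVG"):
    """Aggregate the metric over the rows selected by the category coordinates."""
    unknown = set(coordinates) - set(CATEGORY_COLUMNS)
    if unknown or metric not in METRIC_COLUMNS or aggregate not in AGGREGATES:
        raise ValueError("unsupported coordinate, metric, or aggregate")
    # Column names are quoted because some (e.g. transaction) are SQL keywords.
    where = " AND ".join(f'"{col}" = ?' for col in coordinates)
    sql = f'SELECT {aggregate}("{metric}") FROM measurements'
    if where:
        sql += f" WHERE {where}"
    return conn.execute(sql, tuple(coordinates.values())).fetchone()[0]


def demo_connection():
    """Build an in-memory table with a few made-up measurement rows."""
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{c}"' for c in CATEGORY_COLUMNS + METRIC_COLUMNS)
    conn.execute(f"CREATE TABLE measurements ({columns})")
    rows = [
        (1, 8, 1, "9.2.15", "h1", 22, "MVXD", "ABC", "t1", 4.0),
        (1, 8, 2, "9.2.15", "h1", 22, "MVXD", "ABC", "t1", 6.0),
        (1, 9, 1, "9.2.16", "h2", 22, "MVXD", "XYZ", "t2", 1.0),
    ]
    conn.executemany("INSERT INTO measurements VALUES (?,?,?,?,?,?,?,?,?,?)", rows)
    return conn


conn = demo_connection()
print(cube_value(conn, {"subnet": "9.2.15", "shift": 1, "hour": 8}))
```

Values are passed as bound parameters, while column and function names are checked against fixed lists before being placed in the SQL text.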
"Drill-down" is an operation performed on a cube for one or more dimensions in the MDDB schema of that cube. Drill-down produces new cubes that have: (a) the same coordinates as the original cube for the non-drill-down dimensions and (b) a longer coordinate vector in the drill-down dimension due to the fact that an abstraction instance is specified for the next lower attribute in the drill-down dimension.
Drill-down is illustrated using the running example. Consider the cube with the coordinate vectors:
A drill-down in the time dimension results in a set of cubes, each of which has the same coordinate vectors as the foregoing in the configuration element and user dimensions. Examples of the resulting time coordinates are:
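The drill-down just described can be sketched as follows, using the running example's cube (subnet=9.2.15; shift=1, hour=8; division=22, department=MVXD, user=ABC). The nested-dictionary cube representation, the sample rows, and the function name `drill_down` are assumptions made for illustration.

```python
# Levels of each dimension, highest level first (from the running example).
HIERARCHIES = {
    "configuration element": ("subnet", "host"),
    "time": ("shift", "hour", "minute"),
    "user": ("division", "department", "user", "transaction"),
}


def drill_down(cube, dimension, rows):
    """Return one new cube per instance of the next lower level in `dimension`."""
    levels = HIERARCHIES[dimension]
    depth = len(cube[dimension])
    if depth == len(levels):
        return []  # already at the lowest level; nothing to drill into
    next_level = levels[depth]
    # Flatten all coordinate vectors into one level -> instance filter.
    flat = {k: v for coords in cube.values() for k, v in coords.items()}
    matching = [r for r in rows if all(r[k] == v for k, v in flat.items())]
    instances = sorted({r[next_level] for r in matching})
    # Other dimensions keep their coordinates; the chosen one grows by one level.
    return [{**cube, dimension: {**cube[dimension], next_level: i}} for i in instances]


base = {"subnet": "9.2.15", "shift": 1, "hour": 8,
        "division": 22, "department": "MVXD", "user": "ABC"}
ROWS = [{**base, "minute": m} for m in (1, 2, 3)]  # hypothetical measurements

CUBE = {
    "configuration element": {"subnet": "9.2.15"},
    "time": {"shift": 1, "hour": 8},
    "user": {"division": 22, "department": "MVXD", "user": "ABC"},
}

for new_cube in drill_down(CUBE, "time", ROWS):
    print(new_cube["time"])
```

Each resulting cube carries the original configuration element and user coordinates unchanged, and its time coordinate vector is one level longer.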
Another operation on subcubes is roll-up, which is the inverse of drill-down. Roll-up takes a cube and dimension as arguments to produce a cube. To illustrate, consider a roll-up on a cube in the time dimension, with time coordinates shift=1, hour=8, minute=3. The new cube has the same non-time coordinates as the original one. The time coordinates are shift=1, hour=8.
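Roll-up can be sketched in the same style. The cube representation (a dictionary of coordinate vectors, each an insertion-ordered dictionary from level to instance) and the name `roll_up` are assumptions made for illustration.

```python
def roll_up(cube, dimension):
    """Return a new cube with the lowest specified level of `dimension` removed."""
    coords = cube[dimension]
    if not coords:
        raise ValueError("dimension has no specified level to roll up")
    # Python dictionaries preserve insertion order, so the last item is the lowest level.
    shortened = dict(list(coords.items())[:-1])
    return {**cube, dimension: shortened}


CUBE = {
    "configuration element": {"subnet": "9.2.15"},
    "time": {"shift": 1, "hour": 8, "minute": 3},
}

print(roll_up(CUBE, "time")["time"])
```

The non-time coordinates are carried over unchanged, and the original cube is left unmodified because a new dictionary is returned.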
Having data structured as an MDDB facilitates the scenario in the running example by providing navigations between levels in abstraction hierarchies. To illustrate, consider Step 1 in the scenario. Here, the analyst computes aggregate values for the abstraction instances within each abstraction hierarchy. This corresponds to doing a drill-down on each dimension, obtaining a value for each cube, and computing a summary statistic of the values.
While an MDDB facilitates problem isolation, it does not automate problem isolation. Indeed, today problem isolation is largely a manual task. Typically, problem isolation requires analysts to do a visual inspection of summary data. For example, in Step 2 of the scenario above, the analyst compares summaries done for different dimensions.
Existing art has severe deficiencies in the area of general-purpose automation of problem isolation, especially problem isolation based on data structured as an MDDB. Expert system diagnostic tools (e.g., B. Domanski, "A PROLOG-based Expert System for Tuning MVS/XA," Proceedings of the Computer Measurement Group, 160-166, 1987 and M. Arinze et al., "A Knowledge Based Decision Support System for Computer Performance Management," Decision Support Systems, Volume 8, 501-515, 1992) provide system-specific analyses of measurement data. While they sometimes provide root-cause analysis, these tools do not provide general-purpose automation of problem isolation in that they are difficult to adapt to the analysis of other systems and to changes in the system being analyzed. An alternative is to employ tools for navigating MDDB structured data (e.g., R. F. Berry and J. L. Hellerstein, "A Flexible and Scalable Approach to Navigating Measurement Data in Performance Management Applications," Second IEEE Conference on Systems Management, Toronto, Canada, June, 1996; C. Sriram, P. Martin, and W. Powley, "A Data Warehouse for Performance Management of Distributed Systems," Third IEEE Conference on Systems Management, Newport, Rhode Island, April, 1998; and the aforementioned SAS reference). However, these tools do not support automated navigation of performance data. Still another approach is to use MDDB navigations in acquiring knowledge of problem isolation (as the aforementioned Hellerstein reference does for data navigations in general). While this art automates the search for navigations done previously, it does not automate the selection of new navigations for problem isolation.
It is therefore an objective of the present invention to provide automated problem isolation in computer systems.
It is another objective of the invention to provide automated problem isolation in systems which have measurement data structured as a multidimensional database.
Yet another objective of the invention is to provide a method which uses external specifications of scoring functions, thereby allowing users to adapt the method to their needs.
Still another objective is to provide problem isolation which can automatically select new data navigation paths for problem isolation.