Administrators are often interested in measuring the resources a task, such as a statement or database query, consumes. For example, administrators like to know how much central processing unit (CPU) time a query uses. If a server's CPU utilization is high, administrators would like to know which queries use the most CPU time. Query resource consumption is not possible to measure directly in many cases.
Described with reference to the example above, such a problem can be more generally expressed as follows: given many samples of several independent variables x1 . . . xn and a single dependent variable y observed from a system (such as a database server), coefficients c1 . . . cn can be calculated such that the equation y=c1x1+ . . . +cnxn describes the system's observed behavior as closely as possible. Each independent x variable is a query/task class's execution time, and y is the observed CPU time or other metric of interest. A coefficient for each class of queries can then be computed. When multiplied by each sample of the query class's accumulated execution time, a value for that query class's CPU time can be obtained. When summed together, these values should approximate as closely as possible the observed CPU time.
However, such an attempt at describing the problem can be a simplification of linear regression. Linear regression attempts to fit a hyperplane through a multidimensional sample space while minimizing the sum of squared errors. The equation that describes the hyperplane is similar to the equation y=c1x1+ . . . +cnxn. Each dimension of the hyperplane is described by a slope, and there is a single intercept. The simplest case of a hyperplane with only one independent variable is a line described by the equation y=ax+b, where a is the slope of the line and b is the y-intercept. Linear regression fits the equation to the sample data and produces a slope and intercept for each dimension, along with an estimated standard error for the slope and intercept.
For example, the result of a simple least-squares linear regression against the values in Table 1 is the equation y=4.032x+31.602. The dataset and this line can be seen in FIG. 1.
TABLE 1Sample dataset to illustrate regressionx y1293145187623145953When, however, multiple independent variables are present, an assumption can be made that a cause-and-effect relationship between the variables may exist, and that some of the independent variables may not directly influence the dependent variable. To compensate for this assumption, multiple linear regression examines not only the correlation between the independent variables and the dependent variable, but also that of each pair of independent variables. The number of correlations computed in multiple linear regression is thus n(n−1), which is O(n2) in the number of independent variables. In other words, multiple linear regression is computationally expensive with a large number of variables (classes of queries). But the reality of queries executing in a database server is that each query consumes its own CPU time, because a query is a program with CPU instructions.
Multiple linear regression also tends to produce poor results on large numbers of independent variables, because independent variables with large contributions to the dependent variable are more statistically significant than minority contributors. In fact, the few most significant independent variables usually explain most of the results, and trying to include all independent variables in the computation typically produces a “worse fit” than discarding all but the top few.
It is also common for some classes of tasks to consume resources more or less constantly, and for occasional resource-intensive tasks to dominate. Administrators do not want results to be rendered useless by such inevitable outliers. This observation about a drawback of multiple linear regression: each independent variable is assessed in terms of its contribution to the dependent variable. That is, linear regression computes the likely relationship between each x value and the resulting y value. The underlying mechanism of regression is that correlations between independent and dependent variables (i.e., coincidences) are likely to be meaningful; the method essentially treats correlation as causation, adjusting coefficients to find a best-fit given the presence of inputs that support or reject this hypothesis.
Accordingly, a need exists for devices and apparatuses for system monitoring that more accurately estimate performance of a processing system executing one or more tasks.