Today's enterprise systems use a collection of elements to provide business functions such as claims processing and customer orders. Examples of these elements are databases, web servers, and application servers. Collectively, these are referred to as resources. Resources typically have “sensors” that provide data values for metrics (e.g., request rate at a web server) and “effectors” that provide a way to modify their behavior (e.g., central processing unit (CPU) priority in an operating system). Service level management attempts to collect the data that provides a way to identify the appropriate settings for effectors so as to achieve desired service levels.
Central to service level management are tasks such as health monitoring to determine if the system is in a safe operating region, early detection of service level violations, and ongoing optimization of configurations to ensure satisfactory performance. All of these tasks require quantitative insights, preferably quantitative models that predict service level metrics such as response time. Compared to using rule-based policies to conduct the above tasks, having quantitative models can provide a more accurate representation of system behaviors and thus improve the quality of service level management.
Unfortunately, constructing such models requires specialized skills that are in short supply. Even worse, rapid changes in provider configurations and the evolution of business demands mean that quantitative models must be updated on an ongoing basis.
A variety of quantitative models are used in practice. For example, DB2 (available from IBM Corporation of Armonk, N.Y.) performance metrics may be related to response times using a model of the form y=b1x1+b2x2+ . . . +bnxn. Here, y is response time, the xi are DB2 resource metrics (e.g., sort time, total buffer pool read time), and the bi are constants estimated from the data using least-squares regression. Variable y is referred to as the “response variable” and the xi as the “explanatory variables.” Other examples of quantitative models include queueing network models (e.g., Leonard Kleinrock, “Queueing Systems,” Volume I, Wiley, 1975), neural network models (e.g., Simon Haykin, “Neural Networks,” Macmillan College, 1994), and nearest neighbors approaches (e.g., J. Aman et al., “Adaptive Algorithms for Managing a Distributed Data Processing Workload,” IBM Systems Journal, 36(2), 1997).
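The least-squares fit described above can be sketched as follows. This is a minimal illustration using synthetic data; the two explanatory metrics, the true coefficients, and the noise level are all assumptions made for the example, not values from any actual DB2 system.

```python
import numpy as np

# Synthetic observations: rows are measurement intervals, columns are
# hypothetical resource metrics (e.g., sort time, buffer pool read time).
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0.0, 10.0, size=(n, 2))         # explanatory variables x1, x2

# Generate response times from assumed coefficients b1=0.5, b2=1.2 plus noise.
true_b = np.array([0.5, 1.2])
y = X @ true_b + rng.normal(0.0, 0.1, size=n)   # response variable y

# Least-squares estimate of the coefficients b1..bn.
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_hat)  # close to [0.5, 1.2]
```

With enough observations, the estimated coefficients recover the assumed ones; in practice the coefficients are estimated from measured data and have no known “true” values to compare against.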
Many researchers have investigated the detection of service degradations. Central to this approach is modeling normal behavior. For example, R. A. Maxion, “Anomaly Detection for Diagnosis,” Proceedings of the 20th International Annual Symposium on Fault Tolerance, June, 1990, uses ad hoc models to estimate weekly patterns; P. Hoogenboom et al., “Computer System Performance Problem Detection Using Time Series Models,” Proceedings of the Summer USENIX Conference, 1993, employs more formal time series methods; and M. Thottan et al., “Adaptive Thresholding for Proactive Network Problem Detection,” IEEE Third International Workshop on Systems Management, April, 1998, uses techniques for detecting changes in networks that are leading indicators of service interruptions.
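The general pattern behind these approaches — estimate a periodic baseline of normal behavior, then flag observations that deviate from it — can be sketched as below. The daily (24-slot) cycle, the synthetic sinusoidal metric, and the three-sigma threshold are assumptions chosen for illustration; none of the cited works is limited to this form.

```python
import numpy as np

# Synthetic history of a metric with a daily cycle plus small noise.
rng = np.random.default_rng(1)
period = 24                                   # assumed cycle length (hours)
history = np.tile(np.sin(np.linspace(0, 2 * np.pi, period)), 30)
history = history + rng.normal(0.0, 0.05, history.size)

# Baseline of "normal behavior": mean and standard deviation per slot.
by_slot = history.reshape(-1, period)
mean = by_slot.mean(axis=0)
std = by_slot.std(axis=0)

def is_anomalous(value, slot, k=3.0):
    """Flag an observation that falls outside mean +/- k*std for its slot."""
    return abs(value - mean[slot]) > k * std[slot]

print(is_anomalous(mean[5], 5))        # typical value -> False
print(is_anomalous(mean[5] + 1.0, 5))  # large deviation -> True
```

More formal time series methods replace the per-slot mean/deviation baseline with a fitted model, but the detection step — compare the observation to the model's prediction — is the same.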
Further, statistical process control (SPC) charts are widely used for quality control in manufacturing to detect shifts in a process as measured by one or more appropriate metrics. These techniques have been applied to computing systems to track critical metrics (e.g., J. McConnell et al., “Predictive Analysis: How Many Problems Can We Avoid?,” Networld+Interop, Las Vegas, 2002). However, none of these approaches employs on-line model construction.
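A Shewhart-style individuals chart is one common form of SPC chart. The sketch below estimates control limits (center line plus or minus three standard deviations) from an in-control baseline window and flags samples outside the limits; the response-time units and baseline distribution are assumptions for illustration.

```python
import numpy as np

# In-control baseline observations of a metric (e.g., response time in ms).
rng = np.random.default_rng(2)
baseline = rng.normal(100.0, 5.0, 500)

# Control limits from the baseline: center line +/- three standard deviations.
center = baseline.mean()
sigma = baseline.std()
ucl = center + 3 * sigma                  # upper control limit
lcl = center - 3 * sigma                  # lower control limit

def out_of_control(samples):
    """Indices of samples falling outside the control limits."""
    samples = np.asarray(samples)
    return np.nonzero((samples > ucl) | (samples < lcl))[0]

print(out_of_control([101.0, 99.0, 150.0]))   # -> [2]
```

Note that the limits here are fixed once computed from the baseline; nothing in this scheme constructs or revises a model of the metric on-line, which is the limitation the text identifies.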
Still further, R. Isermann et al., “Process Fault Diagnosis Based on Process Model Knowledge,” Proceedings of 1989 ASME International Computers in Engineering Conference and Exposition, July, 1989, uses knowledge of the functional relationship between inputs and outputs to detect changes in system operation. However, the R. Isermann et al. work does not address how to identify a small set of explanatory variables.
Several companies market products that aid in constructing performance policies. For the most part, the techniques employed are based on the distribution of individual variables, not relationships to response variables. One exception is correlation analysis, which uses cross correlation to identify the most important explanatory variables. However, this approach does not model the response variable. Thus, many redundant variables may be included. Further, all of the existing work assumes that the set of potential explanatory variables is known a priori rather than discovered on-line.
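The redundancy problem described above can be made concrete. In the sketch below, each candidate metric is scored independently by its absolute cross correlation with the response; the variable names and synthetic data are assumptions for illustration. Because x2 is nearly a copy of x1, both score highly, and a ranking based on correlation alone would retain both even though x2 adds no information about the response.

```python
import numpy as np

# Synthetic candidate explanatory variables and a response driven by x1 only.
rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0.0, 0.05, n)      # nearly a copy of x1 (redundant)
x3 = rng.normal(size=n)                 # unrelated to the response
y = 2.0 * x1 + rng.normal(0.0, 0.1, n)  # response variable

# Score each candidate independently by |correlation with y|.
for name, x in [("x1", x1), ("x2", x2), ("x3", x3)]:
    r = abs(np.corrcoef(x, y)[0, 1])
    print(name, round(r, 2))
# x1 and x2 both score near 1.0; x3 scores near 0.0.
```

A model of the response variable (e.g., the regression discussed earlier) would reveal that x2 contributes nothing once x1 is included, which is precisely what per-variable correlation scoring cannot detect.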
Thus, a need exists for improved service level management techniques that overcome these and other limitations.