Large data processing systems, such as web serving systems, include a multitude of hardware and software components that interact in complex ways. Such components include several tiers of execution units, web application environments, and databases. Performance modeling and evaluation involves building a queuing model of the computer system as a whole, characterizing the workload submitted to the system, and analyzing the queuing model under that workload model to obtain performance measures.

Modeling a computer system involves hardware components as well as software components. The hardware components include processing units (CPUs), data storage units (RAM and disks), and communication channels. These hardware components are resources shared by concurrent tasks executing in the system. When a task needs a resource that is not available, the task waits in a queue until the resource becomes available. The interconnection of resources, together with their multiplicities, capacities, and queuing disciplines, forms the basis for a queuing model of the system hardware. In addition, there are software resources, such as threads of execution, database locks, and communication connections. As with hardware resources, tasks use software resources and queue for them when they are unavailable, so queuing models of the system software can be built in the same way. An overall system model combines both hardware and software components.

The users of system resources, hardware or software, are tasks generated by requests, as in an interactive workload, or by job submissions, as in a batch or long-running workload. Different types of workload exhibit different behavior with respect to the amount of resources they need and the usage pattern of those resources.
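To make the queuing-model idea concrete, the following is a minimal sketch of the classical closed-form results for a single-queue (M/M/1) resource, which relate arrival rate and service rate to utilization, average queue population, and average response time. The function name and its use here are illustrative, not part of any particular system described above.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Closed-form M/M/1 results for a single shared resource.

    arrival_rate: mean task arrivals per unit time (lambda)
    service_rate: mean completions per unit time when busy (mu)
    """
    rho = arrival_rate / service_rate          # utilization
    if rho >= 1.0:
        raise ValueError("queue is unstable: arrival rate >= service rate")
    avg_in_system = rho / (1.0 - rho)          # mean number of tasks at the resource
    avg_response = 1.0 / (service_rate - arrival_rate)  # mean wait + service time
    return rho, avg_in_system, avg_response


# Example: a CPU serving 8 requests/s with capacity for 10 requests/s
rho, n, r = mm1_metrics(8.0, 10.0)
# utilization 0.8, on average 4 tasks queued or in service, 0.5 s response time
```

As the paragraph above notes, a real model interconnects many such resources (CPUs, disks, threads, locks) into a queuing network; the single-queue formula illustrates why response time grows sharply as any one resource approaches saturation.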
There are several approaches to solving this problem. One approach is to employ a closed-loop feedback controller that adjusts its control variables in reaction to changes in external observations, such as the average response time. This approach has several disadvantages, however. For example, it is oblivious to the system bottleneck and may drive the system into an undesirable, saturated state just to learn the effect of a given setting of the control variables. Another approach is to use an open-loop controller based on a linear (or nonlinear) behavioral model of the computer system under control. The parameters of the model may be determined statically using off-line analysis, or dynamically using online measurements and analysis. One disadvantage of this approach is that the number of model parameters grows as a more accurate model is sought, making the parameter estimation problem more complex. Further, the bottleneck resource is not represented explicitly in the model, so it is not straightforward to determine the workload level that avoids saturating the system.
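The closed-loop approach described above can be sketched as a simple proportional controller that nudges a concurrency limit based on the gap between measured and target response time. The function, its parameters, and the gain value are hypothetical illustrations of the general scheme, not a specific implementation from this description; note how the rule reacts only to the observed response time and knows nothing about which resource is the bottleneck.

```python
def adjust_concurrency(limit, measured_rt, target_rt,
                       gain=0.5, min_limit=1, max_limit=100):
    """Proportional feedback step for a concurrency limit.

    Shrinks the limit when the measured response time exceeds the
    target, grows it when there is headroom. Purely reactive: it may
    overshoot into saturation before the high response time is observed.
    """
    error = (target_rt - measured_rt) / target_rt  # relative error
    new_limit = limit * (1.0 + gain * error)
    return max(min_limit, min(max_limit, int(round(new_limit))))


# Response time is double the target: cut concurrency from 10 to 5
adjust_concurrency(10, measured_rt=2.0, target_rt=1.0)
```

This illustrates the disadvantage noted above: the controller must actually drive the system to an overloaded setting, and observe the resulting degradation, before it learns to back off.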
In an environment such as the one described above, controlling the workload traffic, by limiting concurrency and/or throughput, becomes crucial to maintaining good system performance. Typically, this control is achieved by deploying monitoring agents on the various nodes to collect statistics on the utilization of the various resources and on the timing of requests as they receive service. The measured data is then used by an analyzer component to determine the bottleneck resource in the system, and a workload controller adjusts control variables such as concurrency and throughput limits. This approach requires monitoring agents on the nodes, which are the controlled elements in this case. Such monitoring agents are software components that require a specific runtime environment, such as a particular level of an operating system or application server. In a computer complex where nodes may come from different vendors, it is not feasible to assume that monitoring agents can be deployed on all nodes. Thus, one needs a solution for controlling the workload of nodes that does not rely on monitoring agents internal to the nodes, but only on external observations. The challenge is to identify the bottleneck resource inside a black box through external observations alone.
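One external signal that reveals an internal bottleneck, without any agent on the node, is the throughput curve: as offered concurrency rises, throughput climbs roughly linearly until some internal resource saturates, after which it flattens. The following sketch, under the assumption that throughput has been measured externally at several concurrency levels, finds the knee of that curve; the function name and tolerance are illustrative.

```python
def find_saturation_point(concurrency_levels, throughputs, tol=0.05):
    """Locate the concurrency level beyond which throughput stops growing.

    concurrency_levels: increasing list of offered concurrency settings
    throughputs: externally measured throughput at each setting
    tol: minimum relative throughput gain to count as "still scaling"
    """
    for i in range(1, len(throughputs)):
        gain = (throughputs[i] - throughputs[i - 1]) / throughputs[i - 1]
        if gain < tol:
            # negligible gain: some internal resource saturated earlier
            return concurrency_levels[i - 1]
    return concurrency_levels[-1]  # no plateau observed in the measured range


# Throughput flattens between concurrency 8 and 16, so 8 is the knee
find_saturation_point([1, 2, 4, 8, 16], [100, 190, 350, 400, 405])
```

Such a knee estimate lets an external controller cap concurrency near the saturation point of the black-box node, addressing the challenge stated above without deploying any internal monitoring agent.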