The past few years have seen the emergence of a large class of tasks that benefit from analysis of SQL workload information. Some examples of such tasks are database administration and understanding user/application behavior. The problem of collecting a SQL workload is facilitated by profiling tools provided by commercial database vendors that log SQL activity on the server over a representative period of time. However, little attention has been paid to understanding the requirements for analysis of the collected workload.
In analyzing workloads, it is helpful to identify a small representative subset that captures the essence of the large workload that has been logged using automated tools. There are multiple reasons why picking such a small representative subset is necessary. First, the resources needed to accomplish tasks such as index tuning grows with increasing sizes of workloads. There is a significant benefit to be gained by “filtering” the workload before it is fed to these tasks, while not compromising its characteristics. A second motivation for identifying a small representative subset is the necessity of having to meaningfully summarize data for viewing by DBAs or analysts. It is important to be able to offer relatively small representative synopses of the workload, before “drilling down” to identify the queries of interest.
A simple way to obtain a representative subset is to pick a uniform random sample of the workload. While sampling is conceptually simple, and, in fact, useful in many situations, a DBA may like to obtain a representative subset that has additional constraints, e.g., pick a representative workload with the 100 most expensive queries while ensuring that every table in the database occurs in at least 5 queries. Thus, the specification for picking a representative subset for a workload (henceforth called a “summary”) depends on the task at hand and, more often than not, sampling in itself is not adequate.
Recently, several tools have emerged that exploit knowledge of the database workload for a variety of tasks such as physical design tuning, feedback based optimization, and approximate query processing. There has also been work on classifying database workloads, such as OLTP vs DSS, so as to enable automatic tuning and configuration of database system parameters based on workload type. Typically, workloads collected by today's database profiling tools can be very large (millions of statements), whereas most of the above tools work efficiently for relatively small workload sizes.
The concept of workload compression has been introduced to find a smaller workload whose use results in the same application quality as when the original workload is used. Workload compression requires applications to specify custom distance functions that quantify how close two statements in the workload are. Providing such distance functions may not be easy for all applications. REDWAR allows simple aggregations over the structure and complexity of SQL statements and transaction run-time behavior. A variety of techniques for producing a workload model have also been presented. However, the “representativeness” of these workload models varies.