Information on how a computer database system is used can be important in performance tuning and management of the database system. In the context of relational databases, one specific form of usage information is the workload, which is typically a set of SQL statements executed by users of the system. Database practitioners and vendors have recognized the opportunity to tune and manage various aspects of database systems by analyzing workload information. Several workload-driven tasks have emerged recently for solving problems such as histogram tuning, improving query optimization, index selection, approximate answering of aggregation queries, and statistics selection. The term application is used to generically refer to such workload-driven tasks.
A key factor affecting the scalability of an application is the size of the workload, i.e., the number of SQL statements in the workload. In many cases, the workload consumed by the application is gathered using mechanisms in modern DBMSs that allow recording of the SQL statements that execute on the server. In order to capture a representative collection of statements that execute against the system, the user of the application, such as a database administrator, could collect as the workload a log of SQL statements over a sufficiently large window of time (e.g., a day or a week). Consequently, workloads tend to be large in size. Moreover, these applications often perform detailed analysis of queries in the workload and their inter-relationships, and hence their running time can increase significantly with workload size.
Operation of such applications can be sped up significantly by supplying as input a substitute workload of smaller size (referred to as the compressed workload), while qualitatively not degrading the result of the application. In other words, the result of the application when run on the compressed workload should be identical (or close) to the result when it is run on the original workload. It is important that this compressed workload be found efficiently, since otherwise the very purpose of using a compressed workload is defeated.
One solution to the workload compression problem is to use uniform random sampling to pick a smaller subset of the original workload. While this strategy is efficient, it is not an effective method for workload compression. The key reason for the poor compression achieved by uniform random sampling is that it is oblivious to the application for which the workload is being used, and hence ignores potentially valuable information about the statements in the workload. The invention exploits application knowledge to obtain significantly better workload compression as compared to uniform random sampling.
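The uniform random sampling baseline described above can be sketched as follows. The helper name and the toy workload are illustrative assumptions, not part of the invention; the point is that every statement is retained with equal probability, regardless of its structure or its importance to the application.

```python
import random

def sample_workload(workload, fraction, seed=0):
    """Uniform random sampling baseline for workload compression.

    Keeps a `fraction` of the SQL statements, chosen uniformly at
    random. The selection is oblivious to the application consuming
    the workload: every statement is equally likely to survive.
    """
    rng = random.Random(seed)
    k = max(1, int(len(workload) * fraction))
    return rng.sample(workload, k)

# Illustrative workload of 100 parameterized SELECT statements.
workload = [f"SELECT * FROM T WHERE a = {i}" for i in range(100)]
compressed = sample_workload(workload, 0.1)
```

Because the sampler never inspects the statements themselves, two nearly identical queries are as likely to both be retained as two very different ones, which is precisely the source of the poor compression quality noted above.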
There have been prior art publications that use sampling in the area of databases, e.g., Gibbons, P. B., Matias, Y., and Poosala, V. Fast Incremental Maintenance of Approximate Histograms. Proceedings of the 23rd Intl. Conference on Very Large Databases, 1997; and Haas, P. J., Naughton, J. F., Seshadri, S., and Stokes, L. Sampling-Based Estimation of the Number of Distinct Values of an Attribute. Proceedings of the 21st Intl. Conference on Very Large Databases, 1995. However, these publications address the problem of sampling the data, not the workload. Random sampling has also been studied extensively in the statistics literature. Random sampling ignores interaction among the objects being sampled, which is the primary source of its efficiency.
Clustering has been studied in the context of machine learning and data mining. A class of prior art also exists wherein data points are in a metric space, i.e., each pair of points is assigned a distance that is symmetric and satisfies the triangle inequality. There is also prior art that evaluates query equivalence but not with a goal toward compressing a workload. One form of equivalence is when two queries are semantically identical, i.e., they return the same result. Determining equivalence does, of course, come at a cost since it could require significant computational effort.
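Clustering under a distance function of the kind mentioned above can be sketched as follows. The greedy scheme, the token-overlap distance, and the threshold value are all illustrative assumptions introduced here for exposition; they are not the distance function or clustering method of the invention.

```python
def greedy_cluster(statements, distance, threshold):
    """Greedy representative selection: each statement joins the first
    existing representative within `threshold`; otherwise it becomes a
    new representative. The representatives form a compressed workload.
    """
    reps = []
    for s in statements:
        if not any(distance(s, r) <= threshold for r in reps):
            reps.append(s)
    return reps

def token_distance(q1, q2):
    """Toy distance (illustrative only): one minus the Jaccard overlap
    of the two statements' token sets. It is symmetric, and identical
    statements are at distance zero.
    """
    t1, t2 = set(q1.split()), set(q2.split())
    return 1 - len(t1 & t2) / len(t1 | t2)

queries = [
    "SELECT a FROM T WHERE a = 1",
    "SELECT a FROM T WHERE a = 2",
    "SELECT b FROM S GROUP BY b",
]
# The two similar SELECTs collapse into one cluster; the GROUP BY
# query is distant from both and forms its own cluster.
compressed = greedy_cluster(queries, token_distance, 0.5)
```

Unlike uniform random sampling, such a scheme inspects the statements and deliberately discards near-duplicates, which is why distance-based grouping is a natural building block for workload compression.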