In database systems, a query optimizer determines the execution plans of queries. For any given query, the query optimizer chooses the plan with the lowest estimated cost. To estimate the cost of a plan, the optimizer uses knowledge of the result sizes of the operators comprising the plan. The optimizer also uses knowledge of the number of distinct values in the output of these operators. More accurate knowledge of these two quantities results in more accurate cost estimates, and hence a better choice of query execution plan.
The result size of an operator and the number of distinct values in its output depend on the data distribution of the query. To estimate these two quantities, database systems use various techniques to approximate input data distributions, such as histograms or sampling. The estimated values provided by these techniques, no matter how accurate, are only approximations that invariably deviate from the true values to some extent. Since query optimizer cost models are highly sensitive to result size and distinct value estimates, the errors in estimating these quantities lead to errors in estimating plan costs. Estimation errors increase as they propagate through the large query execution plans. For table joins, which are used in almost all database queries, the estimation error can be exponential in the number of joins. These errors can ultimately cause the optimizer to choose a sub-optimal plan. However, the plans chosen using these techniques, while not necessarily optimal, are usually “good enough.” These techniques are used because they quickly provide the required estimates. Query optimization, including all the necessary estimation steps, must not take more than a very small fraction of the time to execute the query. Query optimizers therefore use these fast but possibly inaccurate estimation techniques since the performance of the chosen plans is typically acceptable for many applications.
This reasoning assumes that the database system mostly handles ad hoc queries—queries for which optimization must be fast and for which sub-optimal plans are acceptable. This is not true for queries embedded in applications, which comprise a large portion of the workloads handled by modern database systems. These queries can be (and often are) optimized off-line to produce a compiled version of the query that is then used whenever the query is executed. Such optimizations do not necessarily have to be fast since they are off-line processes. Furthermore, these queries are typically executed frequently since they are part of the application. Thus, the cost of optimization is amortized over many executions of the query. Moreover, finding the optimal execution plan in this setting is more important than for ad hoc queries, because the repeated execution of the queries will increase the effect of any savings in execution time. For these embedded queries, the user may be willing to spend more time optimizing the query and obtaining more accurate cost estimates if this results in choosing a faster execution plan.
Several techniques for estimating result sizes have been proposed in the literature. All these techniques deal with estimating the result size of a single selection or join operator. Some of these techniques can also be extended to estimating the number of distinct values in the output of these operators.
One technique for estimating result sizes is to sample the data at query optimization time. One disadvantage of this approach is the overhead it adds to query optimization. Furthermore, sampling cannot be used to accurately estimate the number of distinct values of an attribute. Sampling is more useful for other applications such as building histograms or approximate query processing.
A second technique for estimating result sizes is using histograms. Histograms are now used by most commercial database systems for result size estimation. Although one-dimensional equi-depth histograms are used in most systems, more accurate histograms have been proposed for use with single and multiple dimensions. A novel approach for building histograms based on wavelets is presented in a paper by Y. Matias, J. S. Vitter, and M. Wang entitled “Wavelet-based Histograms for Selectivity Estimation” in Proceedings of the ACM SIGMOD Conference, pages 448-459, 1998. Histograms, by their very nature, only capture an approximation of the data distribution, and incur varying degrees of estimation errors.
A third technique for estimating result sizes is using feedback from the query execution engine. This approach eliminates the overhead of histogram construction and maintenance, so it is potentially less expensive than traditional histograms. However, it is still an approximate technique with significant estimation errors.
The importance of estimating the result size of joins is discussed in the literature as early as 1986. See L. F. Mackert and G. M. Lohman, “R* Optimizer Validation and Performance Evaluation for Local Queries” in Proceedings of the ACM SIGMOD Conference, pages 84-95, 1986. In this paper, it is noted that the cost model used by the R* System query optimizer for nested loop joins is very sensitive to the estimated result size of the join. Inaccurate result size estimation can lead to sub-optimal plans being chosen. This conclusion demonstrates the usefulness of our approach.