Query optimizers in relational database management systems (DBMSs) use database statistics to determine the best query execution strategy for a query. The statistics usually include the number of rows in a table, the number of distinct values for a column, histograms of the distribution of data values in a column, the number of distinct index keys, and the most frequent values in a column. The optimizer uses these statistics to compute the cardinality (i.e. the number of rows processed) at each intermediate step of a query execution plan (QEP). Advanced optimizers also use joint statistics on groups of columns in order to deal with possible correlations between column values. In addition, many query optimizers also utilize statistics for physical parameters, like the clustering of a table or an index, or the height of a B-tree index.
Missing, out-of-date, or otherwise erroneous database statistics cause the optimizer to improperly estimate the cardinalities and costs of the steps in a QEP, which results in a poor choice of QEP and hence unacceptably long query processing times. Unfortunately, it is all too easy for the statistics in a DBMS to deteriorate over time. Database statistics are not always updated incrementally during data manipulations such as insert, delete, update, and load operations, because such incremental maintenance is expensive. Statistics for tables with high data change rates are therefore very likely to be out of date.
Moreover, some statistics require the proper setting of configuration parameters in order to optimally model the database. Example parameters include the number of frequent values and the number of quantiles (histogram buckets) that should be used when creating a synopsis of the data. These parameters heavily depend on the nature of the data, and its distribution. If not configured properly, the optimizer will again use inaccurate estimates and perform poorly.
While query optimizers often do good job of estimating both the cost and the cardinality of most queries, many risky assumptions underlie these mathematical estimates. Examples of these assumptions include currency of information, data uniformity, and independence of predicates. The statistics are assumed to reflect the current state of the database, i.e. that the database characteristics are relatively stable, and that's not always the case. Furthermore, while histograms deal with skew in values for local selection predicates (e.g. to a single table), no available commercial product is known to exploit them for joins.
Regarding independence of predicates, selectivites for each predicate are calculated individually and multiplied together, even though the underlying columns may be related, perhaps by a functional dependency. The selectivity for a join predicate X.a=Y.b is typically defined to be 1/max{|a|, |b|}, where |b| denotes the number of distinct values of column b. This implicitly assumes the “principle of inclusion”, i.e. that each value of the smaller domain has a match in the larger domain. When these assumptions are invalid, significant errors in the cardinality (and hence cost) estimates result, causing sub-optimal plans to be chosen. Thus, the primary cause of major modeling errors is usually the cardinality estimate on which costs depend. Cost estimates might be off by 10 or 15 percent at most for a given accurate cardinality, but cardinality estimates can themselves be off by orders of magnitude when their underlying assumptions are invalid. While multidimensional histograms address this problem for local predicates, they have apparently never been applied to join predicates, aggregations, etc. Also, applications today commonly have hundreds of columns in each table and thousands of tables, making it impossible to know on which subsets of columns to maintain multidimensional histograms.
Configuration and maintenance of these statistics has traditionally been a time-consuming manual operation, wherein a database administrator continually monitors database behavior and refreshes the statistics as the data changes. The administrator must also occasionally adjust the set of statistics that are maintained, and perhaps reschedule statistics collection. As databases increase in size, the administrator's task becomes more difficult to manage.
Several known databases try to solve this problem by providing statistics refresh automation features that are based on data change rate detection. These tools essentially automate statistics refresh on all tables where the percentage of update-delete-insert (UDI) operations exceeds a threshold. Unfortunately, this involves a substantial maintenance overhead cost.
Chaudhuri et al. describe a method used in the Microsoft (R) SQL Server that analyzes the query workload in order to select a set of statistics to maintain, such as multidimensional histograms. This static analysis is performed prior to query execution. Chaudhuri's work is described in “Automating Statistics Management for Query Optimizers”, by S. Chaudhuri and V. Narasayya, in the Proceedings of the 16th International Conference on Data Engineering, San Diego, 2000, which is hereby incorporated by reference.
Zhu et al. describe a piggybacking technique for collecting statistics that observes the data scanned during normal data manipulation language (DML) processing. This technique avoids the asynchronous background refresh of table data statistics, but has the significant drawback of incurring a constant overhead for all DML processing. While the overhead for any one SQL statement may be small, the cumulative overhead can be significant. Zhu's work is described in “A Piggyback Method to Collect Statistics for Query Optimization in Database Management Systems”, by Qiang Zhu et al., in the Proceedings of the 1998 Conference of the Centre for Advanced Studies on Collaborative Research, Toronto, 1998, p. 25, which is hereby incorporated by reference.
In contrast to these approaches, U.S. Pat. No. 6,763,359 describes a database system that learns from empirical performance results gathered during query execution. The query optimizer generates alternative QEPs for executing a query, provides an execution model of each QEP that includes an execution cost estimate, and selects one of the QEPs for execution based on the associated model. However, the model is validated by empirical measurements made during execution. If the measurements indicate the model is in error, then adjustments to the model are computed to correct the error.
Pending patent application U.S. Ser. No. 10/864,463 cited above describes a method for discovering dependencies between database column pairs for improved query optimization. As mentioned above, the assumption that query predicates are independent, when in fact the relevant underlying columns are related, is a major source of error. Thus, data values are sampled randomly prior to query execution, and column pairs are tested for functional dependency and statistical correlation, and accordingly prioritized for recommendation to the query optimizer. Statistics for the recommended column pairs are tracked to improve the estimates of predicate selectivity.
While these accomplishments have improved database performance, an autonomic database system is needed that fully frees the database administrator from the tedium of supervising the collection and maintenance of database statistics. Neither a UDI-driven nor a query feedback (QF)-driven approach is entirely sufficient by itself. UDI-driven approaches are proactive and therefore can handle unforeseen queries, but may not concentrate enough effort on maintaining statistics that are critical to the users' workload. QF-driven approaches are reactive and thus require some learning time, but they focus on the most critical statistics, and hence use system resources very efficiently.