1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular, to learning from empirical results in query optimization.
2. Description of Related Art
(Note: This application references a number of different publications as indicated throughout the specification by mnemonics enclosed in brackets, e.g., [Authorxx], wherein Author is the author""s name (or abbreviation thereof) and xx is the year of publication. A list of these different publications with their associated mnemonics can be found in Section 6 entitled xe2x80x9cBibliographyxe2x80x9d in the xe2x80x9cDetailed Description of the Preferred Embodiment.xe2x80x9d Each of these publications is incorporated by reference herein.)
Computer systems incorporating Relational DataBase Management System (RDBMS) software using a Structured Query Language (SQL) interface are well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Standards Organization (ISO).
In an RDBMS system, queries typically specify what data is to be accessed, rather than how that data is to be accessed. An SQL Query Compiler, and specifically an optimizer function of the SQL Query Compiler, automatically determines the appropriate way to access and process the data referenced in a single SQL query. This is done by considering many possible query execution plans (QEPs), evaluating the estimated cost of each plan, and choosing the cheapest plan in terms of estimated execution cost.
The estimated execution cost is largely dependent upon the number of rows that will be processed by each operator in the QEP. Estimating the number of rows or cardinality after one or more predicates have been applied has been the subject of much research for over 20 years [SAC+79, Gel93, SS94, ARM89, Lyn88]. Typically, this estimate relies on statistics of database characteristics, beginning with the number of tows for each table, multiplied by a filter factor or selectivity for each predicate, derived from the number of distinct values and other statistics on columns. The selectivity of a predicate P effectively represents the probability that any row in the database will satisfy P.
While query optimizers do a remarkably good job of estimating both the cost and the cardinality of most queries, many assumptions underlie this mathematical model. Examples of these assumptions include:
Currency of information: The statistics are assumed to reflect the current state of the database, i.e. that the database characteristics are relatively stable.
Uniformity: Although histograms deal with skew in values for xe2x80x9clocalxe2x80x9d selection predicates (to a single table), there are no products available that exploit them for joins.
Independence of predicates: Selectivities for each predicate are calculated individually and multiplied together, even though the underlying columns may be related, e.g., by a functional dependency. While multi-dimensional histograms address this problem for local predicates, again they have never been applied to join predicates, aggregation, etc. Applications common today have hundreds of columns in each table and thousands of tables, making it impossible to know on which subset(s) of columns to maintain multi-dimensional histograms.
Principle of inclusion: The selectivity for a join predicate X.a=Y.b is typically defined to be 1/MAX(|a|, |b|), where |a| denotes the number of distinct values of column a and |b| denotes the number of distinct values of column b. This implicitly assumes the xe2x80x9cprinciple of inclusionxe2x80x9d, i.e., that each value of the smaller domain has a match in the larger domain (which is frequently true for joins between foreign keys and primary keys).
When these assumptions are invalid, significant errors in the cardinality, and hence cost, estimates result, causing sub-optimal plans to be chosen. The primary cause of major modeling errors is the cardinality estimate on which costs depend. Cost estimates might be off by 10 or 15 percent, at most, for a given cardinality, but cardinality estimates can be off by orders of magnitude when their underlying assumptions are invalid or uncertain. Although there has been considerable success in using histograms to detect and correct for data skew [IC91, PIHS96, PI97], and in using sampling to gather up-to-date statistics [HS93, UFA98], there has to date been no comprehensive approach to correcting all modeling errors, regardless of origin.
Much of the prior literature on cardinality estimates has utilized histograms to summarize the data distribution of columns in stored tables, for use in estimating the selectivity of predicates against those tables. Recent work has extended one-dimensional equi-depth histograms to mote sophisticated and accurate versions [PIHS96] and to multiple dimensions [PI97]. This classical work on histograms concentrated on the accuracy of histograms in the presence of skewed data and correlations by scanning the base tables completely, at the price of high run-time cost. The work in [GMP97] deals with the necessity of keeping histograms up-to-date at very low cost. Instead of computing a histogram on the base table, it is incrementally derived and updated from a backing sample of the table, which is always kept up-to-date. Updates of the base table are propagated to the sample and can trigger a partial re-computation of the histogram, but there is no attempt to validate the estimates from these histograms against run-time actual statistics.
The work of [CR94] and [AC99] are the first to monitor cardinalities in query executions and exploit this information in future compilations. In [CR94], the result cardinalities of simple predicates after the execution of a query are used to adapt the coefficients of a curve-fitting formula. The formula approximates the value distribution of a column instead of employing histograms for selectivity estimates. In [AC99], the authors present a query feedback loop, in which actual cardinalities gleaned from executing a query are used to correct histograms. Multiple predicates can be used to detect correlation and update multi-dimensional histograms [BCG01]. This approach effectively deals with single-table predicates applied while accessing a base table, but the paper does not deal with join predicates, aggregation, and other operators, nor does it specify how the user is supposed to know on which columns multi-dimensional histograms should be created.
Another research direction focuses on dynamically adjusting a QEP after the execution has begun, by monitoring data statistics during the execution (dynamic optimization). In [KDeW98], the authors introduce a new statistic collector operator that is compiled into the plan. The operator collects the row stream cardinality and size and decides whether to continue or to stop the execution and re-optimize the remainder of the plan. Query scrambling in [UFA98] is geared towards the problem of distributed query execution in wide area networks with uncertain data delivery. Here, the time-out of a data-shipping site is detected and the remaining data-independent parts of the plan are re-scheduled until the problem is solved. Both solutions deal with dynamic re-optimization of (parts of) a single query, but they do not save and exploit this knowledge for the next query optimization run.
In light of the above, there is a need in the art for an effective and comprehensive technique for query optimizers to learn from any modeling mistake at any point in a QEP. There is also a need for such learning optimizers to automatically validate cost estimates against actual costs incurred in the execution of queries. The use of validation would allow models of QEPs to be adjusted for better optimization of future queries. Moreover, validation would also allow database statistics to be adjusted to better reflect the characteristics of the database.
To overcome the limitations in the prior art described above, and to overcome other limitations that will become apparent upon reading and understanding the present specification, the present invention discloses a method, apparatus, and article of manufacture for optimizing queries in a Relational Database Management System (RDBMS) by generating a plurality of query execution plans (QEPs) for the query, providing an execution model of each of the QEPs, choosing one of the QEPs for execution based on the model associated therewith, and exploiting an empirical measurement from the execution of the chosen QEP to validate the model associated therewith, by determining whether the model is in error, and by computing one or mote adjustments to the model to correct the determined error.