1. Field of Invention
The present invention relates generally to the field of query execution strategy optimization. More specifically, the present invention is related to the robustness of query execution strategies based on validity ranges of query cardinality estimates.
2. Discussion of Prior Art
Database Management Systems (DBMSs) traditionally execute queries against a relational database by optimizing to find the best query execution strategy, commonly known as a query plan. A DBMS uses an optimization model to select the query plan that is expected to have the lowest execution cost (e.g., to execute the query in the shortest amount of time). Execution cost is largely dependent upon the number of rows that will be processed by each operator in a query plan, also known as row cardinality. Thus, an optimization model estimates cardinality incrementally, typically beginning with statistics of database characteristics collected prior to the optimization process. These statistics may include the number of rows in each table, histograms for each column as proposed in “Propagation of Errors in the Size of Join Results” by Ioannidis and Christodoulakis, “Improved Histograms for Selectivity Estimation of Range Predicates” by Poosala et al., and “Selectivity Estimation without Value Independence” by Poosala and Ioannidis, as well as sampled synopses as proposed in “Sampling-Based Selectivity Estimation for Joins Using Augmented Frequent Value Statistics” by Haas and Swami. However, a DBMS optimization cost model is prone to errors and inaccuracies. Thus, a plan chosen as being optimal by a DBMS may actually perform worse than expected. Virtually every commercial query optimizer chooses the best query plan using an optimization cost model that is dependent on accurate cardinality estimation.
While query optimizers do a decent job of estimating the cardinality of rows passing through operators in a query plan, there are assumptions underlying the mathematical models upon which these cardinality estimations are based. The currency of database statistics, parameter markers, and the independence of predicates and attributes are among such assumptions. Outdated statistics and the resulting invalid assumptions may cause significant cardinality estimation errors, which may in turn cause significant errors in the estimation of the execution cost of a query plan. The propagation of such errors can cause sub-optimal query plans to be chosen during optimization. Thus, current cardinality estimation approaches are limited in that they neither address nor provide for unpredictability in query optimization, specifically, the chance that a chosen query plan performs significantly worse than the optimal query plan, given erroneous cardinality estimates.
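As an illustration of how an invalid independence assumption leads to a cardinality estimation error, the following minimal sketch compares an estimate computed under independence with the actual row count. The table contents, column values, and helper names are hypothetical and chosen only for illustration; they are not taken from any cited reference.

```python
from fractions import Fraction

# Hypothetical toy table of (make, model) rows. The two columns are
# perfectly correlated, which violates the independence assumption.
rows = [("Honda", "Civic")] * 20 + [("Ford", "Focus")] * 80


def selectivity(table, predicate):
    """Fraction of rows in the table that satisfy the predicate."""
    return Fraction(sum(1 for r in table if predicate(r)), len(table))


def is_honda(row):
    return row[0] == "Honda"


def is_civic(row):
    return row[1] == "Civic"


# Estimate under independence: multiply the single-predicate selectivities.
estimated = len(rows) * selectivity(rows, is_honda) * selectivity(rows, is_civic)

# Actual cardinality of the conjunction of both predicates.
actual = sum(1 for r in rows if is_honda(r) and is_civic(r))

# Factor by which the estimate is off.
error_factor = actual / estimated

print(f"estimated={estimated}, actual={actual}, error factor={error_factor}")
```

In this toy case the estimate is too low by a constant factor; an error of this kind in an early operator carries into the estimate for every operator that consumes its output.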
Current parametric optimization approaches attempt to address this unpredictability by dividing the value domain specified for each parameter into intervals, and computing a separate query plan for each combination of these intervals such that the query plan for a particular combination remains optimal for all parameter values within those intervals. Such an approach is described in “Parametric Query Optimization” by Ioannidis et al.
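The interval-based lookup described above can be sketched for a single parameter as follows. The interval boundaries, plan names, and the function `plan_for` are hypothetical; this is a simplified illustration of the general idea and not the algorithm of the cited reference.

```python
from bisect import bisect_right

# Hypothetical boundaries splitting a selectivity parameter in [0, 1]
# into three intervals: [0, 0.01), [0.01, 0.2), and [0.2, 1].
boundaries = [0.01, 0.2]

# One precomputed plan per interval (names are illustrative only).
plans = ["index_nested_loop_join", "hash_join", "sort_merge_join"]


def plan_for(selectivity):
    """Return the precomputed plan for the interval containing the value."""
    return plans[bisect_right(boundaries, selectivity)]


for value in (0.005, 0.05, 0.5):
    print(value, "->", plan_for(value))
```

With several parameters, one such plan would have to be stored for every combination of intervals rather than for every interval.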
However, parametric optimization approaches are limited in that they require the generation of not one query plan but the enumeration of a whole range of query plans, each of which may or may not be optimal under a given combination of parameter settings. Such an approach is expensive not only in terms of implementation, but also in terms of the memory space and processing time required. The cost of parametric optimization grows exponentially with the size of the original query because the number of query plans that need to be generated, stored, loaded, and processed at runtime increases exponentially. Specifically, “Design and Analysis of Parametric Query Optimization Algorithms” by Ganguly and “Parametric Query Optimization for Linear and Piecewise Linear Cost Functions” by Hulgeri and Sudarshan present algorithms to compute all optimal parametric query plans for linear and piecewise linear cost functions. The algorithms described by Ganguly as well as by Hulgeri and Sudarshan are prohibitively expensive in that the computation time and space involved increase exponentially with the number of query parameters. These approaches are limited by their basic exponential nature. Moreover, linear cost assumptions cannot precisely approximate the nuanced cost models used in commercial database systems. Such cost models are not always smooth, may not be monotonic in their input cardinalities, and may have discontinuities.
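The exponential growth noted above can be made concrete by counting interval combinations. The sketch below assumes, purely for illustration, that every parameter is divided into the same number of intervals; the function name `interval_combinations` is hypothetical.

```python
from itertools import product


def interval_combinations(intervals_per_parameter):
    """Enumerate every combination of per-parameter interval indices."""
    return list(product(*(range(n) for n in intervals_per_parameter)))


# Assume three intervals per parameter and one to five parameters.
counts = [len(interval_combinations([3] * p)) for p in range(1, 6)]
print(counts)  # [3, 9, 27, 81, 243]
```

Each combination may require its own stored query plan, so the number of plans multiplies with every additional parameter.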
Prior art described in “Dynamic Query Evaluation Plans” by Graefe and Ward and “Optimization of Dynamic Query Evaluation Plans” by Cole and Graefe implements an alternative form of parametric query optimization in the Volcano™ optimizer generator. The main premise of these approaches lies in the introduction of a choose-plan operator to “glue” together multiple alternative query plans. At compile time, if uncertainty exists for any parameter value in a cost function, query plans involving that cost function are declared incomparable. At execution start-up time, actual parameter values are applied to the cost functions and all incomparable query plans are re-costed. In this manner, a single, optimal query plan is chosen. The described approach is limited in that it provides no basis for choosing which parameters to mark as uncertain. If all parameters in a query plan are marked as uncertain, an exponential number of query plans will be generated.
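The behavior of a choose-plan operator at execution start-up time can be sketched as follows. The classes `Plan` and `ChoosePlan`, the method `resolve`, and the cost functions are hypothetical and are not the Volcano™ implementation; they only illustrate re-costing the alternatives once the actual parameter value is known.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Plan:
    name: str
    # Cost as a function of a parameter that is uncertain at compile time.
    cost: Callable[[float], float]


@dataclass
class ChoosePlan:
    """Holds alternative plans that could not be compared at compile time."""
    alternatives: List[Plan]

    def resolve(self, actual_value: float) -> Plan:
        # At start-up, re-cost every alternative and keep the cheapest.
        return min(self.alternatives, key=lambda p: p.cost(actual_value))


# Hypothetical cost functions over the number of qualifying rows.
index_scan = Plan("index_scan", lambda rows: 10 + 2.0 * rows)
table_scan = Plan("table_scan", lambda rows: 500 + 0.1 * rows)
chooser = ChoosePlan([index_scan, table_scan])

print(chooser.resolve(50).name)    # index_scan
print(chooser.resolve(1000).name)  # table_scan
```

In this sketch only one parameter is treated as uncertain; every further uncertain parameter adds alternatives that must be kept until start-up.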
Kabra and DeWitt, in “Efficient Mid-Query Re-Optimization of Sub-Optimal Query Execution Plans”, describe an approach to reducing the impact of cardinality estimation errors that uses an ad hoc cardinality error threshold to determine whether to re-optimize a query plan. This approach is limited in that a single ad hoc threshold is an overestimate for some query plans and an underestimate for others.
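A fixed error threshold of this kind can be sketched as shown below. The ratio-based test, the default threshold of 2.0, and the function name `should_reoptimize` are assumptions made for illustration only; they are not the criterion used in the cited reference.

```python
def should_reoptimize(estimated, observed, threshold=2.0):
    """Return True if the observed cardinality differs from the estimate
    by more than the given factor, in either direction."""
    # Clamp to 1 so that zero-row estimates or observations do not divide by zero.
    est = max(estimated, 1)
    obs = max(observed, 1)
    ratio = max(obs / est, est / obs)
    return ratio > threshold


print(should_reoptimize(100, 150))  # False: within a factor of 2
print(should_reoptimize(100, 500))  # True: underestimated by a factor of 5
print(should_reoptimize(100, 20))   # True: overestimated by a factor of 5
```

The same threshold is applied to every query plan regardless of how sensitive that plan's optimality is to the estimate, which is the limitation noted above.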
U.S. Pat. No. 6,363,371 discloses a method for identifying essential statistics for query optimization in database systems. However, the disclosed method does not determine validity ranges nor does it determine these essential statistics from intersection points of cost functions to obtain the sensitivity of the best plan to cardinality estimates.
The LEO project proposed by Stillger et al. in “LEO—DB2™'s Learning Optimizer” addresses the problem of using query feedback to optimize future queries based on cardinality estimation errors observed during previous query executions. The DEC RDB™ system executes multiple access methods and performs a competitive comparison before selecting one. Neither of these two approaches addresses a robustness measure for a currently running query, namely, the probability that a current query plan will perform sub-optimally given inaccurate cardinality estimates.
Whatever the precise merits, features, and advantages of the above cited references, none of them achieves or fulfills the purposes of the present invention.