The present invention relates generally to the field of query plan optimization in relational databases and, more specifically, to preprocessing for efficiently optimizing query plans having conjunctive predicates.
When comparing alternative query execution plans (QEPs), a cost-based query optimizer in a relational database management system (RDBMS) needs to estimate the selectivity of conjunctive predicates. The optimizer immediately faces a problem of how to combine available partial information about selectivities in a consistent and comprehensive manner. Estimating the selectivity of predicates has always been a challenging task for a query optimizer in a relational database management system. A classic problem has been the lack of detailed information about the joint frequency distribution of attribute values in the table of interest. Perhaps ironically, the additional information now available to modern optimizers has in a certain sense made the selectivity-estimation problem even harder.
Specifically, consider the problem of estimating the selectivity s1,2, . . . ,n of a conjunctive predicate of the form p1p2 . . . pn, where each pi is a simple predicate (also called a Boolean Factor, or BF) of the form “column op literal”. Here “column” is a column name, “op” is a relational comparison operator such as “=”, “>”, or “LIKE”, and “literal” is a literal in the domain of the column. Some examples of simple predicates are ‘make=“Honda”’ and ‘year >1984’. The selectivity of a predicate p, as known in the art, may be defined as the fraction (or, alternatively, the cardinality referring to the absolute number of satisfying rows) of rows in the table that satisfy p (where p is not restricted to conjunctive form). In typical prior art optimizers, statistics are maintained on each individual column, so that the individual selectivities s1, s2, . . . , sn of p1, p2, . . . , pn are available. Such a query optimizer would then impose an independence assumption and estimate the desired selectivity as s1,2, . . . , n=s1*s2* . . . * sn. This type of estimate ignores correlations between attribute values, and consequently can be inaccurate, often underestimating the true selectivity by orders of magnitude and leading to a poor choice of query execution plan (QEP).
To overcome the problems—such as inaccuracy resulting from ignoring correlations—caused by using the independence assumption, the optimizer can store the multidimensional joint frequency distribution for all of the columns in the database. However, in practice, the amount of storage required for the full distribution is exponentially large, making this approach infeasible. Alternative approaches therefore have been proposed for storage of selected multivariate statistics (MVS) that summarize important partial information about the joint distribution. Proposals have ranged from multidimensional histograms on selected columns to other, simpler forms of column-group statistics. Thus, for predicates p1, p2, . . . , pn, the optimizer typically has access to the individual selectivities s1, s2, . . . , sn as well as a limited collection of joint selectivities, such as s1,2, s3,5, and s2,3,4. The independence assumption is then used to “fill in the gaps” in the incomplete information, e.g., to estimate the unknown selectivity s1,2,3 by s1,2*s3.
The problem, alluded to above, of combining available partial information about selectivities in a consistent and comprehensive manner now arises, however. There may be multiple, non-equivalent ways of estimating the selectivity for a given predicate. FIG. 1, for example, shows possible QEPs (a), (b), and (c) for a query consisting of the conjunctive predicate p1p2p3. The QEP (a) in FIG. 1 uses an index-ANDing operation () to apply p1p2 and afterwards applies predicate p3 by a FETCH operator, which retrieves rows from a base table according to the row identifiers returned from the index-ANDing operator. The optimizer may know the selectivities s1, s2, s3 of the BFs p1, p2, p3. It may also know about a correlation between p1 and p2 via knowledge of the selectivity s1,2 of p1p2. Using independence, the optimizer might then estimate the selectivity of p1p2p3 as sa1,2,3=s1, 2*s3.
FIG. 1 shows an alternative QEP (b) that first applies p1p3 and then applies p2. If the optimizer also knows the selectivity s1,3 of p1p3, use of the independence assumption might yield a selectivity estimate sb1,2,3=s1,3*s2. However, this would result in an inconsistency if, as is likely, sa1,2,3≠sb1,2,3. There are potentially other choices, such as s1*s2*s3 or, if s2,3 is known, then s1,2*s2,3/s2 (the latter estimate amounts to a conditional independence assumption). Any choice of estimate will be arbitrary, since there is no supporting knowledge to justify ignoring a correlation or assuming conditional independence. Such a choice will then arbitrarily bias the optimizer towards choosing one plan over the other. Even worse, if the optimizer does not use the same choice of estimate every time that it is required, then different plans will be estimated inconsistently, leading to “apples and oranges” comparisons and unreliable plan choices
Assuming that the QEP (a) in FIG. 1 is the first to be evaluated, a prior art optimizer might avoid the foregoing problem that consistency might not be achieved by recording the fact that s1,2 was applied and then avoiding future application of any other MVS that contain either p1 or p2, but not both. In the above example, the selectivities for the QEP (c) in FIG. 1 would be used and the ones for QEP (b) would not. The prior art optimizer would therefore compute the selectivity of pip3 to be s1*s3 using independence, instead of using the MVS s1,3. Thus, the selectivity s1,2,3 for QEP (c) would be estimated in a manner consistent with that of QEP (a). In the example illustrated by FIG. 1, when evaluating the QEP (a), the prior art optimizer used the estimate sa1,2,3=s1,2*s3 rather than s1*s2*s3, because the former estimate better exploits the available correlation information, i.e., the correlation between p1 and p2. In general, there may be many possible choices for the estimate of sa1,2,3.
Although an ad hoc method as in the example of FIG. 1 may ensure consistency, it ignores valuable knowledge, e.g., the correlation between p1 and p3. Moreover, this method complicates the logic of the optimizer, because cumbersome bookkeeping is required to keep track of how an estimate was derived initially and to ensure that it will always be computed in the same way when estimating other plans. Even worse, ignoring the known correlation between p1 and p3 also introduces bias towards certain QEPs: if, as is often the case with correlation, s1,3>>s1*s3, and s1,2>>s1*s2, and if s1,2 and s1,3 have comparable values, then the optimizer will be biased towards the QEP (c) plan, even though the QEP (a) plan in FIG. 1 might be cheaper, i.e., the optimizer thinks that the QEP (c) will produce fewer rows during index-ANDing, but this might not actually be the case. In general, an optimizer will often be drawn towards those QEPs about which it knows the least, because use of the independence assumption makes these plans seem cheaper due to underestimation. This problem has been dubbed “fleeing from knowledge to ignorance”.
Another significant problem encountered when estimating selectivities in a real-world database management system is that the given selectivities might not be mutually consistent. For example, the selectivities s1=0.1 and s1,2=0.15 are inconsistent, because they violate the obvious requirement that sX≧sY whenever X⊂Y. In the presence of inconsistent statistics, it may be impossible to find a set of satisfiable constraints for estimating all the selectivities in a consistent and comprehensive manner.
There are two typical causes of inconsistent statistics. First, the single-column statistics are often taken from the system catalogue directly or derived by the optimizer from catalogue statistics. Because collection of accurate statistics can be a highly cost-intensive process, commercial database systems typically compute catalogue statistics using approximate methods such as random sampling or probabilistic counting. Even when the catalogue statistics are exact, the selectivity estimates computed by the optimizer from these statistics often incorporate inaccurate uniformity assumptions or use rough histogram approximations based on a small number of known quantiles. A second cause of inconsistent knowledge is the fact that different statistics may be collected at different points in time, and the underlying data can change in between collection epochs. This problem is particularly acute in more recent prior art systems where some of the MVS used by the optimizer might be based on query feedback or materialized statistical views.