1. Field of Invention
The present invention relates generally to the field of query plan optimization in relational databases. More specifically, the present invention is related to optimizing query plans having conjunctive predicates.
2. Discussion of Related Art
Estimating the selectivity of predicates has always been a challenging task for a query optimizer in a relational database management system. A classic problem has been the lack of detailed information about the joint frequency distribution of attribute values in a table of interest. Perhaps ironically, the additional information now available to modern query optimizers has, in a certain sense, made the selectivity-estimation problem even more difficult.
Specifically, consider the problem of estimating the selectivity s1,2, . . . , n of a conjunctive predicate of the form p1p2 . . .  pn, where each pi is either a simple predicate, also known as a Boolean Factor (BF) of the form “column op literal”. Here column is a column name, op is a relational comparison operator such as “=”, “>”, or “LIKE”, and literal is a literal in the domain of the column; some examples of simple predicates are ‘make=“Honda”’ and ‘year>1984’. The phrase “selectivity of a predicate p” is used to indicate the fraction of rows in the table that satisfy p. In older optimizers, statistics are maintained on each individual column, so that the individual selectivities s1, s2, . . . , sn of p1, p2, . . . , pn are available. Such a query optimizer would then impose an independence assumption and estimate the desired selectivity as s1,2, . . . , n=s1*s2* . . . *sn. Such estimates ignore correlations between attribute values, and consequently can be wildly inaccurate, often underestimating the true selectivity by orders of magnitude and leading to a poor choice of query execution plan (QEP).
As an example of using the uniformity assumption, assuming ten distinct values in the MAKE column, the selectivity of the predicate p1: ‘MAKE=“Honda”’ is estimated as s1= 1/10. Similarly, with 100 distinct values in the MODEL column and 10 distinct values in the COLOR column, we obtain s2= 1/100 for p2: MODEL=“Accord” and s3= 1/10 for p3: COLOR=“red”. Advanced commercial optimizers can improve upon these basic estimates by maintaining frequency histograms on the values in individual columns.
As indicated previously, in the absence of other information, current optimizers compute the selectivity of a conjunctive predicate using the independence assumption. For instance, p1,2,3=p1p2p3 is the predicate restricting a query to retrieve all red Honda Accords, and the selectivity of p1,2,3 is computed as s1,2,3=s1*s2*s3. In the previous example, the optimizer would estimate the selectivity of red Honda Accords to be 1/10000. As only Honda is the only automobile manufacturer to make Accords, there is a strong correlation between these two columns, and in this case, an actual functional dependency. The actual selectivity of p1,2 must be 1/100. Thus, a more appropriate estimate of the selectivity of p1,2,3 is 1/1000, which is one order of magnitude greater than the estimate using the independence assumption.
Ideally, to overcome the issues associated with the independence assumption, the optimizer should store a multidimensional joint frequency distribution for all of the columns in the database. In practice, the amount of storage required for the full distribution is exponentially large, making this approach infeasible. It has therefore been proposed to store selected multivariate statistics (MVS) that summarize important partial information about the joint distribution. Proposals have ranged from multidimensional histograms on selected columns to other, simpler forms of column-group statistics. Thus, for predicates p1, p2, . . . , pn, the optimizer typically has access to the individual selectivities s1, s2, . . . , sn as well as a limited collection of joint selectivities, such as s1,2, s3,5, and s2,3,4. The independence assumption is then used to “fill in the gaps” in the incomplete information; it is now possible to estimate the unknown selectivity s1,2,3 by s1,2*s3.
A new and serious problem now arises, however. There may be multiple, non-equivalent ways of estimating the selectivity for a given predicate. FIG. 1, for example, shows possible QEPs for a query consisting of the conjunctive predicate p1p2p3. The QEP in FIG. 1(a) uses an index-ANDing operation () to apply p1p2 and afterwards applies predicate p3 by a FETCH operator, which retrieves rows from a base table according to the row identifiers returned from the index-ANDing operator. Supposing that the optimizer knows the selectivities s1, s2, s3 of the BFs p1, p2, p3 as well as about a correlation between p1 and p2 via knowledge of the selectivity s1,2 of p1p2; using independence, the optimizer might then estimate the selectivity of p1p2p3 as sa1,2,3=s1,2*s3.
FIG. 1(b) shows an alternative QEP that first applies p1p3 and then applies p2. If the optimizer also knows the selectivity s1,3 of p1p3, use of the independence assumption might yield a selectivity estimate sb1,2,3=s1,3*s2. However, this would result in an inconsistency if, as is likely, sa1,2,3≠sb1,2,3. Potentially, there can be other choices, such as s1*s2*s3 or, if s2,3 is known, s1,2*s2,3/s2; the latter estimate amounts to a conditional independence assumption. Any choice of estimate will be arbitrary, since there is no supporting knowledge to justify ignoring a correlation or assuming conditional independence; such a choice would arbitrarily bias the optimizer toward choosing one plan over another. Even worse, if the optimizer does not utilize the same choice of estimate every time that it is required, then different plans will be costed inconsistently, leading to “apples and oranges” comparisons and unreliable plan choices.
Assuming that the QEP in FIG. 1(a) is the first to be evaluated, a modern optimizer would avoid the foregoing consistency problem by recording the fact that s1,2 was applied and then avoiding future application of any other MVS that contain either p1 or p2, but not both. In the exemplary figure, the selectivities for the QEP in FIG. 1(c) would be used whereas the selectivities in FIG. 1(b) would not. The optimizer would therefore compute the selectivity of p1p3 to be s1*s3 using independence, instead of using the MVS s1,3. Thus, the selectivity s1,2,3 would be estimated in a manner consistent with FIG. 1(a). Note that, when evaluating the QEP in FIG. 1(a), the optimizer used the estimate sa1,2,3=s1,2*s3 rather than s1*s2*s3, since, intuitively, the former estimate better exploits the available correlation information. In general, there may be many possible choices; the complicated (ad hoc) decision algorithm used by DB2 UDB is described in more detail subsequently.
Cardinality Estimation for Conjunctive Predicates with MVS in DB2
In the absence of a complete set of statistical information, and due to the complexity and the cost of collecting such information, the DB2 UDB optimizer tries to exploit as much statistical data as is available. Shown below are four different scenarios in which it is assumed that the selectivities of simple predicates are known.
Case I: No Statistics on Conjunctive Predicates are Known.
For this trivial case, the selectivity of the conjunctive predicate is the product of the selectivities of the individual predicates.
Example:
                Estimate s1,2,3 given s1, s2, and s3.s1,2,3=s1*s2*s3.Case II: Statistics on the Conjunctive Predicates are Known and there Exists a Conjunctive Predicate Whose BFs Match the Set of Predicates Whose Combined Selectivity is to be Determined.For this simple case, the selectivity of the full conjunctive predicate is the selectivity of the conjunctive predicate whose BFs match the set of predicates in question.Example:        
(a) Estimate s1,2,3 given s1,2,3.s1,2,3=s1,2,3.
(b) Estimate s1,2,3 given s1,2, s2,3, and s1,2,3.s1,2,3=s1,2,3.Case III: Statistics on the Conjunctive Predicates are Known and there Exist Some Conjunctive Predicates Whose Intersection is an Empty Set.In this case, the selectivity of the conjunctive predicate is the product of the selectivities of all conjunctive predicates and the selectivities of the individual predicates not participating in any conjunctive predicate.Example:
(a) Estimate s1,2,3,4,5 given s1,2 and s3,4 s1,2,3,4,5=s1,2*s3,4*s5.
(b) Estimate s1,2,3,4,5 given s1,2,5 and s3,4 s1,2,3,4,5=s1,2,5*s3,4.
Although the ad hoc method described above ensures consistency, it ignores valuable knowledge, for example, the correlation between p1 and p3. Moreover, this method complicates the logic of the optimizer, because cumbersome bookkeeping is required to keep track of how an estimate was initially derived and to ensure that it will always be computed in the same way when costing other plans. Even worse, ignoring the known correlation between p1 and p3 also introduces bias towards certain QEPs: if, as is often the case with correlation, s1,3>>s1*s3, and s1,2>>s1*s2, and if s1,2 and s1,3 have comparable values, then the optimizer will be biased towards the plan in FIG. 1(c), even though the plan in FIG. 1(a) might be cheaper. In general, an optimizer will often be drawn towards those QEPs about which it knows the least, because use of the independence assumption makes these plans seem cheaper due to underestimation. Query optimizers that do not use MVS actually estimate their selectivities for conjunctive queries according to the ME principle: they assume uniformity when no information about column distributions is available, and they assume independence because they do not know about any correlations.
One approach to query optimization is to utilize the maximum entropy principle. The maximum entropy principle models all that is known and assumes nothing about the unknown. It is a method for analyzing the available information in order to determine a unique epistemic probability distribution. Information theory defines for a probability distribution p a measure of uncertainty called entropy: H(p)=−Σipi log pi. The ME principle prescribes selection of the unique probability distribution that maximizes the entropy function H(p) consistently with respect to the known information. Entropy maximization without any additional information uses the single constraint that the sum of all probabilities is equal to one. The ME probability distribution then is the uniform distribution. When constraints only involve marginals of a multivariate distribution, the ME solution coincides with the independence assumption.