Most query optimizers for relational database management systems (RDBMS) rely on a cost model to choose the best possible query execution plan for a given query. Thus, the quality of the query execution plan depends on the accuracy of cost estimates. Cost estimates, in turn, crucially depend on cardinality estimations of various sub-plans (intermediate results) generated during optimization. Traditionally, query optimizers use statistics built over base tables for cardinality estimates, and assume independence while propagating these base-table statistics through the query plans. However, it is widely recognized that such cardinality estimates can be off by orders of magnitude. Therefore, the traditional propagation of statistics that assumes independence between attributes can lead the query optimizer to choose significantly low-quality execution plans.
The query optimizer is the component in a database system that transforms a parsed representation of an SQL query into an efficient execution plan for evaluating it. Optimizers usually examine a large number of possible query plans and choose the best one in a cost-based manner. To efficiently choose among alternative query execution plans, query optimizers estimate the cost of each evaluation strategy. This cost estimation needs to be accurate (since the quality of the optimizer is correlated to the quality of its cost estimations), and efficient (since it is invoked repeatedly during query optimization).
Although the implementation details of commercially available optimizers vary, most if not all optimizers share the same basic structure shown in FIG. 2. For each incoming query, the optimizer maintains a set of sub-plans already explored, taken from an implicit search space. An enumeration engine navigates through the search space by applying rules to the set of explored plans. Some optimizers have a fixed set of rules to enumerate all interesting plans while other implement extensible transformational rules to navigate through the search space. All systems use dynamic programming or memoization to avoid recomputing the same information during query optimization. For each discovered query plan, a component derives different properties if possible, or estimates them otherwise. Some properties (e.g. cardinality and schema information) are shared among all plans in the same equivalence class, while others (e.g. estimated execution cost and output order) are tied to a specific physical plan. Finally, once the optimizer has explored all interesting plans, it extracts the most efficient plan, which serves as the input for the execution engine.
A useful property of a query plan from an optimization perspective is the estimated execution cost, which ultimately decides which is the most efficient plan. The estimated execution cost of a plan, in turn, depends heavily on the cardinality estimates of its sub-plans. Therefore, it is fundamental for a query optimizer to rely on accurate and efficient cardinality estimation algorithms.
Referring now to FIG. 3(a) and given that |R|≈|S|≈|T|, if the query optimizer has knowledge that R.a<10 is much more selective than T.b>20 (i.e., just a few tuples in R verify R.a<10 and most of the tuples in T verify T.b>20), the optimizer should determine the plan P1 in FIG. 3(b) as more efficient than P2 in FIG. 3(c). The reasons is that P1 first joins R and S producing a small intermediate result that is in turn joined with T. In contrast, P2 produces a large intermediate result by first joining S and T.
Cardinality estimation uses statistical information about the data that is stored in the database system to provide estimates to the query optimizer. Histograms are the most common statistical information used in commercial database systems. A histogram on attribute x consists of a set of buckets. Each bucket bi represents a sub-range ri of x's domain, and has associated with it two values: ft and dvi. The frequency ft of bucket bi corresponds to the number of tuples t in the data set for which t.xεri, and the value dvi of bucket bi represents the number of distinct values of t.x among all the tuples t for which t.xεri. The main assumption is that the distribution of tuples inside each histogram bucket is uniform. The uniform spread model inside buckets implies that each bucket bi is composed of dvi equidistant groups of ft/dvi tuples each. The density of a bucket is defined as δi=ft/dvi, i.e., the number of tuples per distinct value (assuming uniformity) that are represented in the bucket. Other techniques for modeling bucket contents are also known such as the continuous or randomized models.
Histograms are currently used to estimate the cardinality of complex query plans as follows. The uniformity assumption inside histogram buckets suggests a natural interpolation-based procedure to estimate the selectivity of range and join predicates. Given the selection query σR a<20(R) and a histogram on R.a, the cardinality of the query is estimated by considering, one at a time, all histogram buckets that are completely or partially covered by the predicate and aggregating all intermediate results.
FIG. 4 illustrates a four-bucket histogram on attribute R.a. Bucket b1 covers 0≦x≦10 and has a frequency of 100 (i.e. it represents 100 tuples in the data set). Similarly, buckets b2, b3, and b4 represent 50, 80, and 100 tuples, respectively. The histogram can be used to estimate the cardinality of the range predicate p=R.a<20. Since p completely includes bucket b1, all 100 tuples in b1 must verify p. Also p is disjoint with buckets b3 and b4, so no single tuple in b2 of b4 verifies p. Finally, p partially overlaps with bucket b2 (in particular, p is verified by 50% of b2's uniformly spread distinct values). Therefore based on the uniformity assumption that 50% of the tuples in b2 verify p, the number of tuples verifying predicate p=R.a<20 is estimated to be 100+50/2=125.
In general, selection queries may have multiple predicates on different attributes on the table. For example, given the query:
SELECT * FROM R
WHERE R.a>10 AND R.b<100
and assuming there are histograms on R.a and R.b available, if sa is the selectivity for R.a>10 and sb is the selectivity for R.b<100, the selectivity for the whole predicate is estimated, assuming independence as sa·sb. Multidimensional histograms have proved to be accurate in modeling attribute's correlation. However, these novel estimation techniques are not widely used in commercial databases yet.
Histograms are used to estimate the selectivity of join queries such as Rx=y S. Histograms on R.x and S.y can be used to improve the accuracy of the cardinality estimation. Referring to FIG. 5, histograms HR.x and HS.y have buckets that are delimited by square brackets. The procedure to estimate the cardinality of the join predicates follows. In Step 1, the histogram buckets are aligned so that their boundaries agree (usually splitting some buckets from each histogram). Buckets b2 and b′2 in FIG. 5 share the same left boundary. However, bucket b2 spans beyond bucket b′2's right boundary. Therefore, we split bucket b2 into two sub-buckets. The left sub-bucket boundary agrees with that of bucket b′2. The right sub-bucket boundary starts at the same position as bucket b′3 but ends before b′3 does. Bucket b′3 is split in the same way, and this procedure continues until all original buckets are aligned as shown in Step 2. This approach at most doubles the total number of buckets in both histograms.
After the buckets are aligned, a per bucket estimation of join sizes is performed. Many techniques can be used to perform this estimation. For example, using the containment assumption, it is concluded that each group of distinct valued tuples belonging to the bucket with minimal number of different values joins with some group of tuples in the other bucket. For instance, in Step 2 of FIG. 5, the three groups of tuples from the upper bucket are assumed to match with three of the five groups of tuples in the lower bucket. The result of joining the pair of buckets is modeled as a new bucket with three distinct values and density 40=2·20. Each distinct value in the resulting bucket represents 40 tuples, which is the product of the original bucket densities. Therefore, the frequency of the new bucket is 120=3·40.
After applying the same procedure to each pair of aligned buckets, the third and last step consists of aggregating the partial frequencies from each resulting bucket to get the cardinality estimation for the whole join.
The techniques discussed with respect to selection and join queries are used when the predicates are directly applied to the base tables that hold the histograms involved. For select project join (SPJ) queries the cardinality estimation requires propagating statistics through predicates. An example of an SPJ query follows:
SELECT * FROM R,S
WHERE R.x=S.y AND S.a<10
Assuming histograms on R.x and S.y and S.a are available, there are two ways to estimate the selectivity of the whole expression, outlined as FIGS. 6(a) and 6(b). In one approach, histograms for R.x and S.y may be used to estimated the selectivity of RS ignoring the predicate S.a<10. Then assuming independence between S.y and S.a, the histogram for S.a is propagated through the join upwards in the tree. The propagated histogram is then used to estimate the selectivity of S.a<10 over the result from RS, to finally obtain the selectivity of σS a<10(RS).
Another approach is to use the histogram for S.a to first estimate the selectivity of σS a<10(S). Then, assuming independence between S.y and S.a, the histogram for S.y is propagated though the selection operator and used together with the histogram of R.x to estimate the selectivity of R(σS a<10(S)). It is important to note that although the two methods above estimate the same expression, i.e., R(σS a<10(S))≡σSa<10(RS), the resulting estimation can be slightly different.