Most query optimizers for relational database management systems (RDBMS) rely on a cost model to choose the best possible query execution plan for a given query. Thus, the quality of the query execution plan depends on the accuracy of cost estimates. Cost estimates, in turn, crucially depend on cardinality estimations of various sub-plans (intermediate results) generated during optimization. Traditionally, query optimizers use statistics built over base tables for cardinality estimates, and assume independence while propagating these base-table statistics through the query plans. However, it is widely recognized that such cardinality estimates can be off by orders of magnitude. Therefore, the traditional propagation of statistics that assumes independence between attributes can lead the query optimizer to choose significantly low-quality execution plans.
The query optimizer is the component in a database system that transforms a parsed representation of an SQL query into an efficient execution plan for evaluating it. Optimizers examine a large number of possible query plans and choose the best one in a cost-based manner. For each incoming query, the optimizer iteratively explores the set of candidate execution plans using a rule-based enumeration engine. After each candidate plan or sub-plan is generated, the optimizer estimates its execution cost, which in turn refines the exploration of further candidate plans. Once all “interesting” plans are explored, the most efficient one is extracted and passed on to the execution engine.
The cost estimation module is critical in the optimization process, since the quality of plans produced by the optimizer is highly correlated to the accuracy of the cost estimation routines. The cost estimate for a sub-plan, in turn, depends on cardinality estimations of its sub-plans. Traditionally, query optimizers use statistics (mainly histograms) that are built over base tables to estimate cardinalities. Histograms are accurate for estimating cardinalities of simple queries, such as range queries. For complex query plans, however, the optimizer estimates cardinalities by “propagating” base-table histograms through the plan and relying on some simplifying assumptions (notably the independence assumption between attributes).
The sub-plan shown in FIG. 3 can be used to demonstrate how base table histograms are used to estimate the cardinality of a SELECT-PROJECT-JOIN (SPJ) query. The query corresponding to the plan shown in FIG. 3 is SELECT*FROM R,S WHERE R.x=S.y AND S.a<10. If histograms over R.x, S.y, and S.a are available, the histograms over R.x and S.y are first used to estimate the cardinality of RS and then the histogram over S.a is propagated through the join upwards in the join-graph. The propagated histogram is then used to estimate the cardinality of Sa<10 over the intermediate result RS, to finally obtain the cardinality of σS.a<10 (RS).
When the cardinality estimation technique illustrated in FIG. 3 is used in query optimizers, three simplifying assumptions are relied on: independence, containment, and sampling. The independence assumption is relied on when propagating histogram HS.a over S.a through the join predicate Rx=yS. This means that the bucket frequencies for HS.a are uniformly scaled down so that the sum of all frequencies in the propagated histogram equals the estimated cardinality of Rx=yS. Implicit in this procedure is the assumption that distributions of attributes in R and S are independent.
The containment assumption is relied upon when estimating the cardinality of joins using histograms. The buckets of each histogram are aligned and a per-bucket estimation takes place, followed by an aggregation of all partial results. The containment assumption dictates that for each pair of buckets, each group of distinct valued tuples belonging to the bucket with the minimal number of different values joins with some group of tuples in the other bucket. For instance, if the number of distinct values in bucket bR is 10, and the number of distinct values in bucket bS is 15, the containment assumption states that each of the 10 groups of distinct valued tuples in bR join with one of the 15 groups of distinct valued tuples in bS.
Random sampling is a standard technique for constructing approximate base-table histograms. Usually the approximated histograms are of good quality regarding frequency distribution. However, estimating the number of distinct values inside buckets using sampling is difficult. The sampling assumption states that the number of distinct values in each bucket predicted by sampling is a good estimator of the actual of distinct values.
Often one or more of the simplifying assumptions do not reflect real data values and distributions. For instance, many attributes are actually correlated and the independence assumption is often inaccurate. Therefore, the optimizer might rely on wrong cardinality information and therefore choose low quality execution plans. More complex queries (e.g., n-way joins) only exacerbate this problem, since estimation errors propagate themselves through the plans.