1. Field of Invention
The present invention relates generally to optimizing queries in database systems. More specifically, the present invention is related to the estimating compilation time of a query optimizer.
2. Discussion of Prior Art
A query optimizer in a database system translates a non-procedural query into a procedural plan for execution, typically by generating many alternative plans, estimating the execution cost of each, and choosing the plan having the lowest estimated cost. The complexity of an optimizer is mainly determined by a search space and an enumeration algorithm enumerating plans in the search space. In general, increasing the search space of an optimizer improves the chances, but does not guarantee that it will find a better plan. Further, increasing the search space increases the cost (compilation time) for optimizing the query.
A major challenge in the design of a query optimizer is to ensure that the set of feasible plans in the search space contains efficient plans without making the set too big to be generated practically. For that purpose, most commercial database systems often have multiple levels of optimization. For example, as described in the paper by Selinger et al. entitled “Access Path Election in a Relational Database Management System,” a system can have a “low” level of optimization that employs a polynomial-time greedy method or a randomized algorithm, and a “high” level that searches all bushy plans using a conventional dynamic programming enumeration method. “Knobs” within an optimizer (such as limits on the composite inner size and whether Cartesian products are allowed or not) essentially create many additional “intermediate” optimization levels. The higher the optimization level, the better the chance of getting a good execution plan, but the longer the compilation time. Currently, database administrators must decide what the right optimization level is by trying to trade off the estimated compilation time against possible improvements in execution time. To automate such decisions, a meta-optimizer (MOP) is used.
Just as plan execution costs are estimated in a query optimizer, an essential component in a MOP is a compilation time estimator (COTE). FIG. 1 describes how such an estimator can be used in a simple MOP to choose between two levels of optimization. MOP first compiles the query at the low level 102 and obtains an estimate (measured by time) of the execution cost (call it E 104) of the best plan it finds. Then, the MOP calls the COTE to obtain the compilation time estimate (call it C) of the query at the high level 106. MOP then compares E with C 108 to decide whether to reoptimize the query at the high level. For example, if C is larger than E 110, then there is no need for further optimization since the query can complete execution by the time high-level optimization finishes. It is possible that a query can take longer to compile than to execute, especially when the query is complex, yet very selective, or when users only want to see the top n rows. On the other hand, if C is much smaller than E 112, reoptimizing the query at the high level gives the potential of further reducing E with only a relatively small cost (C). A more advanced meta-optimizer can be built by exploiting additional information, such as whether a query is static or dynamic (it's worthwhile to spend more time optimizing a static query, since the query is likely to be executed repeatedly), wherein the meta-optimizer estimates potential gains in plan quality for each optimization level.
A COTE can be used for many other applications. For example, the paper by Kabra et al. entitled “Efficient Mid-Query Re-Optimization of Sub-Optimal Query Execution Plan” teaches that it is useful to evaluate the need for mid-query reoptimization, in which an optimizer tries to generate a new plan in the middle of execution if a significant cardinality discrepancy is discovered. Since reoptimization itself takes time, the decision on whether to reoptimize or not is better made by comparing the execution cost of the remaining work with the estimated time to recompile.
Estimating the compilation time is also very useful for workload analysis tools. Examples of these tools are advisors for indexes (see papers by Chaudhuri et al. titled “Microsoft Index Tuning Wizard for SQL server 7.0” and Valentin et al. entitled “DB2 Advisor: An Optimizer Smart Enough to Recommend its Own Indexes”), materialized views (see paper by Agarwal et al entitled “Automated Selection of Materialized Views and Indexes in SQL Databases”), and partitioning (see paper by Rao et al. entitled “Automating Physical Database Design in a Parallel Database”) that have been built on top of commercial database systems. All these tools spend most of their time compiling (but not executing) a large number of queries in the input workload as part of their tuning analysis, and may run for hours or even days, depending on the workload. A COTE can be used to forecast how long such a tool will take to finish and possibly to show the progress of the tool as well.
In this section, an overview is provided outlining how the dynamic programming search algorithm works. Then, previous attempts to estimate optimization complexity are explained, including why such prior art solutions are not sufficient.
Since the join operation is implemented in most systems as a diadic (2-way) operator (referred to as join, binary join, or join pair in this document), the optimizer must generate plans that transform an n-way join into a sequence of 2-way joins (referred to as join sequences in this document) using binary join operators. For most (inner) joins, the operands may be permuted, i.e., the join operation is commutative and associative. Therefore, complex queries typically have a large number of possible join sequences. The two most important tasks of an optimizer are to find the optimal join sequence as well as the optimal implementation (referred to as join method in this document) for each binary join. Dynamic Programming (DP) was first used for join enumeration in System R (see above-mentioned paper by Selinger et al.). The essence of the DP approach is based on the assumption that the cost model satisfies the principle of optimality, i.e., the subplans of an optimal plan must be optimal themselves. Therefore, in order to obtain an optimal plan for a query joining n tables, it suffices to consider only the optimal plans for all pairs of non-overlapping m tables and n-m tables, for m=1, 2, . . . , n-1.
To avoid generating redundant plans, DP maintains a memory-resident structure (referred to as MEMO, following the terminology used in the paper by Graefe et al. entitled “The Volcano Optimizer Generator: Extensibility and Efficient Search”) for holding non-pruned plans. Each MEMO entry corresponds to a subset of the tables (and applicable predicates) in the query. The algorithm runs in a bottom-up fashion by first generating plans for single tables. It then enumerates joins of two tables, then three tables, etc., until all n tables are joined. For each join it considers, the algorithm generates join plans and incorporates them into the plan list of the corresponding MEMO entry. Plans with larger table sets are built from plans with smaller table sets. The algorithm prunes a higher cost plan if there is a cheaper plan with the same or more general properties for the same MEMO entry. Finally, the cheapest plan joining n tables is returned.
Most of the compilation time is spent on join optimization (which includes both join enumeration and plan generation). Our experimental study on DB2 confirms such an observation. FIG. 2 gives a breakdown of the compilation time for a real query workload. More than 90% of the time is either directly or indirectly spent on generating and saving join plans of different types. Ono and Lohman in the paper entitled, “Measuring the Complexity of Join Enumeration in Relational Query Optimization,” introduced a key finding that optimization complexity is not determined by the total number of complete join trees that can be formed, but by the number of distinct binary joins. This is because the principle of optimality allows smaller subplans (cached in MEMO) to be “shared” by multiple larger plans. For example, for a query joining four tables A, B, C, and D together, the plan for a join between A and B can be used in join (AB,C) as well as in join (AB,D). Hence, they attempt to estimate the compilation complexity of a query by the number of joins enumerated. The underlying assumption being that the cost of optimizing each join is approximately the same. These attempts suffer from the following limitations:                Determining the number of joins from a general join graph is a hard problem. Although there are closed formulas for certain special classes such as the linear and the star-shaped queries (see papers by Ioannidis et al. entitled “Left-Deep vs. Bushy Trees: An Analysis of Strategy Spaces and its Implication for Query Optimization” and the above-mentioned paper by Ono et al.) and polynomial-time algorithms for counting the number of joins for queries with an acyclic join graph (see paper by Galindo et al. entitled “Uniformly-Distributed Random Generation of Join Orders”), counting the number of different joins with cycles in the join graph is as hard as counting Hamiltonian tours in a graph. The problem is #P-complete, which is even harder than NP-Hard (see paper by Jerrum entitled “Counting Trees in a Graph is #P-Complete”). Cycles are common in real queries because of automatic query generation tools as well as implied predicates computed through transitive closure in commercial systems. To make matters worse, optimizers in real systems are typically customized by various “knobs” and may not implement a full dynamic programming algorithm. For example, one such knob may limit the number of tables joined in the composite inner of bushy plans. Another example is that the optimizer may only support free-reordering plans for outerjoins, but not full reordering with compensation such as generalized outerjoins (see paper by Galindo et al. entitled “Outerjoin Simplification and Reordering for Query Optimization”). All these variants make estimating the number of joins a very difficult problem.        A more severe problem is that, even if the number of joins in a query is estimated correctly, the cost per join is far from uniform. Furthermore, each join typically generates a different number of plans depending on various “interesting” properties that need to be kept. Such properties are extensions of the important concept of interesting orders (see above-mentioned paper by Selinger et al.) introduced in System R. Suppose there exist two plans generated for table R, one ordered on R.a (call it p1) and the other is not ordered (call it p2). Also, suppose that p1 is more expensive than p2. Normally, p1 should be pruned by p2. However, if table R can later be joined with table S on attribute a, p1 can actually make the sort-merge join between the two tables cheaper than p2 since it doesn't have to sort R. To avoid pruning p1, System R identified orders of tuples that were potentially beneficial to subsequent operations for that query (hence the name interesting orders), and compared two plans only if they represented the same expression and had the same interesting order. This causes the number of plans generated for each join and stored in MEMO to vary. FIG. 3a illustrates a 3-way join query and the plans kept in the corresponding MEMO structure. A list of plans is stored for each MEMO entry, each carrying a different order property that is still interesting. A DC value is used to represent a “don't care” property value, which corresponds to all retired orders (orders no longer useful for subsequent operations such as joins) or no order at all. The cheapest plan with a DC property value is also stored in each MEMO entry if this plan is cheaper than any other plan with interesting orders. Modifying the query to that in FIG. 3b, by adding an orderby clause, increases the number of interesting order properties that need to be kept in all MEMO entries containing A. By comparing FIG. 3a with FIG. 3b, it is seen that the number of generated join plans changes, even though the join graph is still the same.        
The following references provide for a general teaching in the area of optimization of database queries.
The U.S. Pat. No. 5,301,317 to Lohman et al. discloses a system providing automatic adjustment of resources devoted to query optimization according to estimated query execution time. The described system permits the query optimizer to automatically trade off the time spent estimating the execution cost of alternate query execution plans against the potential savings in execution time that one of those alternate plans may yield. The number of alternate plans considered is adjusted by selecting compile-time parameters and heuristic criteria for limiting the primitive database operators used in the alternate plans, thereby establishing a new search space. The parameters and criteria are adjusted according to the estimate of execution cost for the optimal plan from a first search space.
The U.S. Pat. No. 5,335,345 to Frieder et al. discloses a method for executing a query comprising a sequence of operations to be performed on one or more relational databases. The steps of the method comprise statistically sampling the relational databases at the times the operations are to be executed and then dynamically optimizing the performance of the operations based on the statistical samples obtained as a result of the sampling step.
The U.S. Pat. No. 5,412,806 to Du et al. discloses a programmable machine system and method for managing electronic data access among multiple different relational databases in a network distributed database environment. The machine is programmed so that it can construct cost-effective access strategies for any of the participating databases absent any DBMS-specific cost models. The system provides query optimization across different database management systems in a network-distributed database environment based on a calibrating database relying only on typical relational database statistics. Cost data is developed by running queries in the various databases against the calibrating database.
The U.S. Pat. No. 5,694,591 to Du et al. discloses a method for optimizing data retrieval from a multi-database system by restructuring a database query tree to optimize query response time in a two-step optimization process.
The U.S. Pat. No. 6,026,391 to Osborn et al. discloses a query performance prediction (QPP) module that is a part of an applications layer residing on respective user stations operating in conjunction with a centralized host computer system. The QPP module correlates estimated system cost information (for a database query provided from a database management system associated with the computer system) with statistics compiled from previous queries in order to estimate the system response time.
The U.S. Pat. No. 6,345,267 to Lohman et al. discloses a relational database management system that includes a query processor that uses a query operator partition property to perform QEP pruning and to ensure that data input to a query operator is partitioned appropriately for the operation. The partition property indicates the group of network nodes across which a table is distributed. The query processor also makes use of partition classes that are designated “interesting classes” to perform preoptimization planning, query pruning, and look-ahead partitioning (based on partition classes that are identified as being of interest to future operations), thereby more efficiently evaluating complex query statements in an MPP, shared-nothing environment.
The U.S. Pat. No. 6,351,742 to Agarwal et al. provides for a method and system for optimizing the execution of database statements are described. An execution plan is generated for a database statement, wherein the database statement contains a predicate having one or more arguments whose values are not known at the time the execution plan is generated (e.g., at compile time).
The U.S. Pat. No. 6,353,818 to Carino provides for a method comprising the steps of generating a plurality of query plans for the database query, evaluating the plurality of query plans using a measured value for the resource metric, selecting a query plan from the evaluated query plans based on the measured resource metric, and executing the selected query plan.
The U.S. Pat. No. 6,397,204 to Liu et al. provides for a method for joining a multi-column table and at least two satellite tables, wherein each satellite table is comprised of multiple rows and at least one join column and each multi-column table is comprised of multiple rows and join columns corresponding to the join columns in the satellite tables.
The US patent publication to Haas et al. (2001/0037327) provides for selectivity estimation for processing SQL queries containing a HAVING clause in an SQL query.
The paper by Polyzotis et al. entitled “Statistical synopses for graph-structured XML databases” provides for statistical synopses for graph-structured XML databases. In the paper entitled, “Query processing and optimization on Oracle Rdb”, Antoshenkov et al. teach query processing and optimization in the Oracle® database. In the paper entitled, “A comparison of selectivity estimators for range queries on metric attributes”, Blohsfeld et al. provide a comparison of nonparametric estimation methods for computing approximations of the selectivites of queries, in particular, range queries. In the paper entitled, “Selectivity estimation for spatio-temporal queries to moving objects”, Choi et al. provide for a query optimizer that requires selective estimation of a query to choose the most efficient plan. Swami in the paper entitled “Optimization of Large Join Queries: Combining Heuristics and Combinatorial Techniques” teaches the use of heuristics in optimizing queries with a large number of joins. Ioannidis et al., in the paper entitled “Randomized Algorithms for Optimizing Large Join Queries,” teach query optimization for relational database systems as a combinatorial optimization problem.
The IBM technical disclosure bulletin entitled “Heuristic Method for Joining Relational Database Tables” provides for a method (specifically, a greedy method) for joining tables responsive to relational queries. The disclosed method iteratively estimates the cost of all joins it can perform and identifies the cheapest join, without concern for the impact that a particular choice has on future joins.
Whatever the precise merits, features, and advantages of the above cited references, none of them achieves or fulfills the purposes of the present invention.