1. Field of the Invention
This invention relates to automatic query optimization methods in database systems and, more specifically, to a system for adjusting the resources devoted to query optimization responsive to estimated query execution time.
2. Description of the Related Art
Query optimizers operate in relational database management systems to translate queries posed in a non-procedural language such as SQL into an efficiently executable plan. Reference is made to C. J. Date, An Introduction to Database Systems, Vol. 1, Fourth Edition, "Chapter 16: Query Optimization", p. 335, Addison-Wesley Publishing Co., Reading, Mass. (1986), for an overview of the optimization process. According to Date, the overall optimization process can be viewed as four broad stages. These are (1) casting the user query into some internal representation, (2) converting to canonical form, (3) choosing prospective implementation procedures, and (4) generating executable plans and choosing the cheapest of such plans.
The desired query execution plan consists of an ordered series of primitive database operators and is typically developed by choosing the plan having the least estimated execution cost from among several alternative plans making up a "search space". A search space embraces a number of query execution plans that are limited according to the type and sequence of the primitive database operators allowed in the plans. Since only two-way join operators are usually provided as primitives, an optimizer is normally obliged to select the "best" sequence of two-way joins to achieve the N-way join of data tables requested by a non-procedural user query.
Thus, the computational complexity of the optimization process and the execution efficiency of the plan chosen by that process are dominated by the number of such possible primitive operator sequences that must be evaluated by the optimizer. That is, they are dominated by the size of the "search space" spanned by the query optimizer. An exhaustive enumeration is clearly exponential in the number of tables and hence is impractical for all but trivial queries. Thus, some scheme for limiting the search space must be provided to any query optimizer.
The typical query optimizer, for efficiency during optimization, limits the query space of join sequences to be evaluated by (1) using a form of dynamic programming to eliminate sub-plans whose estimated cost is greater than other sub-plans producing equivalent results, (2) restricting the "inner" operand of each join to be a single table, and (3) requiring each join to have at least one equi-join predicate of the form (column1=column2), where (column1) must be in one operand of the join and (column2) is in the other join operand. The restrictive form of this third limitation implies that predicates involving inequality, arithmetic or type conversion or more than two tables do not satisfy the criteria for a "feasible" join sequence and are excluded from the limited search space.
It is easily proven that dynamic programming never eliminates a potentially better plan but the second and third limitations mentioned above are heuristics that may exclude a superior plan without notice under some circumstances. Each query plan is built by combining a set of candidate implementation procedures, one such procedure for each of the primitive database operations in the query. There are normally many feasible plans for any given query and it is often not reasonable to generate all possible plans because of the expense of doing so. The process of establishing primitive database operator selection criteria to reduce the number of query execution plans to be evaluated can viewed as "reducing the search space". Accordingly, a "search space" herein denotes a set of executable query plans that are selected on the basis of some formal or heuristic criteria related to primitive database operators.
When a given query is to be executed repeatedly, additional optimization efforts such as considering more join sequences over a larger search space can be justified by the resulting improvement in execution costs leveraged over multiple query executions. On the other hand, response time for interactive ad hoc queries includes both optimization and execution costs. Thus, heuristics that severely limit the optimizer's search space are usually beneficial for interactive systems. Also, certain knowledge peculiar to an application may generate special heuristics for excluding plans that are dominated in that application but not in others. Thus, there is a clearly felt need for greater flexibility in selecting feasible join sequences over the fixed join selection criteria in query optimizers known in the art.
Practitioners in the art have responded to this need by proposing new methods for improving query optimizer efficiency. Reference is made to K. Ono et al, "Extensible Enumeration of Feasible Joins for Relational Query Optimization", IBM Research Report, RJ 6625 (63936) Dec. 28, 1988, IBM Almaden Research Center, San Jose, Calif., for disclosure of a query optimizer that allows simple adjustment of the number of alternative join sequences that are automatically generated, including a new criterion that requires an explicit or implied join predicate for any join, where such join predicate can be any predicate relating more than one table.
Essentially, Ono et al teach a method for quickly and effectively adjusting the size of the search space by changing the primitive database operator selection criteria. Their extensible join enumeration algorithm enumerates a wider variety of feasible two-way joins than other methods previously known in the art because it exploits predicates that (a) reference more than two tables, (b) involve arithmetic or conversion operators, or (c) are implied by other predicates. Also, the Ono et al method can parametrically adjust the space of join sequences evaluated by the optimizer to allow or disallow (1) composite tables as the inner operand of a join or (2) joins between two tables having no join predicate linking them, such as Cartesian products. Both Cartesian products and composite tables that are themselves a result of a join are usually excluded by query optimizer systems to limit the search space size. The blanket exclusion of these two database operations functions as an inflexible limit on query optimizer flexibility and efficiency.
Reference is also made to Ono et al, "Measuring the Complexity of Join Enumeration in Query Optimization", Proceedings of the 16th VLDB Conference, Brusbane, Australia (1990), pp. 314-325. Ono et al conclude that enumeration of the join sequences for a query is the dominant factor in both the query optimization costs and the quality of the resulting execution plans. Their join enumeration procedure generates better execution plans by enlarging the set of feasible plans to include composite inners and Cartesian products at any place in the join sequences. This allows the search space to be adjusted by making incremental changes to the join sequence selection criteria. However, Ono et al do not consider the problem of automatically varying the search space and do not teach how to ensure global query optimization.
Other practitioners have applied global optimization strategies to the problem of optimization of large join queries. The global optimization problem can generally be considered to be a problem of locating the global optimum in a large space having many local optima. The usual solution is to search through the space while indexing across several of the local optima until such time as the search "should" be terminated according to some criteria. Reference is made to, for instance, Boender, et al, "Bayesien Stopping Rules for Multi-Start Global Optimization Methods", Mathematical Programming, Vol. 37, pp. 59-80, (1987), for a discussion of the art of solving the unconstrained global optimization problem. Reference is also made to A. Swami, et al, "Optimization of Large Join Queries", Proceedings of ACM-SIGMOD International Conference on Management of Data, pp 8-17, June 1988, for a discussion of their application of global optimization techniques to the query optimizer problem Swami et al express the query optimizer problem as an optimal join order problem by using several commonly known heuristics for limiting search space size. They first severely limit the size of their search space and then perform global optimization according to simple iterative improvement and simulated annealing methods. They neither teach nor suggest means for dynamically varying the search space in response to estimated query cost and their method does not guarantee that the universally optimum query plan even exists in their limited search space.
A similar approach is suggested by Xerox Advanced Information Technology researchers in an unpublished memorandum. This Pilot Pass approach uses an initial "pilot" phase of optimization over a limited search space to determine an estimated execution time. The set of alternatives is limited in this first pass by setting an upper bound on the longest constituent operation time in any evaluated plan rather than excluding entire classes of plans to limit the search space. If this longest constituent operation time exceeds the estimated execution time for the optimum plan, then no plan that is excluded by the pilot phase can possibly run in less that the estimated execution time so the plan found in the pilot phase is clearly optimal. Otherwise, a second phase of optimization then considers the entire set of alternative plans by setting the longest constituent operation time threshold to infinity and then pruning any plan whose execution time estimate exceeds the initial pilot pass estimate of query execution time. This Pilot Pass approach does not consider adjusting the size of the search space but rather defines a filter that summarily eliminates a plan from consideration as soon as one of its constituent operations is found to exceed an arbitrary execution time threshold. This approach neither considers nor suggests a method for trading off optimization time automatically in response to prospects for improved execution time because the second "unlimited" search space evaluation is always executed if the longest constituent plan operation time is less than the pilot pass estimate of execution time, even if the estimate for execution time is smaller than the time required to evaluate the "unlimited" search space.
In U.S. Pat. No. 4,769,772, P. A. Dwyer discloses an automated query optimization method using both global and local optimization for distributed databases. Dwyer chooses to optimize locally if such is more efficient than optimizing over the distributed system. She neither suggests nor considers adjusting the size of the search space and uses a fixed set of criteria to establish the local and distributed search spaces without providing any estimation of optimization costs.
In U.S. Pat. No. 4,531,186, J. M. Knapman discloses a user-friendly database access method for writing user queries for a special type of database system. Knapman does not consider the problem of query optimization for large databases, generally limiting his teachings to evaluation and selection of database access paths by constructing a data directory and accordingly choosing application programs and program communication blocks.
In U.S. Pat. No. 4,829,427, N. L. Green discloses a database query code generation and optimization technique responsive to the cost of alternative access methods. Green also does not consider the problem of optimizing query execution plans but limits her teachings to methods for choosing data access paths after a query execution plan is already established. Green neither considers nor suggests methods for adjusting query execution plan efficiency according to the interaction between query plan and data access path.
Reference is made M. S. Lakshmi, et al, "Access Path Selection in Relational Database Systems", IBM Technical Disclosure Bulletin, Vol. 30, No. 9, February 1988, pp. 420-421, for another discussion of data access path optimization limited in application to single variable relational queries in database management systems. Again, Lakshmi et al do not consider the problem of query execution plan optimization but teach instead one of many useful methods for quickly defining an elementary search space.
Reference is made to D. Cornell, et al, "Relational Assignments for Distributed Database Systems", IBM Technical Disclosure Bulletin, Vol. 31, No. 1, June 1988, pp. 219-225, for discussion of a technique for parsing distributed database queries into communication and relational steps for optimization with respect to inter-node communication requirements. Cornell et al propose procedures for decomposing queries into simple relational steps expanded with potential message steps to equalize the work load across multiple systems and are primarily concerned with load sharing rather than query optimization.
Reference is made to S. Brady, et al, "Extended Disjunctive Normal Form for Efficient Processing of Recursive Logic Queries", IBM Technical Disclosure Bulletin, Vol. 30, No. 1, June 1987, pp. 360-366, for a discussion of a procedure that eliminates unnecessary temporary tables during optimization of the type of formal query procedure used in expert system. The Brady et al technique is useful for optimizing queries within a single search space but nether suggests nor considers methods for adjusting the search space responsive to query execution time.
Increasing the set of feasible plans to form a larger search space may improve the chance of locating the universal optimum but does not guarantee it. Unfortunately, increasing a search space size does guarantee an increase in the query evaluation and optimization costs. A major challenge in the design of query optimizer to ensure that the space of feasible plans contains efficient plans without making it too big to be generated practically. The techniques in the art for query optimization do not provide means for automatically adjusting the search space in response to query execution time and do not suggest practical means for guaranteeing inclusion of the universal optimum.
These unresolved problems and deficiencies are clearly felt in the art and are solved by this invention in the manner described below.