1. Field of the Invention
The present invention relates generally to database query processing and optimization and more particularly to top-down rule-based database query optimizers.
2. Description of Background Art
A central issue in the design of database systems is the query processing strategy that is employed. Considerable focus has been placed in this area since a poor strategy can adversely effect the performance of the database system. In SQL and similar query processing languages, a query can be expressed in a variety of different representations. Since the transfer of data that usually resides on secondary storage is slower than such a transfer from main memory, it is imperative that the number of accesses to secondary storage be minimized. Typically, a user writes a query without considering the most efficient manner for realizing the query. This task becomes the responsibility of a query optimizer.
The objective of the query optimizer is to find an execution strategy that causes the result of the query to be produced in the most efficient (xe2x80x9coptimalxe2x80x9d) manner. Optimality is used to denote the best strategy that satisfies a prescribed criteria. Often this criteria is the minimization of a defined metric, such as computational cost. Query optimization is a search process that entails producing a solution space of semantically equivalent expressions that represent the query. The semantically equivalent expressions are generated through the application of rules. The optimizer searches through the solution space finding the optimal solution that best satisfies the defined metric.
A consideration in the design of a query optimizer is the minimization of its execution time as well as the conservation of memory space. The inefficient use of memory space and the execution of needless computations detrimentally affects the query optimizer""s performance. Accordingly, there is a need to minimize the execution time of a query by utilizing efficient search procedures for finding the optimal solution.
Conventional query optimizers utilize a search engine and a database implementor (DBI) to generate an optimal plan for an input query having an optimization goal. The search engine generates a solution space from which an optimal solution or plan is selected. The solution space is defined by a set of rules and search heuristics provided by the DBI. The rules are used to generate solutions and the search heuristics guide the search engine to produce more promising solutions rather than all possible solutions.
The database query is represented as a query tree containing one or more expressions. An expression contains an operator having zero or more inputs (children) that are expressions. The query optimizer utilizes two types of expressions: logical expressions, each of which contain a logical operator; and physical expressions, each of which contain a physical operator specifying a particular implementation for a corresponding logical operator. An implementation rule transforms a logical expression into an equivalent physical expression and a transformation rule produces an equivalent logical expression. The database query is initially composed of logical expressions. Through the application of one or more implementation and transformation rules, the logical expressions in the database query are transformed into physical expressions.
The search engine utilizes a search procedure that generates a xe2x80x9csolutionxe2x80x9d for the database query by partitioning the database query into one or more smaller subproblems where each smaller subproblem can contain one or more expressions. Some of the subproblems form a subtree including other subproblems as inputs. A solution to each subproblem is generated in accordance with an order that generates a solution for each input subproblem before a solution for its associated parent subproblem is generated. The solution for the database query is then obtained as the combination of the solutions for each of the subproblems.
The search procedure utilizes a top-down branch and bound technique for generating solutions for each subproblem. An initial solution is obtained for each subproblem that has an associated cost which is used as an upper bound for considering other candidate solutions. Additional solutions whose associated costs exceed the upper bound are eliminated from consideration. The solution having the lowest cost is selected as the optimal solution.
Solutions are generated through the application of implementation and transformation rules. Transformation rules produce equivalent logical expressions and implementation rules produce physical expressions. Each rule has a pattern and a substitute. A pattern is the before expression that is matched with the expression that is being optimized. A substitute represents the semantically equivalent expression that is generated by applying the rule. A rule""s pattern matches an expression when the expression contains the same operators in the same position as the rule""s pattern. Prior to applying a rule to an expression, all possible bindings that match a rule""s pattern are determined. The purpose of a binding is to find all possible expressions that can match a rule""s pattern in order to generate every possible equivalent expression.
A search data structure is used to store the expressions that are generated during the search process including those that are eliminated from consideration. The search data structure is organized into equivalence classes denoted as groups. Each group includes one or more logical and physical expressions that are semantically equivalent to one another. Initially each logical expression of the input query tree is represented as a separate group in the search data structure. As the optimizer applies rules to the expressions in the groups, additional equivalent expressions, and additional groups, are added. Duplicate expressions are detected before they are inserted into the search data structure.
The search procedure utilizes guidance methods that guide it toward generating more viable plans. The guidance methods produce guidance structures which are heuristics that are used to select rules that will generate more promising solutions. The heuristics capture knowledge of the search procedure which is passed onto later processing stages in order to eliminate generating unnecessary and duplicate expressions.
A problem with the conventional query optimizers, described above, is that when it is presented with a complex query, the conventional optimizers enumerate the plan search space, i.e., the set of all possible execution plans, by recursively applying transformation and implementation rules to existing plans or expressions. Considering all possible join orders, for example, results in an exponential growth in the number of applied rules as the number of tables to be searched increases. Consequently, such conventional query optimizers are unable to optimize such complex queries regardless of how efficient the implementation is. What is needed is a query optimization system and method that can optimize an arbitrarily complex query within a time that is at most linearly proportional to the complexity of the query.
The invention is a system and method for optimizing complex SQL database queries. The query optimizer contains a search engine and a database implementor (DBI) that are used to generate an optimal plan for an input query having specified required physical properties. The search engine generates a solution space from which an optimal plan is selected. The solution space is defined by a set of rules and search heuristics provided by the DBI. The rules are used to generate solutions and the search heuristics guide the search engine to produce more promising solutions rather than all solutions.
The database query is represented as a query tree containing one or more expressions. An expression contains an operator having zero or more inputs that are expressions. The query optimizer utilizes two types of expressions: logical expressions, each of which contain a logical operator; and physical expressions, each of which contain a physical operator specifying a particular implementation for a corresponding logical operator. An implementation rule transforms a logical expression into an equivalent physical expression and a transformation rule transforms a logical expression into an equivalent logical expression. The database query is initially composed of logical expressions. Through the application of one or more implementation and transformation rules, the logical expressions in the database query are transformed into physical expressions resulting in a solution.
In order to prevent the generation of redundant expressions, each rule is classified as being context-free or context-sensitive. A context-free rule is applied once to an expression, while a context-sensitive rule is applied once to an expression for each optimization goal.
A search data structure is used to store the expressions that are generated during the search process including those that are eliminated from consideration. The search data structure is organized into equivalence classes denoted as groups. Each group represents expressions that are equivalent to one another. Equivalence in this sense denotes those expressions that contain semantically equivalent operators, have similar inputs, and require the same characteristic inputs and produce the same characteristic outputs (otherwise referred to as group attributes). The set of characteristic inputs represent the minimum number of values required for the expression""s operator and for any input expressions associated with it. The set of characteristic outputs represent the minimum number of values that the expression supplies to any parent expression associated with the expression.
Each group includes one or more logical expressions, zero or more physical expressions, zero or more plans, and zero or more contexts. The expressions contained in a group are semantically equivalent to one another. A context is associated with an optimization goal and contains reference to (potentially) one optimal solution (plan) and other candidate plans. By explicitly distinguishing between plans and physical expressions, multiple plans can be generated from the same physical expression given different required physical properties.
Initially the group attributes for each logical expression of the input query are determined and used to store each expression in an appropriate group in the search data structure. As the optimizer applies rules to the logical expressions, additional equivalent expressions, plans and groups are added. The group attributes of the newly generated expressions are computed in order to determine whether a duplicate of the newly generated expression is stored in the search data structure. A duplicate expression is one that has the same operator, number of inputs, ordinality of inputs, and group attributes, for example. Duplicate expressions are not inserted into the search data structure.
The search engine can utilize a search procedure to generate a solution by partitioning the database query into one or more subproblems where each subproblem can contain one or more expressions. Some of the subproblems form a subtree having other subproblems as inputs. Each subproblem has an associated set of required physical properties that satisfies the constraints imposed by its associated parent subproblem""s required physical properties. A solution to each subproblem is generated in accordance with an order that generates a solution for each input subproblem before a solution for its associated parent subproblem is generated. The solution for the database query is then obtained as the combination of the solutions for each of the subproblems.
The search procedure utilizes a branch and bound technique for generating solutions for each subproblem. In one embodiment of the present invention, an initial solution is obtained for each subproblem that has an associated cost which is then used as an upper bound for considering other candidate solutions. Additional solutions whose associated costs exceed the upper bound are eliminated from consideration. The solution having the lowest cost is selected as the optimal solution.
The cost associated with each expression is determined as a function of various criteria, described below. Each criterion is weighed in accordance with the context of the expression and the user""s particular computing environment. As such, the weights are adaptive, not static. In this manner, the cost can represent a more accurate estimate of the computational expense associated with executing an expression.
In one embodiment of the present invention, the cost is represented as a vector of 12 cost scalar components which respectively measure the following resource usage associated with a query operator: (1) CPU: A count of CPU instructions, a value of 1 means 1000 machine instructions; (2) Seeks: Number of random input/output""s (I/O""s) and positioning for sequential reads; (3) I/O Transfer: Kilobytes of I/O transferred; (4) Normal Memory: Amount of memory (in Kilobytes) needed to hold working buffers, hash tables, etc.; (5) Persistent Memory: Amount of memory (in Kilobytes) which persists after operator completion; (6) Local Messages: count of the number of local messages; (7) Local Message Transfer: Number of Kilobytes transferred for all local messages; (8) Remote Messages: A count of the number of remote messages; (9) Remote Message Transfer: Number of Kilobytes transferred for all remote messages; (10) Temporary Disk Space: Amount of disk space (in Kilobytes) used for temporary files; (11) Repeat Count: known as number of probes, this value represents the number of times the operator will be executed, typically this value will be 1 (one), but in some cases (e.g. the inner scan of a nested loops join) it can be greater than one; and (12) Blocking/Idle Time: Amount of time spent waiting for child operators in the expression tree.
A plan is generated through the application of one or more rules to a logical expression. Each rule has a pattern and a substitute. A pattern is the xe2x80x9cbeforexe2x80x9d expression that is matched with the expression that is being optimized. A substitute represents a semantically equivalent expression that is generated by applying the rule. A rule""s pattern matches an expression when the expression contains the same operators in the same positions as the rule""s pattern.
The present invention may apply one or more pruning heuristics to the expression, the binding, and/or the substitute. The heuristics identify certain rules that can be eliminated by either not applying the rules and/or not implementing the rules for a given expression and context (if any) based upon one or more flow rates of the expression. The pruning heuristics can eliminate the application of rules based upon the flow rates of the binding or substitute, for example. Examples include (1) not applying (cutting) a MergeJoin rule for a join expression when an inner table is small enough to be stored in a memory space that is allocated for a HashJoin; (2) not applying implementation rules on the substitute of a left-shift rule for an expression if the resulting input data flow rate from the left child of the join is significantly larger in the substitute than in the binding; (3) not applying the join to TSJ (tuple substitute join) rule if the data flow output of the join expression is significantly larger than the data flow input from the inner child of the join expression; or (4) not applying implementation rules on the substitute join expression of a left shift rule if the number of cross products increases and if the data flow rate from the left child is increases. These are a few examples of pruning heuristics that can be invoked to prevent unnecessary application and implementation of rules. Such pruning heuristics can significantly reduce the search space by eliminating nodes before optimizing the expression.
Prior to applying a rule to an expression, one embodiment of the present invention identifies all possible bindings that match a rule""s pattern. The purpose of a binding is to find all possible expressions that can match a rule""s pattern in order to generate every possible equivalent expression.
An expression in the search data structure is stored with pointers representing each input, if any. Each pointer has a link mode that allows it to reference the group associated with the input or a particular input expression in the group, for example. When a pointer""s link mode is in memo mode, each pointer addresses the group of each input. When the link mode is in binding mode, each pointer addresses a particular expression in the input""s group.
In binding an expression that has inputs, the link mode of the expression is set to binding mode. The input pointers of the expression are set to address a particular input expression stored in the associated group, thus forming a specific subtree. Further subtrees or bindings are formed by incrementing the input pointers appropriately to form other subtrees representing a different combination of the input expressions.
Prior to applying a rule to an expression the present invention can also identify the complexity of the query. If the complexity of the query is above a threshold, the present invention determines whether the rule should be applied based upon several factors including the type of rule and the position of the node in the search space. Those rules that need not be applied are randomly pruned. Pruned rules are not applied, while those rules that are not pruned are applied.
The DBI contains search heuristics in the form of guidance methods that select a set of rules for use in generating a plan for each subproblem. A guidance method, e.g., OnceGuidance, is used to prevent the needless generation of redundant expressions that result from the subsequent application of a rule to the rule""s substitute. The OnceGuidance guidance method is used in certain cases to select a set of rules that are applicable to a rule""s substitute without including those rules that will regenerate the original expression.
The search engine utilizes a series of tasks to implement the search procedure. Each task performs a number of predefined operations and schedules one or more additional tasks to continue the search process if needed. Each task terminates once having completed its assigned operations. A task stack is used to store tasks awaiting execution. The task stack is preferably operated in a last-in-first-out manner. A task scheduler is used to pop tasks off the top of the stack and to schedule tasks for execution.
A garbage collection task is scheduled whenever two groups of the search data structure are merged. The merger of two groups occurs as a result of the application of certain rules, such as the xe2x80x9celimination groupbyxe2x80x9d rule. In this case, the elements of one group are merged with the elements of a second group when its determined that the two groups share a common expression. The first group obtains the group identifier of the second group. The garbage collection task is then scheduled to update any references to the first group by any of the expressions in the search data structure and to eliminate any duplicate expressions in the merged groups.
In a preferred embodiment, the query optimizer performs multiple optimization passes. A first pass, using a certain set of implementation rules, is used to generate a first solution having a cost that is used as a threshold in subsequent passes. In one or more subsequent passes, a set of both implementation and transformation rules is applied to generate one or more additional plans each of which has a cost that does not exceed the threshold.