This invention relates to the field of database management and query processing, and in particular to a method for improving a computer's efficiency in executing a received query by recasting the query using one or more bushy joins of the tables involved in the received query.
The use of computer databases to organize information is ever-growing. This growth includes the expansion of fields of data and the relationships among these various fields of data. Correspondingly, the complexity of queries submitted by users to retrieve specific information based on such relationships has also increased.
Conventionally, a database comprises a number of Tables that include rows and columns, each row being an identifiable “instance” of a record in the table, and each column being the value of the attributes included in each row. Data is retrieved from the database by executing a query that defines the selection of records from the database, and the operations that are to be performed on these selected records. The selection of records is performed by applying the search criteria, or “predicates” to the tables of the database. Predicates may address one or two tables. A predicate that addresses one table is termed a “single-table predicate”, wherein a particular value is specified for selection from the table. For example, single-table predicates include such terms as “Table.column=Constant”, “Table.column IN (Constant1, Constant 2, . . . ”, “Table.column BETWEEN (Constant1, Constant2)”, and so on.
A predicate that addresses two tables is termed herein a “linking predicate”, to distinguish it from the selective predicate. A linking predicate establishes a connection between the two tables, and is typically of the form “Table1.columnA=Table2.columnB”, which filters both tables for records in which the value of columnA in Table1 is the same as the value of Table2. Applying a linking predicate forms a table that includes all of the matched records of Table1 and Table2, and is termed a “join” of these tables. A join of tables may also be formed without a predicate, wherein all of the entries of Table2 are joined with Table1.
Queries are typically executed by a sequential execution of the predicates and other operations (such as joins without predicates) in the query, wherein each predicate or other operation is applied to the table formed by the previous predicates or operations in the query. Consequently, the order of processing each join of a query can significantly affect the speed at which the computer that executes the query will provide a result. For example, in a query that addresses two different tables in the database, executing a predicate in the query on a first table that substantially reduces the number of ‘candidate’ records in the resultant table before executing a predicate related to the second table will significantly reduce the number of records that the computer must compare to the second table. Conversely, executing a predicate on the first table that is not expected to significantly reduce the number of candidate records in the resultant table will not significantly reduce the number of records that need to be compared to the second table.
Accordingly, the order of executing predicates in the query is generally determined based on the expected reduction in ‘costs’ of executing subsequent predicates. The total cost of sequentially executing a query is commonly determined based on the cost of executing each predicate, and the cost of accessing and moving data, in light of the order of execution of the predicates. The costs of different orders of execution are determined, and an order of execution that minimizes this cost is selected as the preferred order to execute the particular query. The order of execution of predicates in the query may be represented as an “operation tree”, an “operation plan”, or a “query tree”. The sequential execution of each predicate on the results of the execution of each prior predicate results in a “linear” join tree, such as illustrated in FIG. 1C (detailed further below).
In addition to controlling the order of executing predicates in a query, conventional query processors may also optimize performance by such techniques as “Column Elimination” transformation, which removes any columns in the resultant table that are not subsequently used in the query, and “Predicate Pushdown”, which finds predicates that can be “pushed down” in the operation tree to be applied to a smaller resultant table. Other techniques for optimizing linear operation trees are common in the art. As used herein, the terms ‘optimization’ and ‘least cost’ are used loosely, and should be interpreted as the application of techniques that are either known to reduce costs, or are structured to assess a variety of options and select the least costly option from among the selected variety of options.
U.S. Pat. No. 8,438,152, issued 7 May 2013 to Rafi Ahmed, (hereinafter Ahmed) and incorporated by reference herein, discloses that improvements in performance may be obtained by recasting a linear join tree into a bushy join tree. The bushy tree comprises two or more independent sub-trees that are joined to form the query. Ahmed addresses “snowstorm” queries that can be characterized as having large ‘fact’ tables and small ‘dimension’ tables, and discloses that the sub-joins should be formed by joining large fact tables to small dimension tables, and then choosing the combination of such sub-joins that produces the least costly tree.
However, in a typical query that addresses multiple large fact tables, the evaluation of all such combinations is likely to consume a substantial amount of ‘pre-processing’ to determine the least costly tree. Ahmed does not provide specific guidance for further selecting candidate sub-joins; presumably, all joins between large and small tables should be considered and all the trees formed by different combinations of such joins should be evaluated to select the least costly tree.
In a subsequent technical paper (“Of Snowstorms and Bushy Trees”, Ahmed et al., Proceedings of the VLDB Endowment, Vol. 7, No. 13, included by reference herein), Ahmed notes that a query with five tables provides 1,680 possible permutations (using Ahmed's formula, a query with six tables will provide 30,340 possible permutations). Ahmed discloses that this number can be substantially reduced by pre-grouping the tables, but does not disclose techniques for such pre-grouping. It appears that each large table, with its related small tables, would form each group. However, a query that addressed five large tables would still have 1,680 possible permutations to be evaluated, the cost of which would likely preclude the application of Ahmed's bushy join recasting process.
It would be advantageous to further improve the performance of a computer in executing a database query by providing a bushy tree recasting process that substantially reduces the number of bushy tree combinations that need to be evaluated and compared. It would also be advantageous to provide a bushy tree recasting process that is not dependent on the relative sizes of the tables, and is not limited to the recasting of “snowstorm” queries. By efficiently selecting the bushy tree combinations that are to be evaluated and compared, a bushy tree combination can be provided for execution of the database query that substantially enhances the database management system's performance by reducing the total processing time, the total data transfer time, or other costs in responding to the received query.
These advantages, and others, can be realized by recasting an original query into a set of potential bushy tree operation plans that include the creation of one or more “sub-joins” that provide intermediate resultant “sub-tables”, and using these sub-tables as operands in the other predicates of the query. That is, as contrast to the conventional ordering of joins in the received query by providing a least-cost linear join tree corresponding to the received query, the process of this invention includes the efficient selection of sub-tables that are the result of applying predicates of the received query to select tables of the database, and then using these sub-tables as components of a rewritten operation plan that reduces the costs associated with executing the received query by the database management system. As contrast to the prior art technique of forming sub-joins of every ‘large’ table and its associated ‘small’ tables, the process of this invention selects joins that are likely to improve the execution of the query (hereinafter “selective joins”) without regard to the size of the tables being joined, and without regard to the ‘type’ of query being processed.
In an embodiment of this invention, a received query is analyzed to identify “satellite” tables and “seed” tables. A satellite table is a table of the database that is filtered by a selective predicate, and is connected to only one other table in the database via at least one linking predicate, and a seed table is a table of the database that is connected to a satellite table and at least one other table. Multiple candidate operation plans are formed as bushy trees that include a sub join of each seed table and its connected satellite table. A least cost operation plan is selected for execution from among these candidate operation plans and the conventional least cost linear operation plan based on the costs associated with each operation plan.
Throughout the drawings, the same reference numerals indicate similar or corresponding features or functions. The drawings are included for illustrative purposes and are not intended to limit the scope of the invention.