The present invention relates to computer database systems and more particularly to processing queries that specify a join operation.
Relational databases store information in collections of tables, in which each table is organized into rows and columns. FIG. 5(a) illustrates an exemplary database 500 containing four tables, useful in a sales application. Order table 502 records information about orders, identified by an order identifier (OID), that are submitted by customers, identified by a customer identifier (CID). For example, the first row of order table 502 indicates that customer #2 submitted order #1. Customer table 504 records information about customers, identified by a CID, including the ZIP code in which the customer is located. For example, the first row of customer table 04 indicates that customer #2 is located in ZIP code 19555. Assignment table 506 lists which sales associates, identified by a sales associate identifier (SID), are responsible for which customers, identified by the CID. In the example, the first and second rows of assignment table 506 indicate that customer #2 is assigned to sales associates #1 and #2. Finally, sales associate table 508 records information about each sales associates, identified by the SID, including the name of the associates. For example, the first two rows of sales associate table 508 indicate that xe2x80x9cSmithxe2x80x9d is the name for sales associate #1 and xe2x80x9cJonesxe2x80x9d is the name for sales associate #2.
A database user retrieves information from the tables of a relational database by entering input that is converted to queries by a database application. The database application, in turn, submits the queries to a database server. In response to receiving a query, the database server accesses the tables specified in the query to determine which information within the tables satisfies the query. The information that satisfies the queries is then retrieved by the database server and transmitted to the database application and ultimately presented to the user. Database servers are also capable of xe2x80x9cjoiningxe2x80x9d information contained in many of the tables into a single result in response to a query. For example, one query for the exemplary database is to list all the orders of customers in ZIP code 19555 for sales associate Smith.
For any given database application, the queries must conform to the rules of a particular query language. Most query languages provide users with a variety of ways to specify information to be retrieved. For example, in the Structured Query Language (SQL), the following query requests the retrieval of all the orders of customers in ZIP code 19555 for sales associate Smith:
select OID, NAME
from Order 0, Customer C, Assignment A, Sales_Associate S
where O.CID =C.CID and C.ZIP=19555 and
C.CID=A.CID and A.SID=S.SID and
S.NAME=xe2x80x9cSmithxe2x80x9d;
This query performs a xe2x80x9cjoinxe2x80x9d operation on the order table 602, customer table 604, assignment table 606, and sales associate table 608. A join operation combines rows from two or more relational database objects, such as tables, views, or snapshots, that meet a specified criterion, called the xe2x80x9cjoin conditions.xe2x80x9d A join operation is performed whenever multiple tables appear in the FROM clause of query. In this example, four tables are listed. The SELECT list of the query can reference any of the columns from any of the base objects listed in the FROM clause. The join conditions are specified by the predicates in the WHERE clause. A join condition can relate to a single table, for example C.ZIP=19555, or to multiple tables, for example, O.CID=C.CID. A result of the query is depicted in FIG. 5(b). Specifically, query result 510 includes a tuple of order #1 and Smith""s name.
When the join operation is commutative and associative, as in this example, the tables may be joined in any order without affecting the final result. However, while not affecting the final result, the order in which the tables are joined or xe2x80x9cjoin orderxe2x80x9d may have a significant effect on the performance of the database system. For example, joining the order table 502 with the customer table 504 first results in a single tuple, which helps to constrain the amount of input/output in later join operations, but joining customer table 504 with sales associate table 508 results in a Cartesian product of twelve tuples, larger than any of the tables and very expensive to process. Therefore, it is desirable to find a join order that has good performance, because an arbitrary join order, for example, the order listed in the FROM clause, may be prohibitively expensive to process. Techniques have been developed in an attempt to discover a good join order for a database query, but many of these techniques have proven to be inadequate.
For example, one technique, called a xe2x80x9cheuristic approach,xe2x80x9d examines the join conditions and statistical information about the tables. Various rules of thumb or xe2x80x9cheuristicsxe2x80x9d are applied to this information to select a good join order as the desired join order in executing the query. There is much controversy about which heuristic is superior to another, and it appears that the best heuristic may vary from query to query. Thus, a heuristic that is appropriate from some queries may produce a join order that is too expensive for other kinds of queries and miss a good join order.
Another technique is called an xe2x80x9cexhaustive searchxe2x80x9d wherein every possible join order is evaluated for a performance cost and the best cost join order is selected. The possible join orders are produced from an initial join ordering by an enumeration function that permutes the order of the tables in a predetermined, and typically recursive, sequence. A common sequence is a depth-first search sequence, wherein the tail of the initial join ordering is permuted before the head of the initial join ordering is permuted. The performance cost is typically based on the estimated amount input/output that would occur if the query was executed according to each join order. A major difficulty with finding a good join order according to an exhaustive search is apparent when there are many tables to be joined, for example, more than eight. For n tables to be joined, there are n-factorial (n!) number of possible join orders to consider. For example, with nine tables there are 9!=362,880 possible join orders.
Consequently, pruning techniques have been applied to limit the number of join order permutations that are considered. For example, with a depth-first search enumeration function, if the cost of performing the first join operation in a given generated join order is more than the cost of the best join order generated so far, then all join orders with that join operation first in the join order can be safely disregarded. Typically, however, pruning techniques do not disregard a sufficient number of join order permutations to make an exhaustive search feasible, especially for queries with a very large number of joined tables. Consequently, it is not uncommon for database systems to spend ten times as much time determining a good join order for a query than executing the query.
A xe2x80x9ccut-off searchxe2x80x9d is a refinement of an exhaustive search. Using a xe2x80x9ccut-offxe2x80x9d search, the best cost estimate for executing the query (the estimated cost of executing the query using the join order that is currently considered xe2x80x9cbestxe2x80x9d) is continually compared with a cost estimate for performing the search for the best join order. When the search cost exceeds the best cost estimate, then the search is terminated and the current best cost join order is chosen for processing the query. Consequently, the database system employing a cut-off search technique will rarely spent more time determining a good join order than executing the query.
The efficiency of a cut-off search is likely to be sensitive to the initial ordering that is the starting point of the search. The earlier a good plan is found, the earlier the cut-off can occur and the less time will be spent optimizing the query. Conversely, if all the permutations generated early have bad costs, the optimization phase will likely continue longer since there is no good plan available to trigger the cut off. Many enumeration algorithms are history sensitive in that the next join order permutation is determined based on the previous one and will likely generate a join order close to the initial permutation in the search space of different permutations before generating one that is far away. Hence, if the initial permutation is good or is close to a good one in the search space, it is likely that the optimization can be cut off early. On the other hand, a poor initial ordering can delay the cut off or lead to a good join order never being generated if all the good orderings are too far way in the search space to be reached before the search is cut off. Therefore, the initial ordering is crucial for good performance. However, even when the initial ordering is selected according a heuristic approach, such as in an approach described in U.S. Pat. No. 5,758,335, the initial ordering may be poor since there is no single heuristic that will always generate a good initial ordering.
Accordingly, there is a need for efficiently processing queries with multiple join operations. A need exists for a method of determining an efficient join order to process a query. There is also a need for ensuring that the enumeration of join orders is not limited to those in close locality to a poor initial ordering.
These and other needs are addressed by the present invention by generating multiple initial join orders and performing searches upon those multiple initial join orders. The locality effect is ameliorated by use of the multiple initial join orders, because the different initial join orders serve to spread out the search over the search space. In one embodiment, the multiple initial join orders are generated by multiple heuristics, thereby increasing the likelihood that good join orders are considered for a broad range of queries.
One aspect of the invention, therefore, is a computer-implemented method and computer-readable medium bearing instructions for processing a query specifying a plurality of join operations involving tables or other database objects. This methodology includes parsing the query to determine the join operations; determining a join order of the join operations by (1) generating a plurality of initial join orderings, (2) determining local best join orders based on the initial join orderings, and (3) selecting the best of the local best join orders; and executing the query based on the join order. In one embodiment, the initial join orderings are generated according to a plurality of heuristics.
Another aspect of the invention is a computer-implemented method and computer-readable medium bearing instructions for method of determining a desired ordering of join operations on database objects specified by a query to process said query, comprising the computer-implemented steps of: generating a first initial join ordering of the database objects; enumerating a plurality of first join orderings based on the first initial join ordering to search for the desired ordering based on cost estimation; generating a second initial join ordering of the database objects; and enumerating a plurality of second join orderings based on the second initial join ordering to search for the desired ordering based on cost estimation. By generating at least two initial join orderings, locality effects of enumerating join orderings can be ameliorated.
Still other objects and advantages of the present invention will become readily apparent from the following detailed description, simply by way of illustration of the best mode contemplated of carrying out the invention. As will be realized, the invention is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the invention. Accordingly, the drawing and description are to be regarded as illustrative in nature, and not as restrictive.