1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular, to the optimization of queries that reference joins other than the standard xe2x80x9cinnerxe2x80x9d join, specifically xe2x80x9couterjoinsxe2x80x9d and xe2x80x9cantijoins.xe2x80x9d
2. Description of Related Art
(Note: This application references a number of different publications as indicated throughout the specification by mnemonics enclosed in brackets, e.g., [Authorxx], wherein Author is the author""s name (or abbreviation thereof) and xx is the year of publication. A list of these different publications with their associated mnemonics can be found in Section 7 entitled xe2x80x9cReferencesxe2x80x9d in the xe2x80x9cDetailed Description of the Preferred Embodiment.xe2x80x9d Each of these publications is incorporated by reference herein.)
Computer systems incorporating Relational DataBase Management System (RDBMS) software using a Structured Query Language (SQL) interface are well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Standards Organization (ISO).
Relational join combines information from two base tables by creating pairs of matching rows. Rows without any matches are simply discarded. These kinds of joins are referred to as inner joins. In addition to inner joins, there are two other types of joins commonly seen in relational database systems, namely outerjoins and antijoins.
Outerjoin [Codd79] is a modification of inner join that preserves all information from one or both of its arguments. Outerjoins can be further categorized into left, right (single-sided), or fall (two-sided) outerjoin, depending on which side needs to be preserved. For example, the following SQL query will return all the department names and employees within each department. For those departments without employees, the department names are listed with the employee name set to NULL.
Outerjoins are important because they are frequently used in the following applications [GR97]: (a) certain OLAP queries where we need to preserve rows from the fact table with unknown (or missing) dimensional values; (b) constructing hierarchical views that preserve objects with no children and (c) queries generated by external tools.
Recently, [STH+99] proposed a way of using relational database to handle XML queries. Outerjoins are needed to express XML paths. Outerjoins are also useful for exporting relational data into XML documents [SSB+00]. For example, to generate an XML document describing a customer from relational tables, a potential implementation will issue the following query:
Antijoin is useful for handling negated nested queries. Straightforward evaluation of those queries would require using the nested iteration method, which may be very inefficient. [Kim82] proposed to transform negated nested queries into antijoins. Since join methods other than the nested loops join could potentially be used, this transformation gives the optimizer more freedom. The following example shows such a transformation. An antijoin preserves a row from the outer table if there is no match from the inner table. Otherwise, the row is discarded. Antijoin queries occur a lot in commercial systems. For example, negated nested queries are often used to maintain referential integrity.
When there are only inner joins in a query, a query optimizer considers all possible join orders and selects the cheapest execution plan. Changing the order of join evaluation is a powerful optimization technique and can improve execution time by orders of magnitude. However, when outerjoins and/or antijoins are present in addition to inner joins, changing the order of evaluation is complicated. This is because these three types of joins are not always associative with each other. Two invalid transformations (verified in [RG90]) are shown below:
R LEFT JOIN (S INNER JOIN T)xe2x89xa0(R LEFT JOIN S) INNER JOIN T
R LEFT JOIN (S ANTIJOIN T)xe2x89xa0(R LEFT JOIN S) ANTIJOIN T
As a result, not all orders will give the same answer as the original query, unless special consideration is taken.
The problem of outerjoin reordering has been studied before in [RG90,GR92,BGI95,GR97], with [GR97] being the most comprehensive.
In [GR97], the authors identify a special class of query called simple join/outerjoin queries. A simple query has the property that its query graph (without specifying the join orders) unambiguously determines the semantics of the query. A conflicting set for each join predicate p is then computed through some graph analysis, which includes all join predicates that conflict with p. The information stored in the conflicting set can be used to form proper join orders in a conventional bottom-up join optimizer. Basically, when a join predicate p is used to combine two subplans, the optimizer checks if p conflicts with any join predicates used in either subplan. If so, a generalized outerjoin will be introduced to guarantee the correctness. This is described in more detail in Section 5 below.
There are two limitations in the approach used in [GR97]. First of all, it provides solutions to simple queries only. While simple queries are an important class of query, there are many real-world queries that are not simple. For example, predicates with more than one conjunct, predicates referencing more than two tables, and Cartesian products are not allowed in simple queries. This limits the application of the technique in commercial systems. As a matter of fact, many commercial database systems (e.g., Sybase IQ [Kirk99], Informix [Brown00]) either evaluate outerjoin queries in the order specified by the user or only allow limited reordering. Second, reordering with the presence of antijoins is not considered in [GR97].
The present invention proposes a new reordering approach that can handle a more general class of queries and more types of joins. This reordering is performed in a conventional bottom-up optimizer using dynamic programming [SAC+791]. Commercial systems such as DB2 [IBM99] associate with each join predicate an eligibility list. Normally, the eligibility list of a join predicate includes only those tables that are referenced in this join predicate. During the bottom-up join enumeration, the optimizer checks if there is a join predicate p whose eligibility list is a subset of all the tables in the two subplans to be merged. If so, the two subplans are combined using p. Otherwise, the two subplans cannot be joined (unless a Cartesian product is introduced).
To incorporate reordering with outerjoins and antijoins, the present invention extends the normal eligibility list. For each join predicate, an extended eligibility list (referred to as EEL) is calculated, which includes additional tables referenced in those conflicting join predicates. Intuitively, an EEL gives all the tables needed by a predicate to preserve the semantics of the original query. EELs are precomputed during one traversal of the operator tree of the original query. Such an extension is transparent to the optimizer. Instead of the normal eligibility list, the optimizer now checks the EEL for each join predicate. Although logically equivalent to the approach used in [GR97], this framework can be extended to resolve many practical issues. This is because EELs exploit the order information in the original operator tree. The query graph, on the other hand, has lost all the orders specified in the original query. Additionally, this framework (for the first time) allows reordering with the presence of antijoins.
The preferred embodiment distinguishes two kinds of approaches to the reordering problem (no previous work has explicitly done that). The first approach only allows join orders that are valid under associative rules. So, the optimizer simply refuses to combine subplans using a join predicate whose EEL is not covered. This approach is referred to as xe2x80x9cwithout compensation.xe2x80x9d The second approach is more aggressive. When a join predicate p""s EEL is not covered, it allows two subplans to be combined using p as long as p""s normal eligibility list is covered. Some compensation work needs to be done later to correct the join result. This approach is referred to as xe2x80x9cwith compensation.xe2x80x9d The way compensation is performed is through nullification and best match. It is similar to generalized outerjoins [GR97]. However, the approach of the present invention allows multiple compensations to be merged and to be done at any time, whereas generalized outerjoins must always perform the compensation immediately. Thus, the compensation approach of the present invention considers a superset of plans of generalized outerjoins.
By distinguishing the two approaches, a framework is provided that can be smoothly incorporated into an existing system. The xe2x80x9cwithout compensationxe2x80x9d approach is relatively easy to implement and can be quickly adopted. The xe2x80x9cwith compensationxe2x80x9d approach gives the optimizer more freedom, but takes more implementation effort. A system can consider this approach at a later time.
To overcome the limitations in the prior art described above, and to overcome other limitations that will become apparent upon reading and understanding the present specification, the present invention discloses a method, apparatus, and article of manufacture for reordering outerjoins and antijoins with inner joins in a bottom-up optimizer of a relational database management system (RDBMS).
Each join predicate is associated with a normal eligibility list (NEL) that includes tables that are referenced in the join predicate, and an extended eligibility list (EEL) that includes additional tables that are referenced in conflicting join predicates. An EEL includes all the tables needed by a predicate to preserve the semantics of the original query. An algorithm is designed to calculate the EEL of each predicate using one traversal of the original operator tree.
Two approaches are used to reordering: without compensation and with compensation. The xe2x80x9cwithout compensationxe2x80x9d approach only allows join reorderings that are valid under associative rules. Thus, the optimizer will not combine subplans using a join predicate whose EEL is not covered in the two subplans to be merged.
The xe2x80x9cwith compensationxe2x80x9d approach allows two subplans to be combined using the join predicate, when a join predicate""s EEL is not covered, as long as the join predicate""s NEL is covered. Some compensation is done later to guarantee the correct result.
Compensation is performed through the operations of nullification and best match, which are defined below. Multiple compensations may be merged and performed at any time.