One popular form of computerized record-keeping system is the relational database. Between the actual database (i.e., the data as stored for use by a computer) and the users of the system is a software layer known as the relational database management system (RDBMS). The RDBMS is responsible for handling all requests for access to the database, shielding the users from the details of any specific hardware implementation. Using relational techniques, the RDBMS stores, manipulates and retrieves data in the form of table-like relations typically defined by a set of columns or attributes of data types and a set of rows (i.e., records or tuples) of data. The columns may further include restrictions on their data content (i.e., valid domains) and may be designated as a primary key or unique identifier for the relation or a foreign key for one or more other relations.
The standard language for dealing with relational databases implemented by most commercial RDBMSs is the Structured Query Language or SQL. SQL includes both data definition operations and data manipulation operations. In order to maintain data independence a query (i.e., a set of SQL commands) instructs the RDBMS what to do but not how to do it. Thus, the RDBMS includes a query processor for generating various query plans of execution and choosing the cheapest plan. Due to the high-level nature of relational expressions and a variety of implementation techniques, automatic query optimization is possible and often necessary to ensure more efficient query processing.
In accordance with well known query translation processes, an SQL query is processed in stages. The initial stage casts the source query into an internal form such as the Query Graph Model (QGM) following the preliminary steps of lexing, parsing and semantic checking. The goal of the QGM is to provide a more powerful and conceptually more manageable representation of queries in order to reduce the complexity of query compilation and optimization. The internal QGM is a data structure for providing the semantic relationships of the query for use by query translator and optimizer components for rewriting the query in a canonical form. In a next phase, a plan optimizer produces a query execution plan such as by generating alternate plans and choosing a best plan based on estimated execution costs. Finally, a plan refinement stage may be employed to refine the optimum execution plan in accordance with run-time requirements.
SQL query compilation and optimization techniques using the Query Graph Model (QGM) are well known to those skilled in the art and include the teachings of Hamid Pirahesh, Joseph Hellerstein, and Waqar Hasan, “Extensible/Rule Based Query Rewrite Optimization in STARBURST,” Proceedings of ACM SIGMOD '92 International Conference on Management of Data, San Diego, Calif., 1992, incorporated by reference herein (hereinafter Pirahesh et al.). One QGM as discussed in Pirahesh et al. is described briefly herein.
The structure of the QGM is central to the query rewrite mechanism, since “rewriting” a query corresponds to transforming its QGM. The QGM is a graph of nodes (or “boxes”), each representing a table operation whose inputs and outputs are tables. Examples of such operations are SELECT, GROUPBY, UNION, LEFT JOIN, INTERSECT and EXCEPT. In our terminology, the operation SELECT incorporates selection, projection and join (i.e., the simple unnested “SELECT, FROM and WHERE” clauses in SQL). The number of QGM boxes in a query typically ranges from 2 to 40.
A useful QGM known in the art is described by example. FIG. 1 illustrates a graphical representation of a QGM for the following SQL query:
SELECT DISTINCT Q1.PARTNO, Q1.DESCR, Q2.SUPPNOFROM INVENTORY Q1, QUOTATIONS Q2 WHERE Q1.PARTNO = Q2.PARTNO AND Q1.DESCR =‘ENGINE’  AND Q2.PRICE <= ALL (SELECT Q3.PRICE FROM QUOTATIONS Q3   WHERE Q2.PARTNO=Q3.PARTNO)
This query provides information about suppliers and parts for which the supplier's price is less than that of ALL other suppliers. FIG. 1 shows five boxes or nodes on a QGM graph 10. Boxes 12 and 14 are associated with base tables INVENTORY and QUOTATIONS. Box 16 is a SELECT box associated with the main part of the query and box 18 is a SELECT box associated with the sub-query. Box 20 (i.e., Top node) represents the data output table requested by the query. Each box 12, 14, 16, 18 and 20 has two main components a head and a body. Each head (for example head 22 of box 16) describes the output table produced by the box and each body (for example body 24 of box 16) specifies the operation required to compute the output table. Base tables have empty or nonexistent bodies.
With reference to box 16, head 22 specifies output columns PARTNO, DESCR and SUPPNO, as specified in the SELECT list of the query. The specification of these columns includes column names, types, and output ordering information (not shown). The head 22 includes a Boolean attribute called DISTINCT that indicates whether the associated table contains only distinct tuples (head.distinct=TRUE), or whether it may contain duplicates (head.distinct=FALSE).
The body of a box contains a graph where the vertices represented by darkened circles in FIG. 1 represent quantified tuple variables, called QUANTIFIERS. Box 16 includes quantifiers q1, q2, and q4 (respectively 30, 32 and 34). Quantifiers q1 30 and q2 32 range over (i.e., read from) the base tables INVENTORY 12 and QUOTATIONS 14 respectively, and correspond to the table references in the FROM clause of the SQL query. Vertices q1 30 and q2 32 are connected via respective interbox edges 38 and 40 to the heads of the INVENTORY 12 and QUOTATIONS 14 boxes. The edge 42 between q1 30 and q2 32 specifies the join predicate. The loop edge 44 attached to q1 30 is the local predicate (Q1.DESCR=‘ENGINE’) on q1 30. Each interquantifier edge represents a conjunct of the WHERE clause in the query block the conjuncts being represented in the diagram by the labeled rectangle along the edge. Such edges are also referred to as Boolean factors. Quantifier q4 is a UNIVERSAL quantifier, associated with the ALL sub-query in the WHERE clause. This represents that for ALL tuples associated with q4, the predicate represented by the edge between q2 32 and q4 34 is TRUE.
In Box 16, q1 30 and q2 32 participate in joins, and some of their columns are used in the output tuples. These quantifiers have type F (ForEach), since they come from the query's FROM clause. Quantifier q4 34 has type A, representing a UNIVERSAL (ALL) quantifier. SQL's predicates EXISTS, IN, ANY and SOME are true if at least one tuple of the sub-query satisfies the predicate. Hence, all of these predicates are EXISTENTIAL, and the quantifiers associated with such sub-queries have type E. Each inter-box edge is labeled with the quantifier columns that the edge provides from the table the quantifier ranges over. Additionally, quantifiers may be ordered within a box to support asymmetric operators, such as EXCEPT. In QGM, the quantifiers associated with existential and universal sub-queries are called COUNTING quantifiers. SCALAR sub-query quantifiers have the type S, requiring that (1) the sub-query returns at most one row and (2) if the sub-query does not produce any row, a null value will be returned via the S quantifier.
Box 18 represents the subquery SELECT Q3.PRICE FROM QUOTATIONS Q3 WHERE Q2.PARTNO=Q3.PARTNO. Quantifier q3 46 is of type F and ranges over the base table QUOTATIONS 12. Box 18 includes a predicate 48 (Q2.PARTNO=Q3.PARTNO) that refers to q2 32 and q3 46.
The body of every box has an attribute called DISTINCT that has a value of ENFORCE, PRESERVE or PERMIT.ENFORCE means that the operation must eliminate duplicates in order to enforce head.distinct=TRUE. PRESERVE means that the operation must preserve the number of duplicates it generates. This could be because head.distinct=FALSE, or because head.distinct=TRUE and no duplicates could exist in the output of the operation even without duplicate elimination. PERMIT means that the operation is permitted to eliminate (or generate) duplicates arbitrarily. For example, the DISTINCT attribute of Box 18 can have the value PERMIT because its output is used in universal quantifier q4 34 of box 16, and universal quantifiers are insensitive to duplicate tuples.
Like each box body, each quantifier q1, q2, q3, and q4 (30, 32, 46 and 34 respectively) also has an attribute called DISTINCT (not shown) that has a value of ENFORCE, PRESERVE or PERMIT. ENFORCE means that the quantifier requires the table over which it ranges to enforce duplicate elimination. PRESERVE means that the quantifier requires that the exact number of duplicates in the lower table be preserved. PERMIT means that the table below may have an arbitrary number of duplicates. Existential and universal quantifiers can always have distinct=PERMIT, since they are insensitive to duplicates.
In the body, each output column may have an associated expression corresponding to expressions allowed in the select list of the query. These expressions are called head expressions. In FIG. 1, all of these expressions are simply identity functions over the referenced quantifier columns.
DB2™ from IBM Corporation supports derived tables, which are similar to VIEW definitions, and can be defined anywhere a table can be used. In DB2, derived tables and VIEWs, just like queries and sub-queries, have a QGM, with one or many boxes. When a derived table or VIEW is referenced in a query, its QGM becomes part of the QGM graph of the query.
The output of a box can be used multiple times (e.g., a VIEW may be used multiple times in the same query), creating common sub-expressions.
A particular QGM optimization technique termed “subsumption” has been discussed extensively in many research literature and has been widely used. Subsumption is particularly useful for the rewriting of a query to use an existing materialized view, as disclosed in the following publications [incorporated herein by reference]: L. S. Colby, R. L. Cole, E. Haslam, N. Jazaeri, G. Johnson, W. J. McKenna, L. Schumacher, D. Wilhite, Red Brick Vista: Aggregate Computation and Management, Proceedings of the 14th Int'l, Conference on Data Engineering, Orlando, Fla., 1998. R. Bello, K. Dias, A. Downing, J. Feenan, J. Finnerty, W. Norcott, H. Sun, A. Witkowski, M. Ziauddin, Materialized Views In Oracle, Proceedings of the 24th VLDB Conference, New York, 1998, D. Srivastava, S. Dar, H. Jagadish, A. Levy, Answering Queries with Aggregation Using Views, Proceedings of the 22nd VLDB Conference, Mumbai, India, 1996.
In accordance with a query represented by a QGM graph, SELECT box X is said to be subsumed by another SELECT box Y, if the result set of X is a subset of the result set of Y and if the output column set of X is a subset of the output column set of Y. In other words, the result set of X can be rederived using the result set of Y. A simple subsumption situation arises when the predicate set in box X is a superset of the predicate set in Y. The SELECT box X will filter out more rows since it has more predicates than the box Y. For example, consider the following queries L1 and M1:
L1: SELECT * FROM T WHERE C1 >0 AND C2>0M1: SELECT * FROM T WHERE C1 >0
The result set produced by L1 is a subset of the result set produced by M1. The reason is that both queries select rows from the same table T (i.e., T is a common sub-expression), and all predicates in M1 appear in L1. Hence, M1 subsumes L1.
Another simple situation is when the predicate set in box X is more restrictive than the predicate set in Y. For example, consider the following additional query L2:
L2: SELECT * FROM T WHERE C1 > 10
The result set produced by L2 is a subset of the result set produced by M1, and therefore M1 subsumes L2 in addition to subsuming L1.
Query subsumption, in terms of predicate sets, can be defined as follows using query blocks. Assume that a SELECT box L with a single quantifier Q ranges over a common sub-expression (e.g., a table) T, and another SELECT box M with a quantifier Q′ ranges over the same table T, wherein the quantifiers Q and Q′ comprise the table references of the SELECT boxes. In the above example, the predicate sets in L1, L2 and M1 are:L1: Q.C1>0 and Q.C2>0L2: Q.C1>10M1: Q′.C1>0
A predicate can be mapped into another predicate by mapping the column references between two quantifier sets. For example, the predicate set in M1 can be mapped from {Q′) to {Q} resulting in:M1: Q.C1>0
Now, subsumption can be defined precisely on predicates mapped via quantifiers:                If the predicate set in M mapped from {Q′) to {Q) is less restrictive than the predicate set in L, then L is subsumed by M. That is L is a subsumee and M is a subsumer.        
In other words, all the rows produced by L can be found in the result set produced by M.
As mentioned above, the result set by the subsumee can be rederived from the subsumer. In the above examples, in order to obtain the result sets of L1 and L2, one must apply the respective compensation predicates “C2>0” and “C1>10” on the result set of M1.
A query optimization technique for redundant join elimination has also been disclosed in Query Rewrite Optimization in STARBURST,” Proceedings of ACM SIGMOD '92 International Conference on Management of Data, San Diego, Calif., 1992 referred to above. Essentially, if a self join on a table and the join condition involves the primary key of the table, then the join can be removed because the join is really always one-to-one tuple matching between the two references of the given table.
Despite these known optimization techniques of subsumption and redundant join elimination, there is further need to optimize queries by eliminating redundant joins when the join conditions do not involve the primary keys.