1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular to the optimization of SQL queries in a relational database management system having an EXISTS subquery and a NOT EXISTS subquery.
2. Description of Related Art
Databases are computerized information storage and retrieval systems. A Relational Database Management System (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. RDBMS software using a Structured Query Language (SQL) interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Organization (ANSI) and the International Standards Organization (ISO).
In RDBMS software all data is externally structured into tables. The SQL interface allows users to formulate relational operations on the tables either interactively, in batch files, or embedded in host language, such as C, COBOL, etc. Operators are provided in SQL that allow the user to manipulate the data, wherein each operator operates on either one or two tables and produces a new table as a result. The power of SQL lies on its ability to link information from multiple tables or views together to perform complex sets of procedures with a single statement.
One of the most common SQL queries executed by RDBMS software is the SELECT statement. In the SQL standard, the SELECT statement generally has the format: xe2x80x9cSELECT less than clause greater than FROM less than clause greater than WHERE less than clause greater than GROUP BY less than clause greater than HAVING less than clause greater than ORDER BY less than clause greater than .xe2x80x9d The clauses generally must follow this sequence. Only the SELECT and FROM clauses are required and all other clauses are optional.
Generally, the result of a SELECT statement is a subset of data retrieved by the RDBMS software from one or more existing tables stored in the relational database, wherein the FROM clause identifies the name of the table or tables from which data is being selected. The subset of data is treated as a new table, termed the result table. The WHERE clause determines which rows should be returned in the result table. Generally, the WHERE clause contains a search condition that must be satisfied by each row returned in the result table. The rows that meet the search condition form an intermediate set, which is then processed further according to specifications in the SELECT clause.
The search condition typically comprises one or more predicates, each of which specifies a comparison between two values from certain columns, constants or correlated values. Multiple predicates in the WHERE clause are typically connected by Boolean operators.
The SELECT statement may also include a grouping function indicated by the GROUP BY clause. The GROUP BY clause causes the rows in the intermediate result set to be grouped according to the values specified in the clause. A number of column or aggregate functions are also built into SQL, such as MAX (maximum value in column), MIN (minimum value in column), AVG (average of values in column), SUM (sum of values in column), and COUNT (number of rows).
Queries using aggregate expressions return as many result rows as there exist unique xe2x80x9cresult groupsxe2x80x9d in the source of the aggregation. A result group is defined as the set of values contained in the fields of a row corresponding to the list of columns or expressions specified in the GROUP BY clause of the query. The value of the aggregate function is the cumulative result of applying the function to the result of the contained expression for each row having the same result group.
Another operation permitted by SQL is the JOIN operation, which concatenates horizontally all or parts of two or more tables to create a new resulting table. The JOIN operation is implied by naming more than one table in the FROM clause of a SELECT statement.
An SQL query generally includes at least one predicate, which is an SQL expression that can assume a logical value of TRUE, FALSE, or UNKNOWN. A predicate typically either specifies a data range, tests for an existence condition, tests for equivalence, or performs a similar table comparison operation.
In RDBMSs, columns of any type can assume NULL (i.e., unknown) values. In RDBMS software, NULL values are properly handled using tri-value logic (i.e., TRUE, FALSE or UNKNOWN) for predicates, and SQL-based RDBMSs employ such logic.
SQL queries and their resulting table operations can be nested through several levels of predicates such that a higher nested predicate, or level of operation, cannot be evaluated until after a lower level predicate, or operation, has been evaluated. A lower level of SQL operation in an SQL statement is generally referred to as a subquery.
Nested SQL statements may require tuple-by-tuple data manipulation in each subquery for evaluation of the complete statement. For example, each entry of a table column may need to be compared against each entry of a corresponding column in another table to determine if a SELECT operation should retrieve a table row. Such tuple-by-tuple operations are very inefficient and requires simplification and optimization.
In an RDBMS system, users submit SQL queries. The optimizer in the RDBMS optimizes the SQL query and generates an efficient execution plan. Typically, the optimizer adopts a cost-based approach wherein the optimizer generates many possible alternative execution plans, and selects the best/cheapest plan among those generated plans.
Graphical Representation of SQL Queries by Query Graph Model (QGM)
A useful tool in describing SQL queries and their evaluation is a graphical representation of SQL statements known as Query Box Representation (QBR). Known SQL optimizers employ various types of QBR to facilitate their processing and optimization efforts. QBR provides a powerful and manageable representation of queries used to reduce the complexity of query compilation and optimization procedures. Additionally, QBR provides a tool for optimizing queries by facilitating rewriting the QBR components in efficient arrangements.
Generally, a QBR representation employs hierarchically arranged xe2x80x9cboxesxe2x80x9d to represent subquery operations, such as SELECT and GROUP BY. The interconnections between the boxes define the relationship between different items of data from different subquery operations.
Various techniques may be used to store a complex query in computer memory in accordance with a QBR format. Specifically, multiple data structures in the memory are used to represent the QBR boxes and interrelationships. These data structures may comprise, for example, records and associated pointers, strings, stack arrangements, doubly linked lists, hashing arrangements, or other suitable formats. Preferably, records and pointers are used, where each record represents a QBR box, and the pointers represent the relationships between the boxes. Each record also includes a listing of various properties associated with the corresponding QBR box.
A number of different types of QBR are known in the art. One of the more popular types of QBR is known as the Query Graph Model (QGM), well known in the art, and described in the following reference, which is incorporated by reference in its entirety: Pirahesh et al., xe2x80x9cExtensible/Rule Based Query Rewrite Optimization in Starburst,xe2x80x9d Proceedings of ACM SIGMOD ""92 International Conference on Management of Data, San Diego, Calif., U.S.A., 1992.
The query graph model supports arbitrary table operations where the inputs are tables and outputs are tables. Such operations are SELECT, GROUP BY, UNION, INTERSECT, and EXCEPT operations, as defined in the SQL standard, and the restriction, projection and join operations performed by the SELECT operation.
When a query is represented by the query graph model, each operation of the query may be graphically represented by a box having a head and body. The body of a query graph model box may include xe2x80x9cquantifiersxe2x80x9d, corresponding to table references or derived tables. Quantifiers are labeled with the columns that it needs from the table it ranges over. Quantifiers may have different types. Type xe2x80x9cFxe2x80x9d (For Each) quantifiers correspond to the table references in the FROM clause of the SQL query. Type xe2x80x9cAxe2x80x9d (Universal) quantifiers correspond to the ALL subquery in the WHERE clause. Type xe2x80x9cExe2x80x9d (Existential) quantifiers correspond to the subquery associated with SQL""s predicates EXISTS, IN, ANY, and SOME, which are true if at least one tuple of the subquery satisfies the predicate. In QGM, the quantifiers associated with existential and universal subqueries are called COUNTING quantifiers. Type xe2x80x9cSxe2x80x9d (Scalar) quantifiers correspond to scalar subqueries. The database system has to enforce the rule that exactly one tuple has to be returned from the scalar subquery. If the subquery does not produce any rows, a tuple with null value(s) will be returned via the S quantifier. On the other hand, if the subquery returns more than one row, a runtime error will be raised. Additionally, quantifiers may be ordered within a box to support asymmetric operators, such as EXCEPT.
The query graph model boxes are coupled to each other by quantifier xe2x80x9cedgesxe2x80x9d, also called xe2x80x9carcsxe2x80x9d. The head of a box describes the output data stream or table produced by the operation, and the body specifies the operation required to compute the output table from records flowing into the body. The head specifies each output column, such as those specified in the SELECT clause of a query, and includes column names, types, and output ordering information. The head, body, and quantifiers of a box may each have a separate attribute called xe2x80x9cDISTINCTxe2x80x9d, to manage the production of duplicate tuples.
FIG. 1 illustrates an exemplary query graph model corresponding to the following SQL query.
SELECT DISTINCT Q1.PARTNO, Q1.DESCR, Q2.SUPPNO
FROM INVENTORY Q1, QUOTATIONS Q2
WHERE Q1.PARTNO=Q2.PARTNO AND Q1.DESCR=xe2x80x98ENGINExe2x80x99
AND Q2.PRICE less than =ALL (SELECT Q3.PRICE
FROM QUOTATIONS Q3
WHERE Q2.PARTNO=Q3.PARTNO)
This query provides information about suppliers and parts for which the supplier price is less than that of all other suppliers. The corresponding query graph model of FIG. 1 contains four boxes 100, 101, 102, 103. Two of the boxes 100 and 101 are associated with the tables INVENTORY and QUOTATIONS, respectively. The third box 102 is a SELECT box representing the main part of the query, and the box 103 is a SELECT box representing the subquery.
In accordance with the query graph model convention, each of the boxes 100-103 has a head and a body, where each head describes the output columns produced by the associated box, and the body specifies the operation required to compute the output. xe2x80x9cBasexe2x80x9d tables, exemplified by boxes 100-101, are considered to have empty or nonexistent bodies because they do not perform any operations.
Referring in greater detail to the SELECT box 102, a head 105 is shown, specifying output columns PARTNO, DESCR, and SUPPNO which correspond to the SELECT clause of the main query. The specification of these columns includes column names, types, and output ordering information. The head 105 includes a distinct attribute 106, which indicates that the associated table contains only distinct tuples and may not have duplicates.
The body of each non-base table box contains a graph, the vertices of which represent quantified tuple variables, the quantifiers. The vertices, which are represented with dark circles, are also called xe2x80x9cnodesxe2x80x9d. In the box 102, for example, vertices 108-110 correspond to quantifiers Q1, Q2, and Q4. Quantifiers Q1 and Q2 range over the INVENTORY and QUOTATIONS base tables respectively, and correspond to the table references in the FROM clause of the SQL query.
As stated above, arcs (also called xe2x80x9cinter-box edgesxe2x80x9d) connect quantifiers to other boxes or to each other. For example, the Q1 and Q2 quantifiers (nodes 108-109) are connected via arcs 112-113 to the heads of the boxes 100-101, respectively. Arcs may also interconnect quantifiers of the same box where each inter-quantifier arc represents a predicate of the WHERE clause in the query block. Such predicates are represented by labeled rectangles along the corresponding edge. Such edges are also called Boolean factors. For example, the Q1 and Q2 quantifiers are interconnected by an edge 116 that represents a join predicate. A looped edge 118 attached to the Q1 (node 108) represents a local predicate on the quantifier Q1. In the head 105 of the box 102, each output column may have an associated expression corresponding to expressions allowed in the SELECT clause of the query. In FIG. 1, all of these expressions are simple identity functions over the referenced quantified columns.
Quantifier Q4 is a UNIVERSAL quantifier is of type A, because it is associated with the ALL subquery in the WHERE clause. This means that for ALL tuples associated with Q4, the predicate represented by the edge between Q2 and Q4 is TRUE.
In Box 3, Q1 and Q2 participate in joins, and some of their columns are used in the output tuples. These quantifiers have type F (ForEach), since they come from the query""s FROM clause. Box 4 represents the example""s subquery. It contains an F quantifier Q3 over the QUOTATIONS table, and has a predicate that refers to Q2 and Q3.
The body of every box in QGM has an attribute called DISTINCT that may have one of the following values: ENFORCE, PRESERVE or PERMIT. ENFORCE means that the operation must eliminate duplicates in order to enforce head.distinct=TRUE. PRESERVE means that the operation must preserve the number of duplicates it generates. This could happen because head.distinct=FALSE, or because head.distinct=TRUE and no duplicates could exist in the output of the operation, even without duplicate elimination. PERMIT means that the operation is permitted to eliminate (or generate) duplicates arbitrarily. For example, the DISTINCT attribute of Box 4 can have the value PERMIT, because its output is used in a universal quantifier (Q4 in Box 3), and universal quantifiers are insensitive to duplicate tuples.
Like the body of every box, each quantifier also has an attribute called DISTINCT that may have a value of ENFORCE, PRESERVE or PERMIT. ENFORCE means that the quantifier requires the table over which it ranges to enforce duplicate elimination. PRESERVE means that the quantifier requires that the exact number of duplicates in the lower table be preserved. PERMIT means that the table below may have an arbitrary number of duplicates. Existential and universal quantifiers can always have DISTINCT PERMIT, since they are insensitive to duplicates.
RDBMS software supports view definitions, and a view can be used anywhere a table can be used. Views, just like queries and subqueries, may be represented with a QGM, with one or many boxes. When a view is referenced in a query, its QGM becomes part of the QGM graph of the query. Derived tables, which are similar to view definitions, can be defined anywhere a table can be used, and their QGM becomes part of the QGM graph of the query.
In the body of every box, each output column may have an associated expression corresponding to one of the expressions allowed in the select list of the query. These expressions are called head expressions. The output of a box can be used multiple times (e.g., a view may be used multiple times in the same query), creating common subexpressions.
In the remainder of this paper we draw only rough sketches of QGM graphs, omitting details that are not critical to the discussion.
This particular invention is designed to simplify and optimize a query having both an EXISTS subquery and a NOT EXISTS subquery. Conventional decision support systems must handle and process complex queries efficiently. There is a need, especially in a DBMS interactive mode of operation, for the response time to be shortened drastically, even if a large amount of data is being processed. Conventional query optimization techniques include:
1) converting EXISTS subqueries into joins so that different join methods and join orders can be considered;
2) consequently eliminating redundant join operations, because by converting EXISTS subqueries into joins, the chance of eliminating redundant join operation increases;
3) applying decorrelation techniques if the subqueries are correlated;
4) caching subquery results and re-using the caches for subsequent queries or subquery invocations.
Some of these techniques were described in the following publications. Hamid Pirahesh, Joseph Hellerstein, and Waqar Hasan authored xe2x80x9cExtensible/Rule Based Query Rewrite Optimization in STARBURST,xe2x80x9d Proceedings of ACM SIGMOD ""92 International Conference on Management of Data, San Diego, Calif., 1992. Cliff Leung, Hamid Pirahesh, Praveen Seshadri, and Joseph Hellerstein authored xe2x80x9cQuery Rewrite Optimization Rules in IBM DB2 Universal Database,xe2x80x9d Readings in Database Systems, 3rd Edition, edited by Stonebraker and Hellerstein, 1998.
However, none of the conventional optimization technique addresses the performance issues of a query in which the WHERE clause contains both an xe2x80x9cEXISTSxe2x80x9d subquery and a xe2x80x9cNOT EXISTSxe2x80x9d subquery, or EXISTS-equivalent joins, in Boolean factor. Therefore, there is further need in the art for optimization of these queries.
The foregoing and other objects, features, and advantages of the present invention will be apparent from the following detailed description of the preferred embodiment which makes reference to several drawing figures.
One preferred embodiment of the present invention includes a method of optimizing a query in a computer in which a WHERE clause contains both an xe2x80x9cEXISTSxe2x80x9d subquery and a xe2x80x9cNOT EXISTSxe2x80x9d subquery, or their equivalent subqueries.
The method has the following steps:
(a) analyzing the query to determine whether the query includes a WHERE clause which contains an xe2x80x9cEXISTSxe2x80x9d subquery or the equivalent and a xe2x80x9cNOT EXISTSxe2x80x9d subquery or the equivalent;
(b) performing a subsumption test to determine whether the subqueries are subsumed or identical;
(c) if identical, adding a FALSE predicate to the WHERE clause of the query;
(d) if subsumed and not identical, performing a transformation of the query to merge the subqueries; and
(e) executing the query in the computer to retrieve data from the relational database.
The step (d) of performing the transformation of the query includes the QGM query transformation, and it further has the steps:
stacking the xe2x80x9cNOT EXISTSxe2x80x9d subquery block (A) on top of the xe2x80x9cEXISTSxe2x80x9d subquery block (B);
adding to output columns of the block B all columns necessary for the block A;
applying a nullability test to all said columns necessary for the block A;
if not all said columns nullable, designating a non-nullable column as a column C;
if all said columns nullable, adding a column C to said output columns of the block B and filling the column C with a pre-determined constant value;
transforming a compensation predicate CP, obtained during the subsumption test of the step (b), into the form of xe2x80x9c(C IS NULL) OR ((C IS NOT NULL) AND CP); and
applying the transformed compensation predicate to the block A to merge the subqueries.
Another preferred embodiment of the present invention includes a program storage device storing the query optimization program, according to the method of the present invention.
Yet another preferred embodiment of the present invention includes an apparatus for optimizing a query, according to the method of the present invention.