1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular to the simplification and optimization of derived tables that return exactly one row and scalar subqueries in a database management system.
2. Description of Related Art
Databases are computerized information storage and retrieval systems. A Relational Database Management System (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. RDBMS software using a Structured Query Language (SQL) interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Organization (ANSI) and the International Standards Organization (ISO).
In RDBMS software all data is externally structured into tables. The SQL interface allows users to formulate relational operations on the tables either interactively, in batch files, or embedded in host language, such as C, COBOL, etc. Operators are provided in SQL that allow the user to manipulate the data, wherein each operator operates on either one or more tables and produces a new table as a result. The power of SQL lies on its ability to link information from multiple tables or views together to perform complex sets of procedures with a single statement.
One of the most common SQL queries executed by RDBMS software is the SELECT statement. In the SQL standard, the SELECT statement generally has the format: xe2x80x9cSELECT  less than clause greater than  FROM  less than clause greater than  WHERE  less than clause greater than  GROUP BY  less than clause greater than  HAVING  less than clause greater than  ORDER BY  less than clause greater than .xe2x80x9d The clauses generally must follow this sequence. Only the SELECT and FROM clauses are required and all other clauses are optional.
Generally, the result of a SELECT statement is a subset of data retrieved by the RDBMS software from one or more existing tables stored in the relational database, wherein the FROM clause identifies the name of the table or tables from which data is being selected. The subset of data is treated as a new table, termed the result table. The WHERE clause determines which rows should be returned in the result table. Generally, the WHERE clause contains a search condition that must be satisfied by each row returned in the result table. The rows that meet the search condition form an intermediate set, which is then processed further according to specifications in the SELECT clause.
The search condition typically comprises one or more predicates, each of which specifies a comparison between two values from certain columns, constants or correlated values. Multiple predicates in the WHERE clause are typically connected by Boolean operators.
The SELECT statement may also include a grouping function indicated by the GROUP BY clause. The GROUP BY clause causes the rows in the intermediate result set to be grouped according to the values specified in the clause. A number of column or aggregate functions are also built into SQL, such as MAX (maximum value in column), MIN (minimum value in column), AVG (average of values in column), SUM (sum of values in column), and COUNT (number of rows).
Queries using aggregate expressions return as many result rows as there exist unique xe2x80x9cresult groupsxe2x80x9d in the source of the aggregation. A result group is defined as the set of values contained in the fields of a row corresponding to the list of columns or expressions specified in the GROUP BY clause of the query. The value of the aggregate function is the result of applying the function to the result of the contained expression for each row having the same group value.
Another operation permitted by SQL is the JOIN operation, which concatenates horizontally all or parts of two or more tables to create a new resulting table. The JOIN operation is implied by naming more than one table in the FROM clause of a SELECT statement.
An SQL query generally includes at least one predicate, which is an SQL expression that can assume a logical value of TRUE, FALSE, or UNKNOWN. A predicate typically either specifies a data range, tests for an existence condition, tests for equivalence, or performs a similar column comparison operation.
In RDBMSs, columns of any type can assume NULL (i.e., unknown) values. In RDBMS software, NULL values are properly handled using tri-value logic (i.e., TRUE, FALSE or UNKNOWN) for predicates, and SQL-based RDBMSs employ such logic.
SQL queries can have subqueries or derived tables where the nesting can be arbitrarily deep.
If we have a view defined on table S and use the view in a query as follows:
CREATE VIEW V AS (SELECT C1, C2 FROM S)
SELECT*
FROM T, V
WHERE T.C1=V.C1
the following query uses an equivalent derived table DT in lieu of the view definition:
SELECT*
FROM T, TABLE(SELECT C1, C2 FROM S) AS DT(C1, C2)
WHERE T.C1=DT.C1
Nested SQL statements may require tuple-by-tuple data manipulation in each subquery for evaluation of the complete statement. For example, each entry of a table column may need to be compared against each entry of a corresponding column in another table to determine if a SELECT operation should retrieve a table row. Such tuple-by-tuple operations are very inefficient and require simplification and optimization.
In an SQL system, queries are received by the SQL interface and are rewritten in an SQL processor from the input format provided by the user into generally standard SQL language. The SQL processor is sometimes implemented as an SQL compiler. To evaluate the query, an SQL execution plan is generated by the SQL processor from the rewritten SQL code and is provided to an SQL optimizer, which determines the best implementation of the execution plan.
Graphical Representation of SQL Queries by Query Graph Model (QGM)
A useful tool in describing SQL queries and their evaluation is a graphical representation of SQL statements known as Query Box Representation (QBR). Known SQL optimizers employ various types of QBR to facilitate their processing and optimization efforts. QBR provides a powerful and manageable representation of queries used to reduce the complexity of query compilation and optimization procedures. Additionally, QBR provides a tool for optimizing queries by facilitating rewriting the QBR components in efficient arrangements.
Generally, a QBR representation employs hierarchically arranged xe2x80x9cboxesxe2x80x9d to represent subquery operations, such as SELECT and GROUP BY. The interconnections between the boxes define the relationship between different items of data from different subquery operations.
Various techniques may be used to store a complex query in computer memory in accordance with a QBR format. Specifically, multiple data structures in the memory are used to represent the QBR boxes and interrelationships. These data structures may comprise, for example, records and associated pointers, strings, stack arrangements, doubly linked lists, hashing arrangements, or other suitable formats. Preferably, records and pointers are used, where each record represents a QBR box, and the pointers represent the relationships between the boxes. Each record also includes a listing of various properties associated with the corresponding QBR box.
A number of different types of QBR are known in the art. One of the more popular types of QBR is known as the Query Graph Model (QGM), well known in the art, and described in the following reference, which is incorporated by reference in its entirety: Pirahesh et al., xe2x80x9cExtensible/Rule Based Query Rewrite Optimization in Starburst,xe2x80x9d Proceedings of ACM SIGMOD ""92 International Conference on Management of Data, San Diego, Calif., U.S.A., 1992.
The query graph model supports arbitrary table operations where the inputs are tables and outputs are tables. Such operations are SELECT, GROUP BY, UNION, INTERSECT, and EXCEPT operations, as defined in the SQL standard, and the restriction, projection and join operations performed by the SELECT operation.
When a query is represented by the query graph model, each operation of the query may be graphically represented by a box having a head and body. The body of a query graph model box may include xe2x80x9cquantifiersxe2x80x9d, corresponding to table references or derived tables. Quantifiers are labeled with the columns that it needs from the table it ranges over. Quantifiers may have different types. Type xe2x80x9cFxe2x80x9d (For Each) quantifiers correspond to the table references in the FROM clause of the SQL query. Type xe2x80x9cAxe2x80x9d (Universal) quantifiers correspond to the ALL subquery in the WHERE clause. Type xe2x80x9cExe2x80x9d (Existential) quantifiers correspond to the subquery associated with SQL""s predicates EXISTS, IN, ANY, and SOME, which are true if at least one tuple of the subquery satisfies the predicate. In QGM, the quantifiers associated with existential and universal subqueries are called COUNTING quantifiers. Type xe2x80x9cSxe2x80x9d (Scalar) quantifiers correspond to scalar subqueries. The database system has to enforce the rule that exactly one tuple has to be returned from the scalar subquery. If the subquery does not produce any rows, a tuple with null value(s) will be returned via the S quantifier. On the other hand, if the subquery returns more than one rows, a runtime error will be raised. Additionally, quantifiers may be ordered within a box to support asymmetric operators, such as EXCEPT.
The query graph model boxes are coupled to each other by quantifier xe2x80x9cedgesxe2x80x9d, also called xe2x80x9carcsxe2x80x9d. The head of a box describes the output data stream or table produced by the operation, and the body specifies the operation required to compute the output table from records flowing into the body. The head specifies each output column, such as those specified in the SELECT clause of a query, and includes column names, types, and output ordering information. The head, body, and quantifiers of a box may each have a separate attribute called xe2x80x9cDISTINCTxe2x80x9d, to manage the production of duplicate tuples.
FIG. 1 illustrates an exemplary query graph model corresponding to the following SQL query.
This query provides information about suppliers and parts for which the supplier price is less than that of all other suppliers. The corresponding query graph model of FIG. 1 contains four boxes 100, 101, 102, 103. Two of the boxes 100 and 101 are associated with the tables INVENTORY and QUOTATIONS, respectively. The third box 102 is a SELECT box representing the main part of the query, and the box 103 is a SELECT box representing the subquery.
In accordance with the query graph model convention, each of the boxes 100-103 has a head and a body, where each head describes the output columns produced by the associated box, and the body specifies the operation required to compute the output. xe2x80x9cBasexe2x80x9d tables, exemplified by boxes 100-101, are considered to have empty or nonexistent bodies because they do not perform any operations.
Referring in greater detail to the SELECT box 102, a head 105 is shown, specifying output columns PARTNO, DESCR, and SUPPNO which correspond to the SELECT clause of the main query. The specification of these columns includes column names, types, and output ordering information. The head 105 includes a distinct attribute 106, which indicates that the associated table contains only distinct tuples and may not have duplicates.
The body of each non-base table box contains a graph, the vertices of which represent quantified tuple variables, the quantifiers. The vertices, which are represented with dark circles, are also called xe2x80x9cnodesxe2x80x9d. In the box 102, for example, vertices 108-110 correspond to quantifiers Q1, Q2, and Q4. Quantifiers Q1 and Q2 range over the INVENTORY and QUOTATIONS base tables respectively, and correspond to the table references in the FROM clause of the SQL query.
As stated above, arcs (also called xe2x80x9cinter-box edgesxe2x80x9d) connect quantifiers to other boxes or to each other. For example, the Q1 and Q2 quantifiers (nodes 108-109) are connected via arcs 112-113 to the beads of the boxes 100-101, respectively. Arcs may also interconnect quantifiers of the same box where each inter-quantifier arc represents a predicate of the WHERE clause in the query block. Such predicates are represented by labeled rectangles along the corresponding edge. Such edges are also called Boolean factors. For example, the Q1 and Q2 quantifiers are interconnected by an edge 116 that represents a join predicate. A looped edge 118 attached to the Q1 (node 108) represents a local predicate on the quantifier Q1. In the head 105 of the box 102, each output column may have an associated expression corresponding to expressions allowed in the SELECT clause of the query. In FIG. 1, all of these expressions are simple identity functions over the referenced quantified columns.
Quantifier Q4 is a UNIVERSAL quantifier is of type A, because it is associated with the ALL subquery in the WHERE clause. This means that for ALL tuples associated with Q4, the predicate represented by the edge between Q2 and Q4 is TRUE.
In Box 3, Q1 and Q2 participate in joins, and some of their columns are used in the output tuples. These quantifiers have type F (ForEach), since they come from the query""s FROM clause. Box 4 represents the example""s subquery. It contains an F quantifier Q3 over the QUOTATIONS table, and has a predicate that refers to Q2 and Q3.
The body of every box in QGM has an attribute called DISTINCT that may have one of the following values: ENFORCE, PRESERVE or PERMIT. ENFORCE means that the operation must eliminate duplicates in order to enforce head.distinct=TRUE. PRESERVE means that the operation must preserve the number of duplicates it generates. This could happen because head.distinct=FALSE, or because head.distinct=TRUE and no duplicates could exist in the output of the operation, even without duplicate elimination. PERMIT means that the operation is permitted to eliminate (or generate) duplicates arbitrarily. For example, the DISTINCT attribute of Box 4 can have the value PERMIT, because its output is used in a universal quantifier (Q4 in Box 3), and universal quantifiers are insensitive to duplicate tuples.
Like the body of every box, each quantifier also has an attribute called DISTINCT that may have a value of ENFORCE, PRESERVE or PERMIT. ENFORCE means that the quantifier requires the table over which it ranges to enforce duplicate elimination. PRESERVE means that the quantifier requires that the exact number of duplicates in the lower table be preserved. PERMIT means that the table below may have an arbitrary number of duplicates. Existential and universal quantifiers can always have DISTINCT=PERMIT, since they are insensitive to duplicates.
RDBMS software supports view definitions, and a view can be used anywhere a table can be used. Views, just like queries and subqueries, may be represented with a QGM, with one or many boxes. When a view is referenced in a query, its QGM becomes part of the QGM graph of the query. Derived tables, which are similar to view definitions, can be defined anywhere a table can be used, and their QGM become part of the QGM graph of the query.
In the body of every box, each output column may have an associated expression corresponding to one of the expressions allowed in the select list of the query. These expressions are called head expressions. The output of a box can be used multiple times (e.g., a view may be used multiple times in the same query), creating common subexpressions.
In the remainder of this paper we draw only rough sketches of QGM graphs, omitting details that are not critical to the discussion.
This particular invention is designed to simplify and optimize scalar subqueries, although it may be applied to derived tables as well. By definition, scalar subquery returns one and only one row. Therefore, there are two requirements that a DBMS must honor when dealing with scalar subqueries. First, if the subquery returns an empty set, a row of null(s) will be returned instead. Second, if the evaluation of the subquery returns more than one row, a runtime error is generated. Generally speaking, we refer to these subqueries and derived tables as xe2x80x9csubselect expressionsxe2x80x9d.
Because of the above two properties, scalar subqueries are more difficult to optimize. Many researchers have tried to transform a scalar subquery to a join, which is much easier to optimize, and to define conditions under which such transformation is valid. According to one such example, a valid condition exists if a query optimizer can show that the maximal number of rows to be returned by the subquery is one, and that, when a row of nulls is returned by the subquery, the row will not change the query answer set.
Some of the conditions summarized in the prior art are defined in the publication by Cliff Leung, Hamid Pirahesh, Praveen Seshadri, and Joseph Hellerstein entitled xe2x80x9cQuery Rewrite Optimization Rules in IBM DB2 Universal Database,xe2x80x9d Readings in Database Systems, 3rd Edition, edited by Stonebraker and Hellerstein-, 1998.
A more recent research demonstrates an optimization technique where a scalar subquery can be converted into a join using subsumption techniques. It is described in U.S. patent application 09/450,795, entitled xe2x80x9cExploitation of Subsumption in Optimizing Scalar Subqueries, with filing date (Aug. 13, 1998 Filed in US; Co-inventor: Pirahesh, M. H.)
While there have been various techniques developed for optimizing the performance of SQL statements by transforming scalar subquery and derived tables which produce exactly or at most one tuple, there is further need in the art for optimization of these queries.
The foregoing and other objects, features, and advantages of the present invention will be apparent from the following detailed description of the preferred embodiment which makes reference to several drawing figures.
One preferred embodiment of the present invention includes a method of optimizing a query in a computer by merging quantifiers, the query being performed by the computer to retrieve data from a database stored in an electronic storage device coupled to the computer. The method analyzes the query to determine whether the query includes at least two subselect expressions, and performs the subsumption test to determine whether the two subselects produce the same row sets except the output columns. If so, it determines whether each subselect returns at most one tuple from the database. If so, it performs merging the two subselects into a single subselect, and merging quantifiers corresponding to the merged subselect expressions.
Another preferred embodiment of the present invention includes a program storage device storing the optimization program for merging quantifiers.
Yet another preferred embodiment of the present invention includes an apparatus for optimizing a query according to the method for merging quantifiers.
Still another preferred embodiment of the present invention includes a method of optimizing a query in a computer, by eliminating subqueries, the subquery being performed by the computer to retrieve data from a database stored in an electronic storage device coupled to the computer. The method analyzes the query to determine whether the query includes at least one subselect. If so, it determines whether the subselect returns exactly one tuple from the database. If so, it determines whether the result set returned from the subselect is not required for query answer set. If so, the method eliminates the subselect expression and the quantifier corresponding to the subselect.
Another preferred embodiment of the present invention includes a program storage device storing the optimization program for eliminating quantifiers.
Yet another preferred embodiment of the present invention includes an apparatus for optimizing a query according to the method for eliminating quantifiers.