1. Field
Embodiments of the invention relate to cost-based subquery correlation and decorrelation.
2. Description of the Related Art
Relational DataBase Management System (RDBMS) software using a Structured Query Language (SQL) interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Standards Organization (ISO).
A query may be described as a request for information from a database based on specific conditions. A query may contain a subquery. If the query is a first (or “outer”) SELECT statement, the subquery may be described as a second (or “inner”) SELECT statement within a WHERE or HAVING clause of the first select statement. The subquery typically includes a search condition that refers to a value or set of values needed for the first query to be answered.
In conventional systems, a query optimization phase is separated into two stages: a “query rewrite” stage and a “query optimization” stage. The query rewrite stage performs semantic checking and query transformations. That is, the query may be rewritten (“transformed”) during the query rewrite stage. The query optimization stage takes the transformed query and chooses an efficient access path based on how the query was transformed. An access path may be described as the technique that is selected for retrieving data from a specific table. For example, an access path can involve the use of an index, a sequential scan, or a combination of the two. Thus, execution of the query may be optimized during the query optimization phase of processing.
A subquery correlation may be described as a subquery that contains a correlated reference to a column of a table that is outside the subquery. For example, a subquery correlation may include a subquery that is part of a WHERE or HAVING clause that is applied to a row or group of rows of a table or view that is named in an outer subselect statement. Correlation may be described as adding a correlated predicate to a subquery, and decorrelation may be described as removing a correlated predicate from the subquery.
Some transformations, such as subquery correlation and decorrelation do not always result in a more efficient access path and may cause performance degradation. The transformation that provides the most efficient access path is not known until after the query optimization stage, but the query transformation occurs before the optimization. Thus, at query rewrite time, the best transformation is not known.
For example in query (1), the correlated subquery should be decorrelated:
SELECT * FROM BIG_TABLE A WHERE(1)EXISTS (SELECT 1 FROM SMALL_TABLE B WHEREB.COL = A.COL)
In query (1), the SELECT 1 statement is a subquery. Also, the condition B.COL=A.COL refers to table A, which is in the outer select, and this reference is an example of a correlated subquery. For this example, it is assumed that BIG_TABLE is very large, SMALL_TABLE is very small, and a unique index is available on BIG_TABLE.COL. The access path for query (1), as written, would be to scan BIG_TABLE and probe SMALL_TABLE for every row of BIG_TABLE. This, however, requires more processing than query (2), which results from transforming query (1) during a query rewrite stage.
SELECT * FROM BIG_TABLE A WHERE(2)A.COL IN (SELECT B.COL FROM SMALL_TABLE B)
In transformed query (2), the subquery has been decorrelated, allowing SMALL_TABLE to be accessed first and then joined to BIG_TABLE (after duplicates are removed from SMALL_TABLE) using the unique index on BIG_TABLE. In this example, decorrelating the subquery resulted in better performance.
However, if the same tables were switched, then decorrelating the subquery would cause performance to degrade. Query (3) provides an example of switching the tables:
SELECT * FROM SMALL_TABLE B(3)WHERE EXISTS (SELECT 1 FROM BIG_TABLE AWHERE A.COL = B.COL)
In query (3), the subquery (i.e., the SELECT 1 statement) should not be decorrelated to remove the reference to B.COL. However, the only way to know whether performance would be better with a correlated or decorrelated subquery would be to determine a cost for each subquery and compare the costs. This means that in the query rewrite stage, the query is rewritten in correlated form and in decorrelated form. Then, the query optimization stage is invoked for each version of the rewritten query. For a query that contains a single subquery, this effectively doubles the amount of optimization processing needed to optimize the query. For queries that contain multiple subqueries, optimization processing may increase exponentially.
Therefore, there is a need in the art for improved query rewrite and optimization.