1. Field of the Invention
The present invention relates to a method, system, and program for processing queries involving set operations.
2. Description of the Related Art
Data records in a relational database management system (RDBMS) in a computer are maintained in tables, which are a collection of rows all having the same columns. Each column maintains information on a particular type of data for the data records which comprise the rows. One or more indexes may be associated with each table. An index is an ordered set of pointers to data records in the table based on the data in one or more columns of the table. In some cases, all the information needed by a query may be found in the index, making it unnecessary to search the actual table. An index is comprised of rows or index entries which include an index key and a pointer to a database record in the table having the key column values of the index entry key. An index key is comprised of key columns that provide an ordering to records in a table. The index key columns are comprised of the columns of the table, and may include any of the values that are possible for that particular column. Columns that are used frequently to access a table may be used as key columns. Using an index to search and access rows in the associated table substantially improves query performance.
Database tables may be accessed using the Structured Query Language (SQL), which comprises a recognized language to query, access and manipulate data in a database. The SQL language includes set operators UNION, UNION ALL, INTERSECT, INTERSECT ALL, EXCEPT, and EXCEPT ALL. The UNION operator produces a result table that has all rows from two input tables, which are the result of query operations of some base tables (each may contain more than one), the INTERSECT operator produces a result table that has all rows in common between two input tables, and the EXCEPT operator produces a result table having all rows that are in a first table, but not a second table. The input tables to set operators are generally result tables that result from either subselects or fullselects that contain set operators. The ALL suffix used with these set operators, e.g., UNION ALL, includes all duplicate rows from the result table, whereas using the set operator without the ALL suffix, e.g., UNION, removes duplicate rows from the result table.
Complex database queries may use the set operators UNION, INTERSECT, and EXCEPT in combination with other query operations, such as a JOIN. A join query forms all combinations of rows from two or more tables according to a join condition, such that the combined rows must satisfy the predicates of the join condition. One performance disadvantage with performing subsequent query operations, such as join operations, on tables derived from the set operators is that the table derived from the set operator is materialized on a hard disk drive (xe2x80x9cdiskxe2x80x9d) before the query operation can be performed thereon. For instance, for a UNION operation, the rows of the two combined input tables are materialized in a work file on disk. Subsequent query operations would then have to read the rows from the work file materialized, i.e., written, on disk. Disk Input/Output (I/O) operations have significant overhead and adversely affect query performance because of the time required to write the result table from memory to disk, and then read back the rows of the result table from disk to use in the subsequent query operations. Moreover, performance with respect to the subsequent query operations is further degraded as the size of the materialized result table increases, because a larger materialized result table requires more disk I/O operations.
Materializing a result table derived from the set operators also harms performance because there is no index available when applying the query operation to the materialized result table. Indexes can substantially improve query performance, especially join query performance. Although an index could be generated for the materialized result table for use in the subsequent query operations, the time required to generate such an index would adversely affect performance. Moreover, generating a new index for the materialized result table does not avoid the need for the disk I/O operations, which, as discussed above, adversely affect the performance of subsequent query operations using the materialized result table.
For all these reasons, there is a need in the art to develop improved query optimization techniques to handle query operations applied to result tables derived from set operators.
Provided is a method, system, and program for processing a query including a query operation on a table derived from a set operation on two input tables. The query operation is performed on each input table separately to produce two intermediate result tables. The set operator is then applied to the two intermediate result tables to produce a final result table that is a same result table that would have been produced by performing the query operation on the table derived from the set operation performed on the two input tables.
In further implementation, the final result table is generated without having to materialize in a storage device any intermediate result tables.
Still further, the input tables may comprise base tables having indexes, and wherein the indexes on the two base tables may be used to perform the query operation on each input table separately.
In certain implementation, the set operation comprises one of a UNION ALL, EXCEPT ALL or INTERSECT ALL operation. In such case, if the query operation comprises a WHERE clause including predicates to apply to the table derived from the set operation on the two input tables, then the WHERE clause is applied to each of the input tables separately. Further, if the query operation comprises a JOIN of a table to the table derived from the set operation on the two input tables, then the JOIN of the table is applied to each of the input tables separately to produce the two intermediate result tables.
In implementations where the query operation comprises a column function and the set operation comprises a UNION ALL, then the column function is applied to each of the input tables separately to produce at least one value for each input table subject to the set operation. The at least one value for each of the input tables is aggregated to produce at least one aggregate value that would have been produced by applying the column function to the UNION ALL of the two input tables.
The described implementations provide techniques to optimize query processing when query operations are performed on tables derived from set operators.