The present invention relates to query optimization, and in particular, optimizing queries by rewriting the queries.
Storing and retrieving large amounts of data are some of the most important functions of computers in today""s society. To carry out these functions, database systems are typically used to retrieve and store data in databases. Database systems developed so far have performed these functions very successfully, creating for society the ability to retrieve data at speeds and quantities previously unimagined, and giving society an unprecedented level of access to information. The success of database systems has unleashed an insatiable demand for even faster and more efficient database systems that process even greater quantities of data.
In a database management system (DBMS), data is stored in one or more data containers, such as tables. The term table is used to refer to any set of data that is processed as a set of records, each record being organized into one or more fields. In relational database systems, the records are referred to as rows, and the fields are referred to as columns. A table may be a data container with rows and columns, or the results of a query, including subqueries. A subquery is a query within a query, and shall be described later in further detail.
In object-oriented databases, the data containers correspond to object classes, the records are referred to as objects, and the fields are referred to as attributes. Other database architectures may use other terminology.
Systems that implement the present invention are not limited to any particular type of data container or database architecture. However, for the purpose of explanation, the examples and the terminology used herein shall be that typically associated with relational databases. Thus, the terms row and column shall be used herein to refer respectively to record and field.
Database systems retrieve information in response to receiving queries that specify the information to retrieve. In order for a database system to understand the query, the query should conform to a database language recognized by the database system, such as the ANSI Structured Query Language (SQL). To illustrate the structure of a query, the following query E1 is provided.
SELECT T.a FROM T WHERE T.b=1 and T.c=1
Query E1, when executed by a database system that supports SQL, retrieves column T.a of table T from rows containing the value 1 in column T.b of table T. The above query includes a SELECT clause (i.e. xe2x80x9cSELECT T.axe2x80x9d), a FROM clause (i.e. xe2x80x9cFROM Txe2x80x9d), and a WHERE clause (i.e. xe2x80x9cT.b =1 and T.c=1xe2x80x9d). The FROM clause references one or more tables from which to retrieve values. The tables in the FROM clause are collectively referred to as the FROM list. The SELECT clause specifies one or more columns in the items in the FROM list from which to retrieve values. The one or more columns in the SELECT clause are collectively referred to as the SELECT list.
The WHERE clause specifies the rows from which the values are retrieved. Specifically, the WHERE clause contains one or more conditions defining criterion that must be met by a column(s) of the rows from which values are retrieved. The conditions in the WHERE clause are referred to as predicates. Query E1 contains the predicates xe2x80x9cT.b=1xe2x80x9d and xe2x80x9cT.c=1xe2x80x9d. The term xe2x80x9cexpressionxe2x80x9d is used to refer to a logical expression based on one or more predicates. Query E1 contains the expression xe2x80x9cT.b=1 and T.c=1xe2x80x9d. The criteria specified by the WHERE clause is referred to herein collectively as filtering criteria.
When a database system executes a query, it generates results in the form of a table. Such a table is referred to herein as a result set. The result set has the columns specified in the SELECT list of the query.
One of the most important functions for data generation and retrieval performed by a database system is the generation of aggregated information. Aggregated information is information derived by applying an aggregate function to the values in a column of a subset of rows in a xe2x80x9cbase tablexe2x80x9d. Examples of aggregate functions are functions that sum values, calculate averages, and determine minimum and maximum values. The column that contains the values to which an aggregate function is applied is referred to as the measure.
The subsets of rows to which an aggregate function is applied are determined by values in a xe2x80x9cgroup-byxe2x80x9d column. The aggregate information generated by a database system is presented as a result set having the group-by column and the measure column. In particular, the result set has one row for each unique value in the group-by column. Each row in the result set corresponds to the group of rows in the base table containing the value for the group-by column of the row. The measure column in the row contains the output of the aggregate function applied to the values in the measure column of the group of rows.
Aggregate information is generated by a database system in response to receiving an aggregate query. An aggregate query specifies a group-by column, the measure column, and the aggregate function to apply to the measure values. The following query E2 is provided as an illustration.
Table t contains data representing the sales of an organization. Each row represents a particular sales transaction. For a particular row in table t, column d contains the date of the sales transaction, and s contains the sale amount.
The SELECT clause in query E2 contains xe2x80x9csum(s)xe2x80x9d, which specifies that the aggregate function xe2x80x9csumxe2x80x9d is to be applied to the values in column s of table t. The query E2 also includes the group-by clause xe2x80x9cGroup by dxe2x80x9d, which denotes column d as the group-by column.
Execution of query E2 generates a result set with a column for d and sum (s). A particular row in the result set represents the total sales for all sale transactions in a given day. Specifically, for a particular row in the result set, d contains a unique date value from table t for column d. Column sum_s contains the sum of the sales amount values in column s for the group of rows from t that have the unique date value in column d.
It is often useful to generate aggregate information grouped by multiple columns. For example, table t may also contain column r, a column containing values representing regions.
It is may be useful to generate a result set that summarizes sales by region, and for each region, sales date. Such a result set may be generated by referencing column r and d in the group-by clause, as illustrated by the following query E3.
Another useful way to provide aggregate information is to generate one result set that groups data by various combinations of columns. For example, a result set may contain a set of rows grouped by region and date, and a set of rows grouped only by region. Such a result set may be generated by submitting a query that includes multiple subqueries operated upon by the union operator.
The union operator is used to specify that the result set of a subquery is to be merged with the result set of another subquery. The following query E3 is used to illustrate how the union operator may be used to specify a result set that groups by multiple combinations of group-by columns.
Query E3 specifies four subqueries, each with a group-by clause specifying a different combination of columns by which to group rows. A particular group of columns by which to group rows is referred to herein after as a grouping. The first subquery in query E3 contains the group-by clause xe2x80x9cgroup by a, bxe2x80x9d, thus specifying a grouping of (a,b). The remaining subqueries respectively specify the groupings (a), (b), and ( ). xe2x80x9c( )xe2x80x9d denotes the grand total, specifying no group-by column.
The union operator has two operands, a table that precedes the operand and a table that follows the operand. The operands may designate any form of a table, including base tables, views, or the result set of a subquery. Generally, the subquery that precedes the first union operator in a union query dictates the number order, and data type of the columns of the result set of the union query. The remaining subqueries whose result sets are to be merged should specify the same number columns.
Execution of the query E3 generates a result set with three columns and four sets of rows. The three columns include a, b, and sum(m). The first set of rows is grouped according to (a,b), the second set according to (a), the third according (b), and the fourth according to ( ).
All rows generated for all groupings will include a column for every group-by column. However, for the rows of a particular grouping that do not include a particular group-by column, the rows contain a null value for that column. Thus, all rows for grouping (a) will contain the null value in b. All rows for the grouping (b) will contain the null value in a. All rows for the grouping ( ) will contain the null value in a and b.
While union queries may be used to generate a result set with multiple groupings, they can be very tedious to write. The programmer of the subquery must write a subquery for each desired grouping, which may become extraordinarily tedious when the number of groupings desired is relatively large.
To avoid this burden, SQL defines extended group-by operators. Extended group-by operators include cube, rollup, and grouping sets. The group-by operators are used to specifying groupings that include various combinations of the columns specified as arguments to the functions. For example, using the cube function, an equivalent query of query E1 (i.e. one that produces the same result set) may be written as follows.
The cube function denotes that the result should reflect groupings for all possible combinations of the argument columns. Thus, the function xe2x80x9ccube (a,b)xe2x80x9d specifies groupings (a,b), (a), (b), and ( ).
Aggregate queries may be xe2x80x9cinner queriesxe2x80x9d. An inner query is a subquery whose result set is referenced as a table in the From clause of an xe2x80x9couter queryxe2x80x9d. When database systems execute a query containing an outer query and an inner query, the inner query is sometimes computed first. This is typical for inner queries which are hard to merge into outer queries, like inner queries with aggregation. Consequently, the entire result set of the inner query is generated before computation of the outer query.
The predicates of the outer query may reference columns of the result set of the inner query. When the outer query is executed, the rows from the result set generated for the inner query may be filtered. Sometimes, the entire set of rows for a grouping may be eliminated. To illustrate how an entire grouping may be eliminated in this way, the following query E4 is provided.
The inner query includes the extended group-by operator xe2x80x9ccube (a,b)xe2x80x9d. Computation of the inner query generates a result set with rows for groupings (a,b), (a), (b), and ( ). However, rows for groupings (a), (b), and ( ) can not possible satisfy the filtering criteria of the outer query in E4 because the rows for these groupings will contain a NULL value either in v_cube.a or v_cube.b. Therefore, rows are computed for groupings that cannot possibly satisfy criteria imposed by the outer query.
When the rows for a grouping specified by an inner query cannot possibly satisfy the criteria imposed by the outer query, computational resources are wasted generating data for rows that can not possibly affect the content of the result set of the outer query. It is therefore clearly desirable to provide a mechanism that avoids expending computational resources to generate rows for groupings that cannot possibly satisfy the criteria imposed by the predicates of the outer query.
Described herein is a system for rewriting queries so that they may be executed more efficiently. Queries that reference the result set of an aggregate query with multiple groupings are rewritten to reference another aggregate query in the form of an inner query that omits groupings that cannot possibly satisfy the criteria imposed by the predicates of the outer query. Thus, when the inner query is computed, only rows for groupings that satisfy the criteria are generated, conserving resources that would otherwise be wasted by generating rows that could not possibly satisfy the criteria.