The present invention relates generally to database management systems and, more specifically, to techniques for efficiently evaluating database queries including a grouping sets specification.
In a database management system (DBMS), data is stored in one or more data containers, each container contains records, and the data within each record is organized into one or more fields. In relational database management systems, the data containers are referred to as tables, the records are referred to as rows, and the fields are referred to as columns. In object oriented databases, the data containers are referred to as object classes, the records are referred to as objects, and the fields are referred to as attributes. Other database architectures may use other terminology.
Database management systems retrieve information in response to receiving queries that specify the information to retrieve. In order for a database management system to understand the query, the query should conform to a database language recognized by the database management system, such as the Structured Query Language (SQL).
In an OLAP (on-line analytical processing) environment or a data warehousing environment, data is often organized into a star schema. A star schema is distinguished by the presence of one or more relatively large tables and several relatively smaller tables. Rather than duplicating the information contained in the smaller tables, the large tables contain references (foreign key values) to rows stored in the smaller tables. The larger tables within a star schema are referred to as xe2x80x9cfact tablesxe2x80x9d, while the smaller tables are referred to as xe2x80x9cdimension tablesxe2x80x9d. Typically, each dimension has xe2x80x9clevelsxe2x80x9d which correspond to columns of the dimension table, which are organized in a hierarchical manner. For example, a TIME dimension might consist of the levels year, quarter, month, and day, corresponding to columns of the TIME table. The hierarchical organization is such that years consist of quarters, quarters consist of months, and months consist of days.
An important function for data generation and retrieval performed by a database management system is the generation of aggregated information. Aggregated information is information derived by applying an aggregate function to the values in a column of a subset of rows in a table or on the result of a join of two or more tables. Examples of aggregate functions are functions that sum values, calculate averages, and determine minimum and maximum values. The column that contains the values to which an aggregate function is applied is referred to as the measure.
The subsets of rows to which an aggregate function is applied are determined by values in xe2x80x9cgroup-byxe2x80x9d columns. The aggregate information generated by a database management system is presented as a result set having the group-by column(s) and the aggregated measure column. In particular, the result set has one row for each unique value in the group-by column. Each row in the result set corresponds to the group of rows in the base table containing the value for the group-by column of the row. The measure column in the row contains the output of the aggregate function applied to the values in the measure column of the group of rows.
Aggregate information is generated by a database management system in response to receiving an aggregate query. An aggregate query specifies a group-by column, the aggregate measure column, and the aggregate function to apply to the measure values. The following query is provided as an illustration.
SELECT d, SUM(S) sum_s
FROM t
GROUP BY d
Table t contains data representing the sales of an organization. Each row represents a particular sales transaction. For a particular row in table t, column d contains the date of the sales transaction, and s contains the sale amount.
The SELECT clause contains xe2x80x9cSUM(s)xe2x80x9d, which specifies that the aggregate function xe2x80x9csumxe2x80x9d is to be applied to the values in column s (aggregate measure) of table t. The query also includes the group-by clause xe2x80x9cGROUP BY dxe2x80x9d, which denotes column d as the group-by column.
Execution of this query generates a result set with a column for d and a column for sum (s). A particular row in the result set represents the total sales (s) for all sale transactions in a given day (d). Specifically, for a particular row in the result set, d contains a unique date value from table t for column d. Column sum_s contains the sum of the sales amount values in column s for the group of rows from t that have the unique date value in column d.
It is often useful to generate aggregate information grouped by multiple columns. For example, table t may also contain column r, a column containing values representing regions. It is may be useful to generate a result set that summarizes sales by region, and for each region, sales date. Such a result set may be generated by referencing column r and d in the group-by clause, as illustrated by the following query.
SELECT d, r SUM (s)
FROM t
GROUP BY r, d
Furthermore, often a query that includes an aggregate function specifies that information from one table is to be aggregated by groupings defined by keys of another table.
A GROUP BY clause (or operator) groups a database query result into subsets that have matching values for one or more columns. GROUP BY clauses are typically used in conjunction with aggregate functions, thus allowing mathematical operations to be performed on columns specified as arguments in the aggregate functions. The term xe2x80x9csimple GROUP BY clausexe2x80x9d is used herein in reference to a GROUP BY clause with no extension operations/operators, e.g., GROUP BY r, d, as shown above.
A useful way to provide information is to generate one result set that groups data by various combinations of columns. For example, a result set may contain a set of rows grouped by region and date, and a set of rows grouped only by region. Such a result set may be generated by submitting a query that includes multiple subqueries operated upon by the union operator. While union queries may be used to generate a result set with multiple groupings, they can be very tedious to write. The programmer of the subquery must write a subquery for each desired grouping, which may become extraordinarily burdensome when the number of groupings desired is relatively large. Furthermore, such queries are very inefficient to execute, as some tables are accessed multiple times.
To avoid this burden, SQL defines extended group-by operators. Extended group-by operators include cube, rollup, and grouping sets. The group-by operators are used to specify groupings that include various combinations of the columns specified as arguments to the operators. For example, using the rollup operator, a query may be written as follows.
SELECT year, quarter, month, SUM(sales)
FROM fact.time
WHERE fact.tkey=time.tkey
GROUP BY rollup(year, quarter, month).
According to techniques that employ sort-based algorithms, execution of this query sorts data produced by the FROM and WHERE clauses on three columns (year, quarter, and month) specified in the GROUP BY clause, in order to logically group the records for efficient aggregation on the column (sales) specified in the aggregate function (SUM). The rollup operator aggregates data across levels specified as the keys (or columns) of the rollup operator, specified in the GROUP BY line. For example, xe2x80x9cGROUP BY rollup(year, quarter, month)xe2x80x9d produces aggregated results on the following groups:
(year, quarter, month);
(year, quarter);
(year); and
( ).
Note that a rollup on n columns produces n+1 groups. The grand total (referred to as the highest or coarsest) group is the ( ) grouping, and the base (referred to as the lowest or finest) group is the (year, quarter, month) grouping.
Extensions to the GROUP BY clause have been developed to provide powerful aggregation commands that leverage the processing power of database servers. One such extension included in the Oracle 9i database, available from Oracle Corporation, is the grouping sets extension (also referred to herein as an operator or specification). Grouping sets operators perform multiple independent groupings of data records, giving a user the capability to specify exactly the groupings of interest in a SQL GROUP BY clause. Grouping sets operators produce a single result set which is equivalent to a UNION ALL of differently grouped rows, but at a significantly reduced computational cost. Furthermore, the grouping sets operator is more efficient than other grouping operators (e.g., cube or rollup) because it specifies computation of only the groups of interest, thus avoiding unnecessary computation.
The following example query includes a grouping sets operator.
SELECT year, quarter, month SUM(sales)
FROM fact, time
WHERE fact.tkey=time.tkey
GROUP BY GROUPING SETS ((year, quarter, month), (year, month), (year, quarter)).
Evaluation of this query generates results with data aggregated over the three groupings specified by the grouping sets operator. In general, execution of the grouping sets operator aggregates data across levels specified as the arguments (or columns) of the grouping sets specification in the GROUP BY line. Thus, the exact grouping sets specified are generated. In contrast, a cube would produce all of the eight possible groupings of columns (year, quarter, month), of which five are not of interest.
According to one technique that employs sort-based algorithms, execution of this query requires multiple joins of base tables specified in the FROM clause, and multiple sorts and aggregations according to the aggregate function (SUM) and the grouping sets operator. Thus, join, sort and aggregation are required for each grouping specified by the grouping sets operator. For example, three subqueries could be written, combined by UNION ALL commands, to produce the same result as the example query, but at much higher computational cost due to the three scans, joins and sorts required on the base tables (i.e., one per grouping set) in the FROM clause.
Even though using grouping sets is inherently more computationally efficient than using other grouping operators (for example, a cube) due to the specification and computation of only groupings of interest, implementation of grouping sets can additionally benefit from more efficient computational techniques.
Based on the foregoing, it is clearly desirable to provide a scalable mechanism for efficiently computing database queries that include a grouping sets extension to a GROUP BY clause.
Methods are provided for efficiently evaluating database queries that include grouping sets extension to a GROUP BY clause.
According to one embodiment, the grouping sets operator is reduced to one or more combinations of groupings such that each combination can be computed by a simple GROUP BY clause, with or without a rollup operator. Each GROUP BY clause, with or without a rollup operator, is an inherently efficient computation that can be computed with a single data sort. Computational costs are further minimized by computing a GROUP BY clause, with or without a rollup operator, from records generated from a previously computed grouping. Therefore, temporary tables are used to store records that are generated from one grouping, for access in computing another grouping. An optimization algorithm is described which determines the respective combinations to which the grouping sets operator is reduced and the respective GROUP BY clauses, with the corresponding rollup operator, if any, to minimize computational costs.
According to one embodiment, an intermediate grouping, that is, a grouping that is not specified by the query, is included in a query evaluation plan in some instances. Intermediate groupings are used to compute specified groupings in the grouping sets specification, when doing so reduces computational costs. An intermediate grouping algorithm is described that determines whether costs can be saved by using intermediate groupings. This algorithm is optional and can be applied after the optimization algorithm has been applied on a grouping set specification.
Various implementations of the techniques described are embodied in methods, apparatus, and in computer-readable media.