1. Field of the Invention
This invention relates generally to query processing in relational database management systems. More particularly, the present invention relates to processing aggregation queries by stacking GROUP BY expressions to produce a query graph model of an SQL query that includes either multiple grouping sets, concatenations of ROLLUPs, or CUBE. The number of GROUP BY expressions in the query graph model is reduced by taking advantage of a previously unknown intersection property of GROUP BY operations.
2. Description of the Related Art
A relational database management system (RDBMS) is a computer database management system that uses relational techniques for storing and retrieving data. Relational databases are computerized information storage and retrieval systems in which data in the form of tables (formally denominated "relations") are typically stored for use on disk drives or similar mass data storage devices. A "table" includes a set of rows (formally denominated "tuples" or "records") spanning several columns (denominated "attributes"). Reference is made to C. J. Date, An Introduction to Database Systems, 6th edition, Addison-Wesley Publishing Co. Reading, Mass. (1994) for an comprehensive general treatment of the relational database art.
An RDBMS is structured to accept commands to store, retrieve and delete data using high-level query languages such as the Structured Query Language (SQL). The term "query" denominates a set of commands for retrieving data from a stored database. The SQL standard has been promulgated by the International Standards Association since 1986. Reference is made to the SQL-92 standard "Database Language SQL" published by the ANSI as ANSI X3.135-1992 and published by the ISO as ISO/IEC 9075:1992 for the official specification of the 1992 version of the Structured Query Language. Reference is also made to James R. Groff et al. (LAN Times Guide to SQL, Osborne McGraw-Hill, Berkeley, Calif., 1994) for a lucid treatment of SQL-92, and to Don Chamberlin, Using the New DB2, Morgan Kaufman Publishers, Inc., 1996, for a discussion of that language.
Aggregate functions have been developed and implemented for SQL that produce a single value (for example, SUM) from a group of values in the SELECT (or HAVING) clause of an SQL query. The GROUP BY operator groups aggregate values into zero (single-row) or single-dimensioned (multi-row) aggregates. The GROUPING SET operator extends the GROUP BY operation to simultaneously specify the computation of multiple GROUP BYs in a single GROUP BY operation. When the GROUPING SET operator is used, the ALL value in a column denotes that the particular column is collapsed in the aggregation. The ROLLUP and CUBE operators are shorthand for GROUPING SET that represent common sets of GROUP BY operations that are required for common queries for on-line analytical processing (OLAP). Concatenation of ROLLUPs in a GROUP BY operation is shorthand for multi-dimensional analysis, common in OLAP applications.
A common use for an RDBMS is online transaction processing (OLTP). OLTP is characterized by many users creating, updating and retrieving individual database records. OLTP queries tend to be based on the retrieval of individual records.
In contrast to OLTP, the online analytical processing (OLAP) of databases is characterized by a few users running large and often complex aggregation queries, for instance to compute complex analyses over several axes of an entire database. The OLAP Grouping extensions were added to DB2 V5, inter alia to enhance that language's query aggregation power and to permit the specification of common OLAP computations in the database. These computations include the CUBE extension of GROUP BY, concatenations of ROLL-UP, and multiple grouping sets, all of which require multiple GROUP BY operations. These OLAP grouping extensions will be incorporated in the forthcoming ANSI SQL-3 standard, and reference is made to several papers accepted by ANSI and/or ISO for inclusion in the SQL-3 standard including: "Super Sets (The Cube and Beyond)", Frank Pellow, ANSI Z3H2-96-205(R3)/MAD-032(R1); "Super Sets--Concatenation and Grouping Sets:", Frank Pellow, ANSI X3H2-096-585; and the ISO-ANSI Working Draft, Database Language SQL-Part 2: Foundation (SQL/Foundation, Apr. 13, 1997. DBL:LGW-008 and X3H2-97-030. Further, reference in made to the SQL-92 standard "Database Language SQL" published by the ANSI as ANSI X3.135-1992 and published by the ISO as ISO/IEC 9075: 1992 for the official specification of the 1992 version of the Structured Query Language. Each of these authorities is herewith incorporated by reference.
The OLAP grouping extensions to DB2, incorporating the previously discussed references, implementing at least the ability to group by CUBE, ROLLUP, and multiple grouping sets, are referred to hereinafter as "OLAP grouping extensions". These functions are generalized by the following syntax:
______________________________________ &lt;group by clause&gt;::= GROUP BY &lt;grouping.sub.-- expr&gt;,... //a list &lt;grouping.sub.-- expr&gt;::= &lt;grouping.sub.-- element&gt; .vertline. (&lt;grouping.sub.-- element&gt;,...) .vertline.() &lt;grouping.sub.-- element&gt;::= &lt;scalar.sub.-- element&gt; .vertline. ROLLUP (&lt;grouping-element&gt;,...) .vertline. CUBE (&lt;grouping-element&gt;,...) .vertline. GROUPING SETS (&lt;grouping-element&gt;,...) ______________________________________
GROUPING SETS are used in conjunction with aggregation functions (MAX, MIN, SUM, AVG, COUNT, etc.) to enable multiple "groupings" in a single computational pass and are further restrictable by the HAVING clause.
Consider now, a massive database represented by a table T (entitled "SALES") where the automobile sales of an organization are entered on a daily basis, with attributes of day, week, month, year, salesperson, manager, city, region, and units sold. The following query, implementing GROUPING SETS, returns the total sales of the organization during each month of 1996, per sales region and per sales manager:
SELECTMONTH(SALES.sub.-- DATE), REGION, SALES.sub.-- MGR, SUM(SALES) PA1 FROM SALES PA1 WHERE YEAR(SALES.sub.-- DATE)=1996 PA1 GROUP BY GROUPING SETS PA1 SELECT MONTH(SALES.sub.-- DATE), REGION, SALES.sub.-- MGR, SUM(SALES) PA1 SELECT MONTHl(SALES.sub.-- DATE), REGION, SALES.sub.-- MGR, SUM(SALES) PA1 GROUP BY ROLLUP (a1,a2,a3,a4), ROLLUP(b1,b2,b3,b4), ROLLUP (c1,c2,c3,c4), PA1 GROUP BY CUBE(a1,a2,a3,a4,b1,b2,b3,b4,c1,c2,c3,c4,d1,d2,d3,d4) PA1 "Algebraic aggregates are more difficult to compute than distributive aggregates . . . An algebraic aggregate must maintain (M-tuple) for each element of the cube . . . When the core GROUP BY operation completes, the CUBE algorithm passes the set of handles to each N-1 dimensional super-aggregate. When this is done the handles of the super-aggregates are passed to the super-super-aggregates, and so on until the (ALL, ALL, . . . ,ALL) aggregate has been computed." PA1 SELECT a, b, SUM(c) PA1 FROM T PA1 GROUP BY a, b PA1 GROUP BY(GROUP BY(T, element-list1), element-list2)==GROUP BY(T, element-list1intersect element-list2). PA1 GROUPINGSETS ((MONTH(SALES.sub.-- DATE), SALES.sub.-- MGR), (MONTH(SALES.sub.-- DATE)))
((MONTH(SALES.sub.-- DATE),REGION), PA2 (MONTH(SALES.sub.-- DATE),SALES.sub.-- MGR)) PA2 FROM SALES PA2 WHERE YEAR(SALES.sub.-- DATE)=1996 PA2 GROUP BY ROLLUP (MONTH(SALES.sub.-- DATE),REGION, SALES.sub.-- MGR) PA2 FROM SALES PA2 WHERE YEAR(SALES.sub.-- DATE)=1996 PA2 GROUP BY CUBE (MONTH(SALES.sub.-- DATE),REGION, SALES.sub.-- MGR) PA2 ROLLUP(d1,d2,d3,d4)
Results for April and May are shown in FIG. 1.
The use of ROLLUPs extends grouping semantics to produce "sub-total" rows, thereby producing "regular" grouped rows. Moreover, the same groupings are re-applied down to the grand total. For example, the following query, implementing ROLLUP, returns the total sales of the organization during each month of 1996, per sales region, per sales manager pair and per sales region, with subtotals for each month, and concluding with the grand total:
Results for April and May are shown in FIG. 2.
Finally, the use of the CUBE operator further extends grouping semantics to produce multi-dimensional grouping and "sub-total" rows. Accordingly, it will be appreciated that a CUBE operator produces results that are a superset of a ROLLUP operator with the same arguments. Furthermore, CUBE is equivalent to a concatenation of ROLLUPs of each argument of the CUBE. That is, CUBE of (a, b, c, . . . ) is equivalent to ROLLUP(a) ROLLUP(b) ROLLUP(c) . . . . The CUBE operator produces "regular" grouped rows, and the same groupings are re-applied down to a grand total. Finally, it produces additional groupings on all variants of the CUBE clause.
By way of example, the following query, implementing CUBE, returns the total sales of the organization during each month of 1996, with subtotals and overall totals for each month, each region, and each sales manager, and concluding with the grand total of sales for the year 1996.
Results for April and May are shown in FIG. 3.
From each of the foregoing examples, it becomes apparent that the specification of the GROUP BY clause used in OLAP often expands into many GROUPING SETS. For example, the CUBE alone will result in 2 .sup.n GROUPING SETS. Any concatenation of p ROLLUPs results in ((n1+1).times.(n2+1).times.. . . (np+1)) GROUPING SETS, where ni is the number of elements of the i-th ROLLUP. The addition of one to each ni represents the inclusion of the grand total row in which ALL appears in every field. Each GROUPING SET requires the execution of a GROUP BY operation, in the prior art.
For example:
shows a four dimensional OLAP model, with dimensions a,b,c,d, each having a four level hierarchy. This query expands to 5.times.5.times.5.times.5=625 GROUPING SETS, which requires 625 GROUP BYs to be performed on the data.
Further,
expands to 2**16=32,768 GROUP BYs to be formed on the data.
From the foregoing, it becomes obvious that the OLAP extension to DB2 may require significant computational assets to implement. This requirement of course increases with the number of users, with the size of the database, as well as with the complexity and size of the aggregation queries submitted for computation. In order to provide the advantages of OLAP technology to database users at acceptable levels of computer overhead, efforts have been undertaken by a number of workers to make OLAP implementation more efficient.
In general, these efforts have focused on ways to reduce the number of data scans to compute the group bys. One such effort was reported in "Data Cube: A Relational Aggregation Operator Generalizing GROUP BY, Cross-Tab, and Sub-Totals", a Microsoft Technical Report by Gray et al. An extended abstract of this document was published under the same name in ICDE 1996, pg. 152-159, 12th International Conference on Data Engineering, Feb. 26-Mar. 1, 1996, New Orleans, ISBM 0-8186-7240-4, WEEE Computer Society.
According to Gray et al., the aggregate functions enabled by the OLAP extensions (termed in that work "super-aggregates") can be classified into three categories: Distributive, Algebraic and Holistic. Gray et. al teach ". . . no more efficient way of computing super-aggregates of holistic functions than the 2.sub.N algorithm using the standard GROUP BY techniques." The reference declines to comment on cubes of holistic functions.
Regarding cubes of distributive functions, Gray et. al. maintain that they are " . . . relative easy to compute . . . . Thinking in terms of the cross tab, one has a choice of computing the result by aggregating the lower row, or aggregating the right column (aggregate (ALL,*) or (*, ALL)). Either approach will give the same answer. The algorithm will be most efficient if it aggregates the smaller of the two (pick the * with the smallest C.sub.i). In this way, the super-aggregates can be computed dropping one dimension at a time."
With respect to algebraic aggregation, Gray et. al. go on to state that:
While these efforts have met with some success in limited implementations having relatively small databases accessed and queried by a small number of users, the full implementation of GROUPING SETS, ROLLUP and CUBE with multidimensional hierarchy support in large multi-user systems receiving complex queries (such as the DB2 Universal Database system available from International Business Machines Corporation) has generally not been effected prior to the present invention. This is due to the computational overhead required by the query strategies previously propounded for these aggregation functions, and to the novelty of these operations in the context of SQL.
This computational overhead is occasioned by the fact that the work of others has heretofore focused on ways to reduce the number of data scans to compute the GROUP BYs. Since this methodology has, as previously discussed, failed to enable full implementation of the OLAP GROUP BY clauses, it becomes obvious that some further means must be found to reduce the computational cost, or overhead, of the calculations necessary to support this technology.