Decision support is rapidly becoming a key technology for business success. Decision support allows a business to deduce useful informnation, usually referred to as a data warehouse, from an operational database. While the operational database maintains state information, the data warehouse typically maintains historical information. Users of data warehouses are generally more interested in identifying trends rather than looking at individual records in isolation. Decision support queries are thus more computationally intensive and make heavy use of aggregation. This can result in long completion delays and unacceptable productivity constraints.
Some known techniques used to reduce delays are to pre-compute frequently asked queries, or to use sampling techniques, or both. In particular, applying online analytical processing (OLAP) techniques such as data cubes on very large relational databases or data warehouses for decision support has received increasing attention recently (see e.g., Jim Gray, Adam Bosworth, Andrew Layman, and Hamid Pirahesh, "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals", International Conference on Data Engineering, 1996, New Orleans, pp. 152-160) ("Gray"). Here, users typically view the historical data from data warehouses as multidimensional data cubes. Each cell (or lattice point) in the cube is a view consisting of an aggregation of interests, such as total sales.
Commonly encountered aggregation queries for data warehouse applications include those already defined in the standard Structured Query Language (SQL), such as the following five aggregate functions:
AVG(); PA1 COUNT(); PA1 SUM(); PA1 MIN(); PA1 MAX(); PA1 rank(expression): returns the expression's rank in the set of all values of this domain of the table. If there are N values in the column, the rank is N if the expression is the highest value, and the rank is 1 if the expression is the lowest value; PA1 N.sub.-- tile(expression, n): the range of the expression is computed and divided into n intervals of approximately equal population. The function returns the number of intervals holding the value of the expression; and PA1 Ratio.sub.-- to.sub.-- Total(expression): sums the expression and then divides the expression by the total sum. PA1 Cumulative(expression): cumulative sums of values in an ordered list; PA1 Running.sub.-- sum(expression,n): sums the most recent n values in an ordered list; and PA1 Running.sub.-- average(expression,n): averages the most recent n values in an ordered list. PA1 minimizing data movement and consequent processing costs by computing aggregates at the lowest possible levels; PA1 using arrays or hash tables to organize aggregation in memory; PA1 if the aggregation values are large strings, using a hashed symbol table to map each string to an integer; PA1 if the number of aggregates is too large to fit in memory, using sorting or hybrid hashing to organize the data by value and then aggregate with a sequential scan of the sorted data; and PA1 if the source data spans many disks or nodes, using parallelism to aggregate each partition and then coalesce these aggregates.
Red Brick Systems added three more aggregate functions:
Red Brick Systems also offers three cumulative aggregate functions:
In Gray, the aggregation concept is generalized to N-dimensional data cubes and the SQL query syntax is extended to support histograms, decorations, and CUBE operators. Creating the cubes requires generating the power set (set of all subsets) of the aggregation columns. If there are N attributes in the select list, there will be 2 N-1 super-aggregate values.
As discussed in Gray, the main techniques for computing the CUBE include:
These techniques are generally useful, but lack the flexibility to deliver results in a progressive fashion. In other words, these techniques cannot stage the results to users, e.g., to provide approximate results at first and more accurate results later, as needed.
As discussed earlier, it is essential for data warehouse queries to be performed efficiently. Several approximation techniques can be used to either sample the original database records or the data cube to compute aggregation. Both random sampling and uniform sampling are the most obvious approaches. Both random sampling and uniform sampling techniques are well known in the art. Its application to database sampling can be found, for example, in "Random Sampling from Databases: A Survey" by F. Olken and D. Rotem (Statistical Computing, vol. 5, no. 1, March 1995, pp. 25-92). Similarly, uniform sampling techniques can be applied to a database to reduce the amount of information that must be processed. However, neither uniform nor random sampling allows the generation of a pyramid representation of the data cube, as the sampling rate is not linearly correlated with the accuracy. As a result, it is difficult to trade accuracy for speed when responding to queries.
Thus, there is a need for an improved method and system for generating multiple representations of a data cube for OLAP Applications. The system and method should provide the flexibility to deliver results in a progressive fashion and/or enable the trade of accuracy for speed when responding to queries. The present invention addresses such a need.