On-Line Analytical Processing (OLAP) has recently emerged as an important business application of aggregate databases built from data warehouses. The output from the OLAP application may be used for synthesizing new data or extracting useful patterns from the database, as in data mining applications. An increasingly popular data model for OLAP applications is the multi-dimensional database (MDDB), also known as data cube. The data cube model is described, for example, by J. Gray et al. in "Data Cube: A Relational Aggregate Operator Generalizing Group-bys, Cross-tabs and Sub-totals," Proc. of the 12th Int'l Conf. On Data Engineering, pp. 152-159, 1996.
To build an MDDB from a data warehouse, certain attributes (typically from 5 to 10) of the data records are selected. Each data record of the MDDB contains a value for each of the data attributes. One of these attributes is typically chosen as a metric of interest and referred to as a measure attribute. The remaining attributes, say d of them, are referred to as functional attributes which correspond to the dimensions of the data cube. The measure attribute values of all the records having the same combination of functional attributes are combined (e.g., summed up) into a single value. Thus, an MDDB can be viewed as a d-dimensional array, indexed by the values of the d functional attributes, whose cells contain the values of the measure attribute for the corresponding combination of the functional attribute values.
For example, a data cube representing data for an insurance company may have four dimensions corresponding to the functional attributes age, year, state, and insurance type, respectively. Assuming that the domain of age is from 1 to 100, of year is from 1985 to 1996, of state is the 50 states in U.S., and of insurance type is the set {home, auto, health} representing the type of insurance purchased by a customer. The data cube thus has 100.times.12.times.50.times.3 cells, with each cell containing the total revenue (the measure attribute in this case) for the corresponding combination of age, year, state, and type attribute values.
Furthermore, the above data cube may be extended by augmenting the domain of each functional attribute with an additional value, denoted by "all", to store the values of the measure attribute in all of the cells along that functional attribute. In the above example, the data cube will be extended to include 101.times.13.times.51.times.4 cells. With this extension, any query of the form (age, year, state, type), where the queried value for each attribute is either a single value in its domain or all, can be answered by accessing a single cell in the extended data cube. For instance, the total amount of revenue for the auto insurance in the whole U.S. in 1995 is a query specified by (all, 1995, all, auto), which can be answered in one cell access. Such queries are referred to as singleton queries.
An important class of queries in multi-dimensional databases involves the SUM aggregation of a group of cells selected from the data cube, where the values of interest for some functional attributes are specified as subsets, not necessarily in contiguous ranges, in their domains. These are referred to as partial-sum queries, and are frequently applied to categorical attributes due to their lack of natural semantics in ordering. For instance, the 50 states in the United States may be ordered alphabetically. However, a query may select the top 10 states with the largest populations, all states that are in the Pacific Standard Time zone, or all states that have one or more NFL football teams. In all these cases, the selected values of the functional attribute state are not contiguous because the states of interest are not necessarily in an alphabetical order.
The partial-sum queries may also apply to numeric attributes. For instance, if the functional attribute year is changed to year-quarter or year-month, then a query may select all first quarters from the past 10 years, or may select all December months from the past 10 years. In both cases, the selected values of the functional attributes year-quarter or year-month are not contiguous.
In an interactive exploration of the data cube, which is the predominant OLAP environment, it is essential that the query system produces a result with an almost instantaneous response time. It is thus desirable to have a method that can generate a partial-sum from the data cube quickly, regardless of the size of the region of the data cube being queried upon and without naively summing the necessary cells one-by-one.
In the assignee's pending patent application Ser. No. 08/764,564, a method is described for generating a range-sum query in a data cube, where the query ranges are contiguous. The method comprises the steps of selecting a subset of the data cube dimensions, computing a set of prefix-sums along the selected dimensions using the aggregate values, and generating a range-sum based on the computed prefix-sums. However, such a method cannot be applied to partial-sums because in many partial-sum query instances, the selection over the domain of a functional attribute is not necessarily contiguous.
In another pending application of the assignee, Ser. No. 08/808,046, a method is described for performing a range max/min query in a database represented as a multi-dimensional data cube. The method comprises the steps of: (i) partitioning the data cube into a multi-level structure of d-dimensional blocks, (ii) representing the blocks as a multi-level tree structure where the tree nodes correspond to the blocks, (iii) for each block, determining the index to the cell with the maximum or minimum value, (iv) storing the cell indexes in the corresponding nodes, and (v) generating the range max/min result from the values of the cells selected from those in the query region. However, the method described there requires that the selection over a functional attribute be contiguous and that the aggregation operator is MAX or MIN, rather than SUM or other similar binary operators. Therefore, this method cannot be used for a partial-sum query.
In the paper "Implementing Data Cubes Efficiently," Proc. of the ACM SIGMOD Conf. on Management of Data, June 1996, V. Harinarayan et al. describe techniques for efficiently constructing a data cube by determining what subset of the data cube to pre-compute in answering a singleton query. Since many data cells must be examined in a typical partial-sum query, as opposed to a single cell access in a singleton query, the methods described by Harinarayan et al. are not suitable for answering a partial-sum query.
Still other methods for processing multi-dimensional aggregates are described by J. Srivastava et al. ("TBSAM: An Access Method for Efficient Processing of Statistical Queries," IEEE Transaction on Knowledge and Data Engineering, 1(4), 1989). These methods, however, are designed for indexing the pre-computed aggregates or for incrementally maintaining them, but not for efficiently performing partial-sum queries over multiple dimensions of the data cube.
Therefore, there remains a need for an efficient method for performing partial-sum queries in a database in which data is represented as a multi-dimensional data cube. The method allows queries with a selection of cells, not necessarily contiguous, over multiple attribute dimensions to be processed quickly, and takes advantage of an inverse aggregate operator in processing the queries.