The present invention relates to data cubes and methods of assembling data cubes from relational databases.
Data cubes represent an efficient data structure for analyzing large amounts of data. Often, practical advantages are obtained from analyzing raw data by a variety of data attributes or combinations of attributes. For example, a retail business may record data for every sales transaction the business completes. Significant advantages may be obtained by analyzing the recorded transactions to determine sales levels over time, by product, by geographic region or by a host of other attributes. Additionally, data may be analyzed by a combination of attributes instead of a single attribute. Data cubes are suited to such analysis.
The data cube was first introduced in Gray, et al., Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, Proceedings of the IEEE International Conference on Data Engineering, pp. 152-159 (1996) ("Gray"). The data cube may be envisioned as an n dimensional core cuboid together with the n-1 and lower dimensional aggregates derived from it. Gray provides an example of a three dimensional data cube, reproduced in FIG. 1, tailored to a relational database containing transactions for sales of automobiles. The data cube includes a three dimensional core cuboid, three two dimensional cuboids, three one dimensional cuboids, and a zero dimensional cuboid. In the three dimensional cuboid, auto sales are grouped by make, by color and by year. Each entry ("tuple") within the core cuboid contains a value for total sales for a particular make, color and year. All three attributes, make, color and year, are required to identify a tuple in the core cuboid of FIG. 1(a).
Lower dimensional aggregates reflect information from the database grouped by a fewer number of attributes than the tuples of the core cuboid. For example, a two dimensional data space, shown in FIG. 1(b), may be created to monitor sales by make and year, but across all colors. The two dimensional space is also a cuboid. The information contained within the two dimensional cuboid is derived from the information contained in the three dimensional cuboid. Further, a one dimensional cuboid may be created to monitor sales by year, across all makes and all colors. Again, the information contained within the one dimensional cuboid may be derived from the information contained in an adjacent two dimensional cuboid. Finally, a zero dimensional cuboid (a point cuboid) may be created from any adjacent one dimensional cuboid. In the example provided in Gray, the zero dimensional cuboid totals sales information across all makes, all colors and all years.
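The derivation of lower dimensional cuboids from higher dimensional ones, as described above, may be illustrated by the following sketch. The sales figures and attribute values are hypothetical, and dictionaries stand in for the cuboid data structures:

```python
from collections import defaultdict

# Hypothetical sales records: (make, color, year, units_sold)
records = [
    ("Chevy", "red", 1994, 5),
    ("Chevy", "blue", 1994, 3),
    ("Ford", "red", 1995, 4),
    ("Ford", "blue", 1994, 2),
]

# Core cuboid: total sales grouped by all three attributes.
core = defaultdict(int)
for make, color, year, units in records:
    core[(make, color, year)] += units

# Two dimensional cuboid (make, year): sales across all colors,
# derived from the core cuboid rather than from the raw records.
by_make_year = defaultdict(int)
for (make, color, year), units in core.items():
    by_make_year[(make, year)] += units

# One dimensional cuboid (year): derived from the adjacent 2-D cuboid.
by_year = defaultdict(int)
for (make, year), units in by_make_year.items():
    by_year[year] += units

# Zero dimensional (point) cuboid: grand total across all attributes.
total = sum(by_year.values())
```

Each cuboid is computed from an adjacent higher dimensional cuboid, never by re-reading the raw records, which is the property the lattice techniques discussed below exploit.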
Virtually any type of data analysis may be performed to construct a data cube, not merely totaling. Accordingly, data cubes promise to be a very powerful means of analyzing large relational databases.
Costs associated with building data cubes are substantial. An n dimensional data cube possesses 2^n cuboids. Each cuboid may contain a number of tuples equal to the product of the numbers of distinct values taken by its grouping attributes within the database. For example, if a business analyzed 1997 sales by state and by week, the resulting two dimensional data cube may possess as many as 2600 (50*52) tuples. It is easy to imagine additional attributes (product, payment type, etc.) that would increase the volume of the data cube geometrically. Thus, efficiencies in generating data cubes must be realized.
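The size figures above may be checked with a short calculation for the state/week example:

```python
from math import prod

# An n dimensional data cube comprises 2^n cuboids, one per subset of
# the n attributes. For the state/week example, n = 2.
cardinalities = {"state": 50, "week": 52}
n_cuboids = 2 ** len(cardinalities)             # 2^2 = 4 cuboids
max_core_tuples = prod(cardinalities.values())  # at most 50 * 52 = 2600 tuples
```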
Several techniques have been proposed to build data cubes. Many are predicated on search lattices which describe relationships among the cuboids of a data cube. An exemplary lattice is shown in FIG. 2 for a four dimensional data cube having attributes A, B, C, and D. Node (A,B,C,D) 100 references tuples of the core cuboid. Because four attributes are required to address tuples of that node, the (A,B,C,D) node 100 has the finest granularity. Conversely, node φ 250 is said to have the coarsest granularity because its cuboid is indexed by the least number of attributes (zero attributes). Intermediate nodes reference cuboid tuples that are indexed by intermediate numbers of attributes.
Interconnecting lines shown in FIG. 2 describe relations among the various cuboids. Lower dimensional cuboids may be calculated from certain higher dimensional cuboids. Interconnecting lines identify those cuboids that may be calculated from others. For example, the (A,B,C) 110 and (A,B,D) 120 cuboids each may be calculated from the cuboid (A,B,C,D) 100. Thus, (A,B,C,D) 100 is a "parent" of (A,B,C) 110 and (A,B,D) 120; they are the parent's "children." Cuboids (A,B) 150 and (A,D) 170, in turn, are children of cuboid (A,B,D) 120. However, cuboid (A,C) 160 is not a child of (A,B,D) 120. Tuples of cuboid (A,B,D) 120 are not grouped by attribute C and, therefore, cannot provide information necessary for tuples grouped by attributes A and C.
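The parent-child relation of the lattice may be expressed concisely: each cuboid corresponds to a subset of the attributes, and a cuboid is a child of a parent one dimension higher whose attribute set contains all of the child's attributes. A minimal sketch (the attribute names follow FIG. 2):

```python
from itertools import combinations

attributes = ("A", "B", "C", "D")

# Every cuboid in the lattice corresponds to a subset of the attributes;
# the full set is the core cuboid, the empty set is the point cuboid.
cuboids = [frozenset(subset)
           for k in range(len(attributes) + 1)
           for subset in combinations(attributes, k)]

def is_child(child, parent):
    # A child groups by exactly one fewer attribute than its parent,
    # and every one of its attributes must appear in the parent.
    return child < parent and len(parent) - len(child) == 1
```

Under this test, (A,B) and (A,D) are children of (A,B,D), while (A,C) is not, because (A,B,D) is not grouped by attribute C.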
Data cubes may be calculated by a variety of means. A first technique is proposed by Gray. First, Gray passes through the relational database and calculates tuples for the core cuboid, the finest granularity cuboid. For each record in the database R, an array cell indexed by the values of the "group by" attributes is updated based on data of the new record. Second, Gray calculates the lower dimensional cuboids. Gray achieves computational savings by calculating the tuples of each cuboid from the smallest available parent cuboid. For example, while (B,C) may be computed from either (A,B,C) or (B,C,D), Gray selects the smaller of these parents based on the number of tuples in each.
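The two passes of Gray's technique may be sketched as follows. This is an illustrative sketch only: dictionaries stand in for the in-memory arrays Gray describes, and the records and attribute names are hypothetical:

```python
from collections import defaultdict
from itertools import combinations

attributes = ("make", "color", "year")
records = [
    {"make": "Chevy", "color": "red", "year": 1994, "sales": 5},
    {"make": "Ford", "color": "red", "year": 1994, "sales": 4},
    {"make": "Ford", "color": "blue", "year": 1995, "sales": 2},
]

# Pass 1: compute the core (finest granularity) cuboid from the raw data.
core = defaultdict(int)
for record in records:
    core[tuple(record[a] for a in attributes)] += record["sales"]

# Pass 2: compute each lower dimensional cuboid from its smallest
# available parent, i.e., the parent with the fewest tuples.
cuboids = {attributes: dict(core)}
for k in range(len(attributes) - 1, -1, -1):
    for group in combinations(attributes, k):
        parents = [p for p in cuboids
                   if len(p) == k + 1 and set(group) < set(p)]
        parent = min(parents, key=lambda p: len(cuboids[p]))
        positions = [parent.index(a) for a in group]
        agg = defaultdict(int)
        for key, total in cuboids[parent].items():
            agg[tuple(key[i] for i in positions)] += total
        cuboids[group] = dict(agg)
```

The `min` over `parents` realizes the smallest-parent selection: for example, the one dimensional (year) cuboid is computed from whichever of (make, year) and (color, year) holds fewer tuples.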
Gray's technique suffers performance disadvantages because the core cuboid must fit entirely in main memory; consequently, Gray's technique often is inapplicable to databases which contain sparse data. Sparseness occurs when only a small percentage of the space of possible tuples is actually present in a data cube. Real world data is frequently very large and very sparse. Thus, Gray's technique is of limited value.
A second technique, called "Pipesort," is found in Agarwal, et al., On the Computation of Multidimensional Aggregates, Proceedings of the 22nd VLDB Conference (1996). Pipesort follows paths in the search lattice of the data cube. In the example of FIG. 2, the raw data may be sorted first in a first attribute order, such as A-B-C-D. Having sorted the data, cuboids sharing some part of the sort order may be calculated. These cuboids are said to have a "prefix" common to the sort order. In this example, (A,B,C,D), (A,B,C), (A,B), A and φ are calculated. The sort order chosen is based on estimates of computational complexity and an attempt to incorporate the smallest-parent objective of Gray. Once the first path is computed, Pipesort sorts the highest dimensional cuboid (A,B,C,D) based on a second sort order to pursue a second path. Pipesort may perform as few as C(n, ⌊n/2⌋) sorts (the binomial coefficient "n choose ⌊n/2⌋"), where n is the number of attributes of the data cube, but the number of sorts performed in practice is often much higher. Pipesort does not guarantee that it follows a minimum number of paths through the lattice.
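A single Pipesort path may be sketched as follows, with hypothetical core tuples for the attributes of FIG. 2. After one sort in the order A-B-C-D, every cuboid whose attributes form a prefix of that order is computed in a single scan:

```python
# Hypothetical core tuples for attributes (A, B, C, D) plus a measure.
rows = [
    ("a1", "b1", "c1", "d1", 3),
    ("a1", "b1", "c1", "d2", 2),
    ("a1", "b2", "c1", "d1", 4),
    ("a2", "b1", "c2", "d1", 1),
]

# One Pipesort path: sort once in the attribute order A-B-C-D.
rows.sort(key=lambda r: r[:4])

# Compute every prefix cuboid -- (A,B,C,D), (A,B,C), (A,B), (A), and
# the point cuboid -- in one scan over the sorted data.
prefix_cuboids = {k: {} for k in range(5)}  # k = number of leading attributes
for row in rows:
    for k in range(5):
        key = row[:k]
        prefix_cuboids[k][key] = prefix_cuboids[k].get(key, 0) + row[4]
```

The dictionaries here are a simplification: because the rows are sorted, all tuples sharing a given prefix are contiguous, so in a true pipeline each aggregate can be emitted as soon as its prefix changes, without holding a whole cuboid in memory.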
Pipesort incurs high computational expenses. Often, the cuboid to be sorted does not fit entirely into a computer's on-board memory. By sorting the cuboid using the slower external memory, Pipesort incurs high input/output costs. In an n dimensional data cube, the input/output costs are exponential in n (i.e., on the order of 2^n).
There is a need in the art for a method of computing data cubes efficiently and with low expense. There is a need in the art for a method of computing data cubes that minimizes read/write operations. Further, there is a need in the art for a method of computing data cubes that traverses a minimal number of sort paths through the lattice of the cube.