Relational databases have become the dominant database technology for businesses to keep track of their sales, transactions, and other affairs. Designed for efficient selection, storage, and retrieval of data, relational databases are used to house large quantities of detailed data in data warehouses. The information housed in data warehouses can be analyzed to yield critical information about a business, such as sales trends and product line profitability, and can provide a critical edge in an increasingly competitive marketplace.
The data processing required to answer analytical questions is quite different from the data processing required to answer transactional questions. For example, while a transactional query might ask, “When did order 84305 ship?”, an analytical query might ask, “How do sales in the Southwestern region for this month compare with plan or with sales a year ago?” The first question involves the selection and retrieval of data usually contained in a single row, but the second question involves inter-row calculations, time series analysis, and access to aggregated historical and current data. This data processing is known as online analytical processing (OLAP). An OLAP application is a computer program designed to access the company's raw data, aggregate the data into various summaries, and present the summaries to the company's data analysts. Various kinds of aggregations may be performed, and the most common include summation (e.g. totaling sales) and finding maximums and minimums.
Aggregation is a cornerstone operation frequently done in virtually all large data warehouses. When data is aggregated on many attributes or dimensions, however, the amount of summary data becomes much larger than the amount of detail data that was input. However, conventional attempts to perform aggregation require processor time that is proportional to the amount of summary data. Since the amount of summary data outweighs the input data, aggregation can become very expensive.
Sparsity is one way to characterize input detail data. Although the detail data can be characterized by values for a number of different attributes or dimensions, there may be many combinations of values that are not available (NA) in the detail data. Sparsity is typically measured in terms of the number of logical cells that have a non-NA value relative to the total size of a wireframe, which can be conceived as a space that encloses the possible attribute or dimensional values that the detail data can take. For example, if the wireframe has three attributes or dimensions with one hundred possible values in each attribute or dimension, there are one million possible cells. If a variable dimensioned by these three dimensions has 10,000 non-NA values, then the variable is 1% dense. If there were 100 such non-NA cells, then it would be 0.01% dense. If there were only one non-NA cell, it would be 0.0001% (or 10−4%) dense.
In traditional multidimensional database systems, data is stored as a linearized multi-dimensional array. The dimensions operate as the array indexes. In this format, any cell can be retrieved very quickly by simply computing the Cartesian product of the dimension offsets that correspond with the desired values and looking that value up in the linearized array. However, when the data is sparse, many of the cells in the array are NA (i.e., null). In these cases, long stretches of the stored array are empty. Thus, the space occupied by the data is dwarfed by the space occupied by null values.
Relational databases do not generally store data multi-dimensionally. Instead, data is stored in rows, with the value of each column explicitly included. Thus, if a fact table contains a list of dimensions and a measure, each row contains data values for each dimension and the measure. By contrast, only the measure is stored in the multidimensional data model, because the dimension values are implicit in the location that data is written to. Since relational databases do not store anything for dimension combinations that have no data, they are more efficient than purely multidimensional storage for somewhat sparse data.
When data is at least 20% dense, then purely multidimensional storage is usually the most efficient, as it does not store the dimension values. However, as data density decreases, the number of values stored on a physical page gradually decreases. Similar to the relational model, the multidimensional space can be compressed down to dimension combinations where data exists. This is done by constructing a B+ tree which stores the index values of the dimensions where data can be found. This mechanism is called composite dimensions (composites). Using a composite, several dimensions that are sparse relative to each other can be transformed into a much more dense space. This densification is accomplished by using a B+ tree to map the wireframe positions of the base dimensions to a single integer offset. Composite dimensions may perform well when the number of entries within the B+ tree does not grow large. Once the composite grows too large, there may be a sharp decrease in performance. For example, this decrease may occur once the composite grows beyond the tens of millions of physical entries.
As data becomes more sparse, every dimension may be placed in the composite, thus imposing a limit on the size of any variable within the system. In order to avoid bloating the composite, it may be desirable to include one or more dense dimensions. As variables become more sparse, this desire becomes increasingly impractical. An example of this phenomenon is changing time data from monthly to daily. Typically, monthly data is relatively dense so that the monthly data will often be outside the composite. However, daily data is usually relatively sparse, so it may be desirable to put time in the composite. Historically, composites have worked well with data that is at least 10−5% dense, although this number can vary greatly depending on the number and size of dimensions.
FIG. 1 illustrates a set of three hierarchies used as a working example, in which an “A” dimension 102 might correspond to region, a “B” dimension 104 to time, and a “C” dimension 106 to product classification. In the region hierarchy in the “A” dimension 102, position A11 may correspond to Pittsburgh, position A12 to Philadelphia, and position A13 to Washington, D.C. A21 is a more general level in the hierarchy and corresponds to Pennsylvania, and A31 is an even higher level of generality, e.g. the east cost, that includes Pennsylvania (A21) and Washington D.C. The highest level in this example, A41 would correspond to the USA and would include the east coast (A31) plus California (A22). This example includes skip-level (the hierarchy is missing values at intermediate levels (e.g. the city Washington D.C. at A13 has no state level aggregate), and ragged hierarchies, where the detail exists on different logical levels. An example is a company that tracks sales at the city level on the east coast and at the state level on the west coast. Thus, the hierarchy for the “A” dimension is both skip-level (A13 goes to A31) and ragged (A22 is not at the same level as A11, A12 and A13).
The “B” dimension 104 also has a defined hierarchy, in which the most general level is B31 for sales made at any time. Positions B21 and B22 may correspond to years (e.g. 2004 and 2003, respectively), and positions B11, B12, B13, and B14 to the months of January, February, March, and December, respectively. The hierarchy in the “C” dimension 106 is simpler, in which there are two product models C16 and C17 in two lines of products (C23 and C24, respectively). The C31 position corresponds to a company's products.
Within these hierarchies, familial relationships are implicit in the structure; for example, A11 may be viewed as a child of A21, as the grandchild of A31, and as a great-grandchild of A41. Similarly, A41 may be viewed as the parent of A31 and A22, the grandparent of A21 and A13, and the great-grandparent of A11 and A12. Using graph theory terminology, A11 is connected to A21 via a single edge connection, i.e., a child is connected to its parent in the hierarchy via a single edge connection. Additionally, A41 is at a top level of the A hierarchy 102, and A11, A12, and A13 are all on a same level, which may be called a leaf level. A top-down path from A41 to A12 navigates the nodes A41, A31, A21, and A12, in order from the top of the hierarchy down, following edges connecting the respective nodes. A bottom-up path from A11 to A41 navigates the nodes A11, A21, A31, and A41, in order from the bottom of the hierarchy up.
A roll-up operation on the A hierarchy 102 navigates the nodes in a strict bottom-up direction, following the edge connecting a particular child node to its respective parent node (e.g., from A12 to A21 to A31 to A41). A drill down operation on the hierarchy navigates the nodes in a strict top-down direction, following the edge connecting a particular parent node to one of its respective children nodes (e.g., from A41 to A31 to A13). Without loss of generality, the hierarchy structure may be inverted so that the relationships and operations are described in a reverse manner.
FIG. 2 depicts three input rows 200 for the example. Each row is defined by a “tuple” that includes a position from each dimension. For example, input row <A11 B13 C17> includes a position A11 in the “A” dimension (e.g., Pittsburgh, Pa., of the east coast), position B13 in the “B” dimension (e.g. March 2004), and position C17 to product model C17. The number on the left-hand side of each row represents a measure value can be assigned to the input row. These exemplary measure values are chosen to illustrate which input rows produced which aggregate value, but in practice, measures correspond to information that a business measures, for example, to the amount of sales in thousands of dollars for a particular product in a corresponding city and month.
One exemplary operation involved in aggregating data is “roll up.” For each of the input rows a new row for each logical position is produced to represent data at this location. A new row is stored whether or not it produced a new aggregation value. Thus, the output value is stored even if it is identical to the input. This is done by processing each hierarchy in turn in order to produce the aggregation path.
FIG. 3 depicts the creation of new rows 300 by rolling up the A hierarchy. For this example, no new values are produced. Thus, for the 6 new rows produced, each has the same value as the row from which it was derived. The operation thus consumed time and space but produced no new information.
FIG. 4 depicts a rollup 400 on the B hierarchy. This aggregation is performed on the rows produced by rolling up the A hierarchy as well as the input rows. Again, after producing 17 new rows, only one (A41, B31, C16) has a new value (6). All of the other rows still have the same value as the input row from which they were derived, as only the “6” cell has more than one immediate child in the join hierarchy. All others represent the rollup of one child into its parent.
FIG. 5 depicts a rollup 500 on the C hierarchy. The additional blocks indicate rows that are produced by rolling up the C hierarchy. This rollup produces several new aggregates, as indicated by the nodes labeled with “3” and “7.” However, several times, multiple children were independently aggregated into the value 3 (e.g., <A31, B21, C31>, <A31, B31, C31> and <A41, B31, C31>). In FIG. 6, these nodes (indicated by the dotted box) were all aggregated from the same leaf nodes (1 and 2). However, even though the number is and will be identical, the addition of 1 and 2 is done multiple times, because the immediate children of each “3” node are different.
Thus, the three input rows produced a large number of new rows, but there was relatively little actual aggregation. Most new rows were generated from one input row, so required no addition. Three input rows produced 75 final output rows. In FIG. 6, it can be seen that these 75 rows can be broken into six regions where every row in a region shares the same value. Regions 6, 7, and 3 are the regions produced by aggregation. In region 3, three points of aggregation all produce the same value. The lowest point in a region where aggregation occurs is called a lower bound. In region 6, the lower bound is represented by the tuple <A41 B31 C16>.
Current OLAP customers are pushing towards data models that are larger and more sparse than can be efficiently supported by many conventional products. The observation of redundancy inherent when extremely sparse data is rolled up in a multidimensional space has been a topic in many research circles. Recent research has leaned toward avoiding storing or logically differentiating these redundant tuples. For example, “Quotient Cube: How to summarize the semantics of a data cube,” describes how an aggregated cube can be reduced to a series of classes that describe an entire aggregated cube with a fraction of the storage. If many of the values in the aggregate space repeat, an index of repeating values can be created in a logically contiguous region of a “cube lattice” and only store a single value. The build time for the structures described in the paper, however, includes a “brute force” approach which requires excessive processing resources and would accordingly not be feasible in business environments.
Therefore, there is a need for data processing of large amounts of sparse detail data that does not require infeasible amounts of processing time.