1. Technical Field
The present invention relates to OLAP (On-Line Analytical Processing) and data warehouse applications. Specifically, it relates to data structures that map multidimensional data onto linear storage media.
2. Description of Prior Art
Data Warehouse and OLAP applications have highlighted the need for fast and efficient methods to dynamically and symmetrically store, maintain, and retrieve multidimensional data. Ideally, such methods would simultaneously provide real-time update and query capabilities from any combination of participating dimensions or dimension groups. To date, other multidimensional storage and retrieval methods have not been able to provide an algorithm and data structure that simultaneously deliver dynamic maintenance capabilities, efficient storage and operation, and symmetric clustering from any combination of participating dimensions or groups of dimensions.
Hypercubes are the classic commercial approach. Hypercubes store each possible combination of values of all participating dimensions in their data structures. This approach is very inefficient in space usage, I/O (Input/Output) operations, and computer processing cycles. Hypercubes are also not symmetric. The position or column number of a dimension in a Hypercube, together with the dimension cardinalities, determines the dimension's degree of clustering in the data. A high-cardinality dimension at position 0 might dominate all clustering in the data, while a low-cardinality dimension at position N−1, where N is the number of dimensions in the Hypercube, might not participate in the clustering of the data at all. Compression assists Hypercubes with storage efficiency but further skews symmetry amongst dimensions. Finally, Hypercubes are not dynamic. The Hypercube data structure allocates a predetermined number of cells and requires full reorganization to alter this predetermined number of cells.
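The dependence of clustering on dimension position can be illustrated with a minimal sketch. It assumes a dense Hypercube laid out in row-major order (dimension 0 varying slowest), which is an assumption made for illustration and not a layout stated above; the names cell_offset and span_for_value are hypothetical.

```python
def cell_offset(coords, cardinalities):
    """Row-major offset of a cell in a dense hypercube.

    Assumption for illustration: dimension 0 varies slowest.
    """
    offset = 0
    for c, n in zip(coords, cardinalities):
        if not 0 <= c < n:
            raise ValueError("coordinate out of range")
        offset = offset * n + c
    return offset


def span_for_value(dim, value, cardinalities):
    """Number of consecutive offsets that must be covered to reach every
    cell whose coordinate on `dim` equals `value`."""
    lo = [0] * len(cardinalities)
    hi = [n - 1 for n in cardinalities]
    lo[dim] = hi[dim] = value
    return cell_offset(hi, cardinalities) - cell_offset(lo, cardinalities) + 1


CARDINALITIES = (4, 3, 2)  # 24 cells in total

# Cells sharing a value of dimension 0 are contiguous: a span of 6 offsets.
first_dim_span = span_for_value(0, 1, CARDINALITIES)

# Cells sharing a value of the last dimension are scattered over 23 of
# the 24 offsets, so that dimension contributes almost no clustering.
last_dim_span = span_for_value(2, 1, CARDINALITIES)
```

Under this assumed layout, a query on dimension 0 reads one compact run of cells, while a query on the last dimension touches nearly the whole structure.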
Conventional grid or tree approaches with multidimensional clustering such as grid files, hB-trees, MDB trees, R-trees, and the like have some success in symmetrically clustering data but fail to dynamically and effectively maintain symmetry and are not efficient in terms of computational complexity, I/O, computer processing cycles, memory, disk space, concurrency management, predictability, and recoverability.
More recently, relational database vendors have adopted bit mapped indexes. These data structures do not truly provide symmetry. They typically provide one bit mapped index for each participating dimension. These indexes are usually compressed and are very small in terms of disk storage space. Access is available from any combination of dimensions. However, the technique does nothing to cluster the underlying data. As a result, bit mapped indexes do not enforce symmetry. If dimension A determines the sort order or clustering of the multidimensional data, and dimension A has a high enough cardinality that all multidimensional elements associated with each key value of dimension A require no more than one complete database block, then, despite the bit mapped indexes, a query on any dimension other than A will necessitate a full scan of every block in the multidimensional data.
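The scenario in the preceding paragraph can be sketched as follows. The data set, the block size, and the names bitmap_for and blocks_touched are all hypothetical, and the bitmap here is an uncompressed list of booleans rather than any vendor's index format.

```python
def bitmap_for(rows, dim, value):
    """Uncompressed bitmap: one flag per row position, set where the row matches."""
    return [row[dim] == value for row in rows]


def blocks_touched(bitmap, rows_per_block):
    """Count the distinct storage blocks that contain at least one matching row."""
    return len({pos // rows_per_block for pos, hit in enumerate(bitmap) if hit})


# Hypothetical data: dimension A (8 members) determines the sort order,
# dimension B (4 members) does not. Each A value fills exactly one block.
ROWS_PER_BLOCK = 4
rows = sorted((a, b) for a in range(8) for b in range(4))
total_blocks = len(rows) // ROWS_PER_BLOCK           # 8 blocks

blocks_for_a = blocks_touched(bitmap_for(rows, 0, 3), ROWS_PER_BLOCK)  # 1 block
blocks_for_b = blocks_touched(bitmap_for(rows, 1, 2), ROWS_PER_BLOCK)  # all 8 blocks
```

The bitmap on B identifies the matching rows exactly, but because it does not change where those rows are stored, every block still has to be read.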
Less dynamic data structures can more easily provide true symmetry. U.S. Pat. No. 6,003,036 details one such example. This technique uses interval partitioning to ensure that each dimension or dimension group of two or more dimensions participates equally in the partitioning process and successfully stores multidimensional data with one or more dimensions or dimension groups. It is also efficient in terms of complexity, computer processing time, I/O, recoverability, concurrency, and disk space. However, it does not allow unlimited updates to partitions without complete data reorganization when the data exceeds predetermined thresholds.
Z-ordering and other multidimensional mapping curves are relatively new and promising approaches to multidimensional data clustering. They involve interleaving of bits from dimension key values from predetermined ranges of numbers in order to provide symmetry for each participating dimension. In the form of Universal or UB-Trees, these techniques take advantage of already well established and proven B-trees or their derivatives and are therefore very efficient. They fall short of true symmetry enforcement, however. With Z-ordering, symmetry depends on the number of bits in each dimension key value. If dimension A has 256 distinct members and is able to represent them all with 8 bits while dimension B has 100 distinct members but requires 16 bits to represent them all, then dimension A might dominate the clustering process and starve dimension B in its clustering, thus compromising symmetry. Hilbert Orderings and Gray Coding are attempts to improve this technique for some multidimensional applications but still fall short of controlling symmetry in OLAP applications because they still must predetermine the number of bits for each participating dimension. Z-ordering and its derivatives are also not truly dynamic since they must set or predetermine upper ranges on each dimension. With Z-ordering, Hilbert Orderings, and Gray Coding, a tradeoff exists between symmetry and the upper limit of each dimension or between dynamic operation and symmetry.
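A minimal sketch of bit interleaving with predetermined bit widths is given below. It follows one common convention (bits aligned at the least significant end, dimensions taken in order at each bit level); actual Z-ordering and UB-Tree implementations may align and order bits differently, and the name z_value is hypothetical.

```python
def z_value(keys, bit_widths):
    """Interleave the bits of each key into a single Z-order value.

    Assumed convention: bit level 0 of every dimension comes first, then
    bit level 1, and so on; a dimension stops contributing once its
    predetermined width is exhausted.
    """
    for key, width in zip(keys, bit_widths):
        if not 0 <= key < (1 << width):
            # The predetermined width is a hard upper limit on the key range.
            raise ValueError(f"key {key} does not fit in {width} bits")
    z = 0
    out_pos = 0
    for level in range(max(bit_widths)):
        for key, width in zip(keys, bit_widths):
            if level < width:
                z |= ((key >> level) & 1) << out_pos
                out_pos += 1
    return z


# Two dimensions of 2 bits each: the bits alternate A, B, A, B.
equal_widths = z_value((2, 1), (2, 2))        # 0b0110 == 6

# An 8-bit dimension next to a 16-bit dimension: the upper 8 bits of the
# wider key have nothing to interleave with.
unequal_widths = z_value((1, 256), (8, 16))   # 0x10001 == 65537
```

With unequal widths, the bits beyond the shorter width are not interleaved with anything, which is one way the asymmetry described above can arise. Which dimension ends up favored depends on the alignment convention and on the data, and this sketch makes no claim about that. A key outside its predetermined range is rejected outright, reflecting the fixed upper limit on each dimension.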
U.S. Pat. No. 6,460,026 uses the technique of multidimensional data ordering or prioritizing participating sets of dimensions hierarchically and uses these priorities in the overall ordering or clustering process. These dimension sets are the same as dimension groups or groups of keys in U.S. Pat. No. 6,003,036. The overall technique in U.S. Pat. No. 6,460,026 is more dynamic than the technique in U.S. Pat. No. 6,003,036 but does not enforce symmetry as well as the technique in U.S. Pat. No. 6,003,036. In addition, the technique in U.S. Pat. No. 6,460,026 is not fully dynamic.
In regard to symmetry, when the technique in U.S. Pat. No. 6,460,026 uses dimensional priorities for simple or multiple orderings, the first few dimensions in the orderings have the ability to completely dominate the clustering process depending on cardinality. In general, the clustering factor of each dimension participating in a non-symmetric ordering depends on its position or priority in the overall ordering. Despite the fact that this method also works with Z-ordering, Hilbert Orderings, and Gray Coding, none of these techniques completely corrects potential symmetry flaws. For example, if multidimensional data ordering uses Z-ordering and defines a dimension group or hierarchy of dimensions G containing dimensions A, B, and C, each requiring a large number of bits, and another dimension H that clusters the data with a small number of bits, this technique is not likely to provide symmetric access to the multidimensional data. The technique in U.S. Pat. No. 6,460,026 is unable to enforce true symmetry but rather is susceptible to data content because of Z-ordering or other static bit interleaving techniques, as mentioned previously, and the various ordering priorities that U.S. Pat. No. 6,460,026 discloses. The technique in U.S. Pat. No. 6,460,026 also has symmetry problems relating to hierarchies or multiple levels in dimension groups. Multidimensional data ordering must designate a predetermined number of bits for all children of all parents in a hierarchy because of the rules of Z-ordering and the other static bit interleaving techniques. Different parents will typically have different numbers of children and will therefore need different numbers of bits. Extra, unnecessary bits have the potential to adversely affect symmetry. These flaws or weaknesses in U.S. Pat. No. 6,460,026 are crucial since the primary rationale for multidimensional data structures is to provide symmetry amongst participating dimensions or dimension groups.
The technique in U.S. Pat. No. 6,460,026 is also not dynamic. When the technique employs Z-ordering, Hilbert Orderings, or Gray Coding, it must predetermine the number of bits for each dimension and each level within a dimension group. The predetermined number of bits for each level in a dimension group must be the maximum for all parents on the previous level in the entirety of the data. If the number of members or elements in the dimension increases beyond what the predetermined number of bits can represent, then the entire data structure requires reorganization before it can grow beyond the predetermined limits for each dimension or level within a dimension group. If the predetermined limit is set too high for some dimensions or levels within dimension groups, then the algorithm in U.S. Pat. No. 6,460,026 might not enforce symmetry effectively because of the unused bits or padding.
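The arithmetic behind the predetermined bit counts can be sketched as follows. This is an illustration of the sizing rule described in the paragraph above and not code from U.S. Pat. No. 6,460,026; the function names and the sample child counts are hypothetical.

```python
def bits_needed(member_count):
    """Smallest number of bits that can label `member_count` distinct members."""
    return max(1, (member_count - 1).bit_length())


def level_width(children_per_parent):
    """Fixed width for one hierarchy level, sized for the parent with the most children."""
    return bits_needed(max(children_per_parent))


def unused_codes(children_per_parent):
    """Padding per parent: code points the fixed width reserves but the parent never uses."""
    capacity = 1 << level_width(children_per_parent)
    return [capacity - n for n in children_per_parent]


def needs_reorganization(children_per_parent, new_child_count):
    """True if a parent growing to `new_child_count` children exceeds the fixed width."""
    return bits_needed(new_child_count) > level_width(children_per_parent)


children = [2, 9, 3]                    # hypothetical child counts for three parents
width = level_width(children)           # 4 bits, driven by the parent with 9 children
padding = unused_codes(children)        # [14, 7, 13] unused code points
grow_ok = needs_reorganization(children, 16)    # False: 16 children still fit in 4 bits
grow_bad = needs_reorganization(children, 17)   # True: 17 children need 5 bits
```

In this sample, the parent with two children carries 14 unused code points because the level is sized for its largest sibling, and one additional child beyond 16 under any parent forces a wider level.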
In summary, U.S. Pat. No. 6,460,026 cannot dynamically maintain symmetry. The invention contained herein efficiently maintains symmetry as well as or better than U.S. Pat. No. 6,460,026 and without predetermined limits on dimensions that prevent dynamic operation. Unlike the multidimensional data ordering technique in U.S. Pat. No. 6,460,026, the invention disclosed herein does not require a tradeoff between dynamic operation and symmetry.