1. Field of Invention
The present invention relates generally to the field of databases. More specifically, the present invention is related to adaptive cell specific dictionaries for frequency partitioned multi-dimensional data.
2. Discussion of Related Art
IBM Smart Analytic Optimizer uses a multi-dimensional partitioning scheme known as Frequency Partitioning (FP) that partitions a table into cells. Each column of a table is logically divided into disjoint column partitions according to occurrence frequencies, and the cross-product of the column partitions defines the cells. Each column may be compressed using an encoding that is specified by a dictionary. The dictionary for a column defines the column's partitioning, and the encoding within a partition is fixed-length.
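The idea of dividing a column by occurrence frequency and using fixed-length codes inside each partition can be illustrated with a minimal sketch. It is a simplified illustration only, not the partitioning algorithm of any particular product: the function name `frequency_partition`, the `widths` parameter, and the sample data are all assumptions introduced here.

```python
from collections import Counter

def frequency_partition(values, widths):
    """Split a column's distinct values into partitions by descending frequency.

    `widths` (hypothetical parameter) lists the code width in bits of each
    leading partition; partition i holds up to 2**widths[i] values.
    All remaining values go into a final partition whose width is the
    smallest fixed width that can distinguish them.
    Returns a list of (width_in_bits, {value: code}) pairs.
    """
    ranked = [v for v, _ in Counter(values).most_common()]
    partitions, start = [], 0
    for w in widths:
        chunk = ranked[start:start + 2 ** w]
        if chunk:
            partitions.append((w, {v: code for code, v in enumerate(chunk)}))
        start += 2 ** w
    rest = ranked[start:]
    if rest:
        w = (len(rest) - 1).bit_length()  # ceil(log2(len(rest)))
        partitions.append((w, {v: code for code, v in enumerate(rest)}))
    return partitions

# Assumed sample column: one dominant default value and a few rare values.
column = ["NULL"] * 90 + ["a"] * 4 + ["b"] * 3 + ["c"] * 2 + ["d"]
parts = frequency_partition(column, widths=[0])
for width, dictionary in parts:
    print(width, dictionary)
```

In this assumed example the dominant value sits alone in a partition that needs no code bits, while the four rare values share a second partition with 2-bit codes.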
FIG. 1 shows an example table with three columns where the first column (A) and the third column (C) are each split into 3 partitions while the middle column (B) is split into only 2 partitions. This results in 3×2×3=18 different cells. FIG. 2 illustrates the resulting cells from the partitioning example discussed above. The code lengths within such a cell are constant across all tuples. The cell A2.B1.C2, for example, has all tuple values encoded with 5+4+5=14 bits.
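The cell count and the per-tuple code length in this example can be checked with a short sketch. Only the widths of A2, B1, and C2 (5, 4, and 5 bits) come from the example; every other width below is a placeholder chosen for illustration, and the names `code_bits` and `cell_bits` are assumptions.

```python
from itertools import product

# Code width in bits for each column partition.
# A2=5, B1=4, C2=5 are taken from the example; all other widths are assumed.
code_bits = {
    "A": {"A1": 2, "A2": 5, "A3": 8},
    "B": {"B1": 4, "B2": 7},
    "C": {"C1": 3, "C2": 5, "C3": 9},
}

# A cell is one choice of partition per column, i.e. the cross-product.
cells = list(product(*(parts.keys() for parts in code_bits.values())))

def cell_bits(cell):
    """Fixed tuple width of a cell: the sum of its partitions' code widths."""
    # The first character of a partition name ("A2" -> "A") names its column.
    return sum(code_bits[name[0]][name] for name in cell)

print(len(cells))                     # 3 * 2 * 3 cells
print(cell_bits(("A2", "B1", "C2")))  # 5 + 4 + 5 bits
```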
As the number of columns and partitions grows, the number of cells increases exponentially. Because a cell is created for each possible combination of column partitions, many cells are only sparsely populated (or even empty), because the combination of tuple values implicitly represented by the cell occurs only rarely in the real table data.
Each cell incurs costs in several respects: (i) overhead for administrative metadata and data structure alignments (“pages”), and (ii) processing overhead, as the encodings and tuple alignments are different for each cell.
Therefore the number of cells should be limited, and the number of tuples per cell should exceed a pre-determined “cell threshold” to compensate for the cell costs. Cells with only a few tuples are not desirable. Frequency partitioning can therefore be applied only up to the point where the overhead of creating more cells outweighs the benefit of shorter encoding; the number of cells that is practical for the given application is called the “cell budget” for the partitioning.
Experiments with typical database tables have revealed the following facts:
Tables in large customer databases often have 50 to several hundred columns. With the described approach, splitting only 30 of these columns into only two partitions each would lead to more than 1 billion cells; this is typically the same order of magnitude as the number of rows in the table. Therefore the necessity to treat all combinations of column partitions as separate cells puts a significant limit on the degree to which partitioning can be applied.
Partitioning a column is most beneficial when its value distribution is highly skewed, i.e., a small part of the occurring values has a very large probability. In a typical case, a single default for a column value (often NULL) occurs in 90% or more of all rows, while all other values have a combined probability of only 10%. If such 90/10 distributions are applied to a number of columns, the sizes of the resulting cells will differ greatly even under the assumption of statistical independence of columns; correlation of columns in the actual data amplifies the problem further. Often a small number of cells receive most of the tuples while many cells are almost (or even completely) empty.
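Both observations can be reproduced with simple arithmetic. The sketch below assumes 30 statistically independent columns, each split into a "default" partition holding 90% of the rows and a "rare" partition holding the remaining 10%; the function names are hypothetical and introduced only for this illustration.

```python
from math import comb

def cell_count(columns, partitions_per_column=2):
    """Number of cells when every column is split into the same number of partitions."""
    return partitions_per_column ** columns

def cells_with_at_most(columns, max_rare):
    """Number of cells in which at most `max_rare` columns fall into the rare partition."""
    return sum(comb(columns, k) for k in range(max_rare + 1))

def row_share_with_at_most(columns, max_rare, p_default=0.9):
    """Expected fraction of rows landing in those cells, assuming independent columns."""
    p_rare = 1 - p_default
    return sum(
        comb(columns, k) * p_rare ** k * p_default ** (columns - k)
        for k in range(max_rare + 1)
    )

total = cell_count(30)                 # 2**30, more than 1 billion cells
few = cells_with_at_most(30, 3)        # cells with at most 3 rare columns
share = row_share_with_at_most(30, 3)  # fraction of rows in those cells

print(total, few, round(share, 3))
```

Under these assumptions a few thousand of the roughly one billion cells are expected to receive well over half of all rows, which leaves the vast majority of cells nearly or completely empty.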
As a result, if the described FP approach is applied until a given cell budget is exhausted, this often leads to encodings where most of the rows of actual data go into only a few “large” cells. On the other hand, there are many “small” cells which receive so few rows that the overhead for managing an extra cell clearly outweighs the benefits of shorter encoding.
Another observation is that the fixed-length codes of the column partition are used to encode the values within a specific cell. In the same example of cell A2.B1.C2, 4 bits are used to encode all values of column B within that cell, but the actual count of distinct values for column B within cell A2.B1.C2 might be far smaller, which implies that the values within the cell could be encoded with fewer than 4 bits. The count of distinct values within a cell can be far smaller than the count of distinct values within a column partition due to the following scenarios:
Correlated columns: The column partitioning has been done for each column individually without taking correlation between columns into account. In an example with two columns such as car make and car model, and values such as Honda/Accord and BMW/735i, Honda and BMW might end up in different column partitions of the make column, and Accord and 735i in different column partitions of the model column. Cells are still created for the combinations of all partitions, but the real data will never contain the combinations Honda/735i or BMW/Accord, so the created cells will be very sparse and inefficiently encoded.
Sparse cells: The combination of all column partitions leads to cells with only a few tuples. Such a small number of tuples will probably show a far smaller distinct count for each column than the original column partitions. Continuing to use the fixed code width of the column partition therefore wastes bits per value.
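The wasted bits can be made concrete with a small sketch. It assumes a column partition with 16 distinct values (hence 4-bit codes) and a cell in which only three of those values actually occur; the value names, the sample cell contents, and the helper names `bits_for` and `cell_dictionary` are all hypothetical.

```python
def bits_for(distinct_count):
    """Fixed code width needed for `distinct_count` values: ceil(log2(n))."""
    return max(distinct_count - 1, 0).bit_length()

def cell_dictionary(cell_values):
    """Map the distinct values that occur in one cell to dense codes."""
    return {v: code for code, v in enumerate(sorted(set(cell_values)))}

# Assumed column partition: 16 distinct values, so 4-bit codes.
partition_values = [f"v{i:02d}" for i in range(16)]
# Assumed contents of one cell for that column: only 3 distinct values occur.
cell_values = ["v03", "v07", "v03", "v11", "v07", "v03"]

global_bits = bits_for(len(partition_values))
local_dict = cell_dictionary(cell_values)
local_bits = bits_for(len(local_dict))
saved_bits = (global_bits - local_bits) * len(cell_values)

print(global_bits, local_bits, saved_bits)
```

The sketch ignores the storage cost of the cell-level dictionary itself, so the saving it reports is an upper bound; that cost is one reason such a dictionary would pay off only for some cells.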
Many of the above mentioned problems result from the use of a global column-wise dictionary, which forces the cells to be strictly cross-products of the partitions and wastes bits since it cannot take advantage of data locality. Our solution to the above problems is based on cell-specific dictionaries that are applied adaptively to adequate cells. The introduction of cell-specific dictionaries allows extensions of several different aspects of the frequency partitioning algorithm that lead to significant compression improvements for common cases of frequency-partitioned multi-dimensional data like those described above.
Multi-dimensional indexing methods, such as R-trees and Kd-trees, partition a multi-dimensional value space into (possibly overlapping) hyper-rectangles. These trees often have non-grid-like partitioning, which superficially resembles the partitioning formed by merging or splitting the cells after FP (as described later). Methods for keeping R-trees balanced often involve splitting large rectangles or merging small rectangles. However, multi-dimensional indexes partition the space of values, and not the actual data records themselves. In multi-dimensional indexes, each rectangle corresponds (along each attribute) to a contiguous range of values, and values are not clustered into a cell based on frequency considerations. For example, if a column contains names of fruits and a rectangle in a multi-dimensional index holds apple and mango, the same rectangle must also hold the intermediate values banana and grape.
Using multiple dictionaries to compress a data set, including local dictionaries for blocks of data, is a common practice in the data compression literature. The widely used Lempel-Ziv algorithms and their variants all use some form of block-level dictionary. There are also existing methods that use one of multiple dictionaries in an adaptive fashion for compression. The cell-specific dictionaries described in this patent application differ in that a cell is not a physical unit like a block, which is made up of a contiguous region of an input database, but instead a logical unit defined by specific values for the record attributes. Records in an FP cell need not be physically clustered together.
U.S. Pat. No. 6,161,105 to Keighan et al. teaches an improved database data structure and datatype for storing, manipulating and accessing multidimensional spatial data in a database. However, neither Keighan et al. nor the well-known quad tree algorithms related to the patent use frequency partitioning with dictionary encoding, and they do not use the cells that are derived by FP.
Embodiments of the present invention are an improvement over prior art systems and methods.