1. Field of Invention
The present invention relates generally to the field of compression. More specifically, the present invention is related to a system and method for frequency partitioning using entropy compression with fixed size fields.
2. Discussion of Related Art
Compression is an important aspect of various computing and storage systems. Prior art compression schemes suffer from the drawback of being computationally intensive.
Lossless compression of relational data is a well-studied problem. Existing compression techniques work by eliminating three kinds of redundancy in relational data: repeated values, skewed data distributions and tuple ordering.
Repeated values are very common in real-world databases. Data items like prices, names, and flags can all be quite long and may appear in many places in a dataset. Dictionary coding, the process of replacing each instance of a data item with a short codeword and using a dictionary data structure to convert codewords back into values, can reduce the size of such data items.
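As an illustration of the dictionary coding described above, the following is a minimal sketch rather than any particular system's implementation. The function names (`build_dictionary`, `encode`, `decode`) and the sample column are hypothetical, and codewords are assumed to be small integers assigned in first-seen order.

```python
def build_dictionary(values):
    """Assign each distinct value an integer codeword, in first-seen order."""
    value_to_code = {}
    for v in values:
        if v not in value_to_code:
            value_to_code[v] = len(value_to_code)
    # dicts preserve insertion order (Python 3.7+), so list position == codeword
    code_to_value = list(value_to_code)
    return value_to_code, code_to_value


def encode(values, value_to_code):
    """Replace every value with its codeword."""
    return [value_to_code[v] for v in values]


def decode(codes, code_to_value):
    """Convert codewords back into the original values."""
    return [code_to_value[c] for c in codes]


# Hypothetical column with repeated values.
column = ["Smith", "Jones", "Smith", "Smith", "Garcia", "Jones"]
value_to_code, code_to_value = build_dictionary(column)
codes = encode(column, value_to_code)
restored = decode(codes, code_to_value)
```

Each long string is stored once in the dictionary, and the column itself holds only the short codewords.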
Skewed data distributions are also very common in real-world applications. Entropy coding is a version of dictionary coding that takes advantage of this skew by assigning shorter codes to more common values, while giving less common values longer codes. For example, while the first name column in a payroll database may support strings of up to 255 characters, in practice there may be only a few thousand names, and among these some names are much more common than others. By using a dictionary to store the values, we save on repeated names, and by using an entropy code to index the dictionary, we save bits for representing the most common values.
Entropy compression comprises a range of techniques for compressing data close to its entropy, the theoretical limit of compressibility as defined by Shannon's Information Theory. Entropy compression techniques must exploit skew, that is, differences in the frequencies of data values or combinations of values. Huffman coding and arithmetic coding are commonly used techniques for entropy coding. In either scheme, frequent data values are represented using short codes, less frequent values are represented with middle length codes, and infrequent values are represented with longer codes.
Seemingly inherent in entropy compression techniques is the property that they result in sequences of variable length codes. This is a problem because, as the codes are variable length, we need to determine the length of code i before we can start parsing code i+1; otherwise we would not know where code i+1 begins. Dealing with the codes one by one reduces the ability to parallelize the processing of many codes in a long sequence.
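To make the relationship between skew, entropy, and code length concrete, the sketch below computes the Shannon entropy of a small skewed sample and the codeword lengths that a standard Huffman construction assigns to it. The sample data and the function names are hypothetical; the construction is the textbook Huffman algorithm, not a method specific to this document.

```python
import heapq
import math
from collections import Counter


def shannon_entropy(values):
    """Entropy in bits per value: -sum(p * log2(p)) over distinct values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def huffman_code_lengths(values):
    """Return {symbol: codeword length in bits} for a Huffman code."""
    counts = Counter(values)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


# Hypothetical skewed sample: probabilities 1/2, 1/4, 1/8, 1/8.
sample = ["a"] * 8 + ["b"] * 4 + ["c"] * 2 + ["d"] * 2
entropy = shannon_entropy(sample)
lengths = huffman_code_lengths(sample)
average_bits = sum(lengths[v] for v in sample) / len(sample)
```

For this sample the most frequent value receives a 1-bit code and the two rarest values receive 3-bit codes, and the average code length equals the entropy because every probability is a power of two.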
Making efficient use of modern processors requires using parallelism. Modern processors have three forms of parallelism:
Processors contain multiple cores which can run independent threads or processes.
Each core can itself exploit Instruction Level Parallelism, where a processor can execute several instructions simultaneously as long as those instructions do not depend on each other.
Each instruction can exploit data parallelism, where long registers (64 or 128 bits in most cases) or vectors contain many data items packed closely together and manipulated as a unit.
Sequences of variable length codes make it hard to take advantage of instruction level or data level parallelism, and that limits the effectiveness of each core, slowing down the rate at which data values can be processed by a factor of 4 to 16, depending on the number of functional units and the width of the registers or vectors. Core level parallelism is not affected by traditional entropy encoding, provided it is done intelligently.
As mentioned earlier, a well-known type of entropy coding is Huffman coding, which produces prefix codes. In Huffman coding, shorter codewords are guaranteed not to be prefixes of longer codewords. As a result, each codeword implicitly encodes its own length as well as a value of the code. This property allows a compression system to pack codewords of different lengths together. During decompression, the system uses the implicit length information to find the boundaries of the packed codewords.
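The boundary-finding step described above can be sketched as follows. The codebook is a hypothetical prefix code chosen for illustration, and the bit stream is modeled as a string of "0" and "1" characters for readability rather than as packed machine words.

```python
# Hypothetical prefix code: no codeword is a prefix of another.
CODEBOOK = {"0": "a", "10": "b", "110": "c", "111": "d"}


def decode_prefix_stream(bits, codebook):
    """Decode packed codewords by scanning bit by bit.

    A codeword boundary is found only when the accumulated bits match a
    codeword, so codeword i+1 cannot be located before codeword i is parsed.
    """
    out = []
    current = ""
    for bit in bits:
        current += bit
        if current in codebook:
            out.append(codebook[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete codeword")
    return out


# Codewords of different lengths packed together with no separators.
packed = "0" + "10" + "0" + "111" + "110"
decoded = decode_prefix_stream(packed, CODEBOOK)
```

The loop carries a dependency from one codeword to the next, which is the sequential behavior that limits parallel processing of variable length codes.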
US Patent Publication 2005/0055367 teaches a method for breaking up an input into windows, such that each window has data of a different frequency distribution, and using a separate Huffman dictionary for each window. But, as noted above, Huffman coding uses variable length codes, and the publication provides no teaching or suggestion for using fixed length codes. Further, such techniques based on Huffman coding do not partition the input and use fixed length codes in each window. Further, US Patent Publication 2005/0055367 partitions ordered input, and so can only split it into windows that respect the given ordering of the input. No teaching or suggestion is provided by such techniques for partitioning databases, which are unordered collections of records, so that any partitioning that involves reordering the rows of the database (in particular, multi-dimensional partitioning) can be chosen.
Relational compression techniques also reduce the size of data by stripping out tuple ordering. Relations are sets, so any information about the order of tuples in a relation is redundant information. A system can remove this redundancy by sorting and delta-encoding compressed tuples. Instead of storing the binary representation of every tuple directly, delta-encoding represents each bit string as a difference, or delta, from the previous tuple's bit string. Since these deltas are relatively small numbers, they can be encoded in fewer bits than the compressed tuples, and can be further compressed using an entropy code.
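A minimal sketch of the sort-and-delta step follows. It assumes, purely for illustration, that each compressed tuple is represented as a non-negative integer; the function names and sample values are hypothetical.

```python
def delta_encode(tuplecodes):
    """Sort the tuplecodes, then store each one as a difference from the previous."""
    ordered = sorted(tuplecodes)  # tuple order carries no information, so sorting is safe
    deltas = []
    prev = 0
    for code in ordered:
        deltas.append(code - prev)
        prev = code
    return deltas


def delta_decode(deltas):
    """Undo delta encoding with a running sum; returns tuplecodes in sorted order."""
    out = []
    running = 0
    for d in deltas:
        running += d
        out.append(running)
    return out


# Hypothetical tuplecodes, treated as integers.
tuplecodes = [1007, 1002, 1015, 1003, 1002]
deltas = delta_encode(tuplecodes)
restored = delta_decode(deltas)
```

After the first entry, the deltas are much smaller than the tuplecodes themselves, so they need fewer bits. The original arrival order is not recoverable, which is acceptable because a relation is an unordered collection.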
The combination of using variable-length codewords, sorting, and delta encoding compresses relational data to within a constant factor of its absolute minimum size, or entropy.
In the context of an OLAP star schema, this compression typically proceeds in three passes over the data. First, the system analyzes the data to determine an optimal coding of the values from each column so as to approach entropy. Then it joins the fact table and dimension tables to form a single “universal” relation and at the same time converts each tuple of the relation into a tuplecode, or concatenation of codewords, by dictionary-coding each field. The final pass involves sorting and delta-encoding the tuplecodes and writing out the encoded delta values. Decompression happens in the reverse order: the delta-coding is undone and individual fields are decoded as needed, using the dictionary.
As we noted previously, variable-length dictionary codes are essential to achieving acceptable compression ratios when compressing relational data. Unfortunately, variable-length codes are also a major source of CPU overhead in today's compressed databases.
The compressor packs the individual compressed field codes of a tuple into a tuplecode. To access the i'th field of a tuplecode, the system must parse fields 1 through i−1 to determine their code lengths. This parsing creates control and data dependencies that severely impact performance on modern processors. Worse, these dependencies frustrate the goal of avoiding any decompression costs. We would like to avoid accessing the portions of the tuple that are not relevant to the query, but the cascading field offsets within a tuplecode force the system to compute many more code lengths than are necessary. Such overhead is a well known problem in the prior art.
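The cascading field offsets can be illustrated with the sketch below, which contrasts field access in a variable length tuplecode with access in a layout whose field widths are known in advance. All codebooks, widths, and function names are hypothetical, and bit strings are modeled as Python strings for readability. The fixed-width function is shown only for contrast and is not presented as any particular system's method.

```python
def read_codeword(bits, start, codebook):
    """Return the end position of the codeword that begins at `start`."""
    end = start + 1
    while bits[start:end] not in codebook:
        end += 1
        if end > len(bits):
            raise ValueError("no codeword found")
    return end


def get_field_variable(bits, codebooks, i):
    """Variable length codes: every earlier field must be parsed first."""
    pos = 0
    for col in range(i):
        pos = read_codeword(bits, pos, codebooks[col])
    end = read_codeword(bits, pos, codebooks[i])
    return codebooks[i][bits[pos:end]]


def get_field_fixed(bits, widths, decoders, i):
    """Fixed width fields: the offset is computed without parsing anything."""
    start = sum(widths[:i])
    return decoders[i][bits[start:start + widths[i]]]


# Hypothetical per-column prefix codes for (name, state, status).
codebooks = [
    {"0": "Smith", "10": "Jones", "11": "Garcia"},
    {"0": "NY", "1": "CA"},
    {"0": "active", "10": "retired", "11": "leave"},
]
variable_tuplecode = "10" + "1" + "10"  # Jones, CA, retired
status_variable = get_field_variable(variable_tuplecode, codebooks, 2)

# Hypothetical fixed-width layout for the same three columns.
widths = [2, 1, 2]
decoders = [
    {"00": "Smith", "01": "Jones", "10": "Garcia"},
    {"0": "NY", "1": "CA"},
    {"00": "active", "01": "retired", "10": "leave"},
]
fixed_tuplecode = "01" + "1" + "01"  # Jones, CA, retired
status_fixed = get_field_fixed(fixed_tuplecode, widths, decoders, 2)
```

In the variable length case, reading the third field requires determining the lengths of the first two codewords even though a query may never use their values.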
What is needed is a technique that somehow achieves entropy compression while resulting in sequences of fixed length codes that can be processed efficiently within a core.
Whatever the precise merits, features, and advantages of the above cited prior art techniques, none of them achieves or fulfills the purposes of the present invention.