The present disclosure relates generally to compression, and more specifically, to data compression using dictionary encoding.
Compression is an important aspect of various computing and storage systems. Most prior art compression schemes suffer from the drawback of either being computationally intensive or not compressing efficiently.
Lossless compression of relational data is a well-studied problem. Existing compression techniques work by eliminating three kinds of redundancy in relational data: repeated values, skewed data distributions and tuple ordering.
Repeated values are very common in real-world databases. Data items like prices, names, and flags can all be quite long and may appear in many places in a dataset. Dictionary coding, the process of replacing each instance of a data item with a short code-word and using a dictionary data structure to convert code-words back into values, can reduce the size of such data items.
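As a minimal sketch of dictionary coding, the following Python example replaces each value of a column with a small integer code-word and converts the code-words back through a lookup table. The function names and the sample column are illustrative assumptions, not part of any particular system.

```python
def build_dictionary(values):
    """Assign each distinct value a small integer code-word (illustrative)."""
    distinct = sorted(set(values))
    value_to_code = {value: code for code, value in enumerate(distinct)}
    # The sorted list doubles as the code -> value table.
    return value_to_code, distinct


def dict_encode(values, value_to_code):
    """Replace every value with its code-word."""
    return [value_to_code[v] for v in values]


def dict_decode(codes, code_to_value):
    """Convert code-words back into the original values."""
    return [code_to_value[c] for c in codes]


# Hypothetical column with many repeated values.
column = ["pending", "shipped", "shipped", "pending", "returned", "shipped"]
value_to_code, code_to_value = build_dictionary(column)
codes = dict_encode(column, value_to_code)
roundtrip = dict_decode(codes, code_to_value)
```

Each long string is stored once in the dictionary, and the column itself holds only short codes, which is where the saving comes from when values repeat.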
Skewed data distributions are also very common in real-world applications. Entropy coding is a version of dictionary coding that takes advantage of this skew by assigning shorter codes to more common values, while giving less common values longer codes. For example, while the first name column in a payroll database may support strings of up to 255 characters, in practice there may be only a few thousand names, and among these some names are much more common than others. By using a dictionary to store the values, we save on repeated names, and by using an entropy code to index the dictionary, we save bits for representing the most common values.
Entropy compression comprises a range of techniques for compressing data close to its entropy, the theoretical limit of compressibility as defined by Shannon's information theory. Entropy compression techniques must exploit skew, that is, differences in the frequencies of data values or combinations of values. Huffman coding and arithmetic coding are commonly used techniques for entropy coding. In either scheme, frequent data values are represented using short codes, less frequent values are represented with medium-length codes, and infrequent values are represented with longer codes.
Seemingly inherent in entropy compression techniques is the property that they result in sequences of variable-length codes. This is a problem because, as the codes are of variable length, we need to determine the length of code i before we can start parsing code i+1; otherwise we would not know where code i+1 begins. Dealing with the codes one by one reduces the ability to parallelize the processing of many codes in a long sequence.
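To illustrate how an entropy code follows the skew of the data, the sketch below computes the Shannon entropy of a column and the code lengths that a textbook Huffman construction would assign. The helper names and the sample data are assumptions made for illustration only.

```python
import heapq
import math
from collections import Counter


def shannon_entropy(values):
    """Entropy in bits per value of the empirical distribution."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def huffman_code_lengths(values):
    """Return {symbol: code length in bits} for a Huffman code."""
    counts = Counter(values)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical skewed first-name column.
names = ["Ann"] * 8 + ["Bob"] * 4 + ["Cy"] * 2 + ["Di"] * 2
lengths = huffman_code_lengths(names)
average_bits = sum(lengths[n] for n in names) / len(names)
entropy_bits = shannon_entropy(names)
```

For this particular distribution, whose frequencies are all powers of one half, the average Huffman code length equals the entropy; in general a Huffman code comes within one bit per value of it.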
Making efficient use of modern processors requires using parallelism. Modern processors have three forms of parallelism. Processors contain multiple cores which can run independent threads or processes. Each core can itself exploit Instruction Level Parallelism, where a processor can execute several instructions simultaneously as long as those instructions do not depend on each other. Each instruction can exploit data parallelism, where long registers (64 or 128 bits in most cases) or vectors contain many data items packed closely together and manipulated as a unit.
Sequences of variable-length codes make it hard to take advantage of instruction-level or data-level parallelism. This limits the effectiveness of each core and slows down the rate at which data values can be processed by a factor of 4 to 16, depending on the number of functional units and the width of the registers or vectors. Core-level parallelism is not affected by traditional entropy encoding, provided the encoding is done intelligently.
As mentioned earlier, a well-known type of entropy coding is Huffman coding, which produces prefix codes. In Huffman coding, shorter code-words are guaranteed not to be prefixes of longer code-words. As a result, each code-word implicitly encodes its own length as well as a value of the code. This property allows a compression system to pack code-words of different lengths together. During decompression, the system uses the implicit length information to find the boundaries of the packed code-words.
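The following sketch shows how the prefix property lets a decoder find code-word boundaries in a packed bit stream without separators. The code table is a hypothetical example, and bits are modeled as a string of "0" and "1" characters for readability rather than as packed machine words.

```python
# Hypothetical prefix code: no code-word is a prefix of another.
CODE = {"Ann": "0", "Bob": "10", "Cy": "110", "Di": "111"}


def pack(values, code=CODE):
    """Concatenate variable-length code-words with no separators."""
    return "".join(code[v] for v in values)


def unpack(bits, code=CODE):
    """Recover the values by reading bits until a code-word is complete."""
    inverse = {codeword: value for value, codeword in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        # The prefix property guarantees the first match is the right one.
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code-word")
    return out


packed = pack(["Bob", "Ann", "Di", "Cy"])
unpacked = unpack(packed)
```

Note that the decoder is inherently sequential: the start of each code-word is known only after the previous code-word has been fully consumed.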
Relational compression techniques also reduce the size of data by stripping out tuple ordering. Relations are sets, so any information about the order of tuples in a relation is redundant information. A system can remove this redundancy by sorting and delta-encoding compressed tuples. Instead of storing the binary representation of every tuple directly, delta-encoding represents each bit string as a difference, or delta, from the previous tuple's bit string. Since these deltas are relatively small numbers, they can be encoded in fewer bits than the compressed tuples, and can be further compressed using an entropy code.
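A minimal sketch of sorting and delta-encoding follows, treating each compressed tuple as an integer. The function names and sample values are illustrative assumptions; the first delta is taken relative to zero so that the sequence can be rebuilt.

```python
def delta_encode(tuplecodes):
    """Sort the tuplecodes and store each as a difference from its predecessor."""
    deltas, previous = [], 0
    for code in sorted(tuplecodes):
        deltas.append(code - previous)
        previous = code
    return deltas


def delta_decode(deltas):
    """Undo the delta-coding with a running sum; yields the sorted tuplecodes."""
    out, running = [], 0
    for delta in deltas:
        running += delta
        out.append(running)
    return out


# Hypothetical tuplecodes in arrival order.
tuplecodes = [1007, 1002, 1015, 1003]
deltas = delta_encode(tuplecodes)
restored = delta_decode(deltas)
```

The original arrival order is deliberately not recoverable: because a relation is a set, only the sorted sequence needs to survive, and after the first entry the deltas are much smaller than the tuplecodes themselves.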
In the context of an online analytical processing (OLAP) star schema, this compression typically proceeds in three passes over the data. First, the system analyzes the data to determine an optimal coding of the values from each column so as to approach entropy. Then it joins the fact table and dimension tables to form a single “universal” relation and at the same time converts each tuple of the relation into a tuplecode, or concatenation of code-words, by dictionary-coding each field. The final pass involves sorting and delta-encoding the tuplecodes and writing out the encoded delta values. Decompression happens in the reverse order: the delta-coding is undone and individual fields are decoded as needed, using the dictionary.
As we noted previously, variable-length dictionary codes are essential to achieving acceptable compression ratios when compressing relational data. Unfortunately, variable-length codes are also a major source of central processing unit (CPU) overhead in today's compressed databases.
The compressor packs the individual compressed field codes of a tuple into a tuplecode. To access the i'th field of a tuplecode, the system must parse fields 1 through i−1 to determine their code lengths. This parsing creates control and data dependencies that severely impact performance on modern processors. Worse, these dependencies frustrate the goal of avoiding any decompression costs. We would like to avoid accessing the portions of the tuple that are not relevant to the query, but the cascading field offsets within a tuplecode force the system to compute many more code lengths than are necessary. Such overhead is a well-known problem in the prior art.
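The cascading offsets can be seen in a small sketch. The per-column code tables below are hypothetical, and bits are again modeled as a character string; the point is only that reading one field requires parsing every field in front of it.

```python
# Hypothetical per-column prefix codes (code-word -> value).
COLUMN_CODES = [
    {"0": "US", "10": "DE", "11": "JP"},     # column 0: country
    {"0": "open", "1": "closed"},            # column 1: status
    {"0": "low", "10": "mid", "11": "high"}, # column 2: priority
]


def read_field(tuplecode, index):
    """Return (value of field `index`, number of code-words parsed)."""
    pos, parsed, value = 0, 0, None
    for col in range(index + 1):
        table = COLUMN_CODES[col]
        end = pos + 1
        # Extend the candidate code-word one bit at a time until it matches.
        while tuplecode[pos:end] not in table:
            end += 1
            if end > len(tuplecode):
                raise ValueError("malformed tuplecode")
        value = table[tuplecode[pos:end]]
        pos = end
        parsed += 1
    return value, parsed


# "10" (DE) + "1" (closed) + "11" (high)
tuplecode = "10111"
priority, work = read_field(tuplecode, 2)
```

Even though the query in this sketch needs only the last field, all three code-words are parsed, and each parse depends on the result of the previous one.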
Determining cells for efficient compression of table data based on frequency partitioning suffers from several problems. The number of cells grows exponentially with the number of columns in the table, so the computational effort also grows exponentially. The number of columns also puts pressure on memory requirements, because the more columns the original table has, the more frequency histograms have to be built. Furthermore, scanning all the data to build the frequency histograms requires that all data be processed twice: first for building the frequency histograms and later for the actual compression. If the data volume exceeds the available memory resources, the data has to be written to an external disk. Once the frequency histograms have been built, newly arriving data cannot be encoded, and a re-encoding is necessary.
In frequency partitioning, a fixed cell block size is used in most cases. The cells containing less frequent values in one of the columns are often partially empty, so memory space is wasted.
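The exponential growth can be made concrete with a small calculation. The sketch assumes that each column is split into a handful of frequency partitions and that a cell is one combination of a partition from every column; the partition counts are hypothetical.

```python
import math


def cell_count(partitions_per_column):
    """Number of cells if every combination of column partitions forms a cell."""
    return math.prod(partitions_per_column)


# Assumed: every column is split into 4 frequency partitions.
cells_5_columns = cell_count([4] * 5)
cells_10_columns = cell_count([4] * 10)
```

Under these assumptions, doubling the number of columns from 5 to 10 raises the cell count from 1,024 to 1,048,576, which is the growth the paragraph above describes.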