1. Field of Invention
The present invention relates generally to the field of encoding. More specifically, the present invention is related to a method for compressing data with reduced dictionary sizes by coding value prefixes.
2. Discussion of Related Art
With the success of multi-core CPUs and data parallelism in GPUs, modern processors are once again gaining processing capability at the same rate that they are gaining transistors. However, access to memory has not been improving as quickly. This has led to a resurgence of interest in compression, not only because it saves disk space and network bandwidth, but because it makes good use of limited size memories and allows data to be transferred more efficiently through the data pathways of the processor and I/O system. The focus on using compression for increasing speed rather than capacity means that there is a new focus on the speed at which data can be decompressed. The speed of decompression is generally of more interest than the speed of compression because many data items are read-mostly; that is, they are compressed and written once, but read and decompressed many times.
A standard compression technique is dictionary coding. The idea is to place some frequent values in a dictionary, and assign them codewords, which are shorter (i.e., take fewer bits to represent) than the corresponding values. Then, occurrences of these values in a data set can be replaced by the codewords. This technique has been well studied and is widely used in different forms: a common example is Huffman Coding.
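As an illustration only, the dictionary coding idea described above can be sketched in Python using a textbook Huffman construction. The function names huffman_codes and encode, and the representation of codewords as strings of '0' and '1' characters, are assumptions made for readability and are not taken from any cited work.

```python
import heapq
from collections import Counter

def huffman_codes(values):
    """Return {value: bit string}; more frequent values get shorter codewords."""
    freq = Counter(values)
    if len(freq) == 1:
        # A single distinct value still needs a one-bit codeword.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {value: partial codeword}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {v: ""}) for i, (v, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codewords.
        merged = {v: "0" + c for v, c in left.items()}
        merged.update({v: "1" + c for v, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(values, codes):
    """Replace each value by its codeword and concatenate the bits."""
    return "".join(codes[v] for v in values)
```

In this sketch the dictionary is the mapping returned by huffman_codes; a decoder would need the inverse mapping, which is the structure whose size becomes a concern when the domain has many distinct values.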
The drawback with this standard dictionary coding is that dictionaries can get very large when the domain has many distinct values. For example, if a column in a database of retail transactions contains dates, then some dates might be much more common than others (say weekends, or days before major holidays), so dictionary coding might be considered. But the number of distinct dates can run into the tens of thousands.
Moreover, dictionary-based compression, the most common and effective form of compression, has a problem when running on modern processors: compression replaces values on the sequentially accessed input stream with shorter codes that have to be looked up in a dictionary. This converts cheap sequential accesses from the input stream into what is arguably the most expensive operation on a modern processor: a random read from main memory to access the value in the dictionary.
Random reads are expensive because they generally miss in the cache, access only a small portion of the cache line, and are not amenable to pre-fetching (see the paper to Baer et al. entitled, “An effective on-chip preloading scheme to reduce data access penalty”). One load instruction can cause the processor to stall for hundreds of cycles waiting for the result to arrive from memory, which has been getting slower relative to the processor as CPUs have gotten faster. With simultaneous multi-threading (SMT) the processor can overlap its resources by running a second hardware thread while the first thread waits for the load to complete. This works well if the miss rate on the threads is low, but if the thread miss rate is too high then all the threads can end up stalling for memory and the processor is forced to wait.
Random reads from the dictionary can be so expensive that they become the performance bottleneck. To remedy this, the size of the dictionary is limited so that it can fit in the on-chip memory of the processor. The straightforward approach to limiting the size of the dictionary is to place only a subset of the values in the dictionary and to place any values missing from the dictionary directly in the input, inlined. While this simple approach improves the memory behavior, it decreases the compression ratio, making the compression far less effective and decreasing the data transfer speed. This is especially true when the on-chip memory is small.
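The straightforward size-limited approach described above can be sketched as follows. This is a minimal illustration under stated assumptions: the reserved ESCAPE code, the choice of keeping the most frequent values, and the token representation as (code, raw value) pairs are all hypothetical and are chosen only to make the trade-off visible.

```python
from collections import Counter

ESCAPE = 0  # hypothetical reserved code meaning "raw value follows inline"

def build_bounded_dictionary(values, max_entries):
    """Keep only the max_entries most frequent values; codes start at 1."""
    most_common = Counter(values).most_common(max_entries)
    return {v: code for code, (v, _) in enumerate(most_common, start=1)}

def encode(values, dictionary):
    """Emit (code, None) for dictionary hits and (ESCAPE, value) for misses."""
    out = []
    for v in values:
        code = dictionary.get(v)
        if code is None:
            out.append((ESCAPE, v))   # value is inlined, not compressed
        else:
            out.append((code, None))
    return out

def decode(tokens, dictionary):
    """Invert encode using a small code-indexed lookup table."""
    table = [None] * (len(dictionary) + 1)
    for v, code in dictionary.items():
        table[code] = v
    return [raw if code == ESCAPE else table[code] for code, raw in tokens]
```

Every escaped token carries its full uncompressed value, which is why the compression ratio degrades as max_entries shrinks.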
Once the dictionary fits in the on-chip memory of the processor, the next bottleneck in processing compressed data is caused by mispredicted branches during decompression. This occurs because traditional Huffman code processing executes one or more unpredictable branches for each code it processes. A branch is unpredictable if it is taken or not taken with similar probability. Each mispredicted branch can take the processor dozens of cycles to recover (see, for example, the paper to Sprangle et al. entitled, “Increasing processor performance by implementing deeper pipelines”).
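The per-code branching of traditional Huffman decoding can be illustrated with a bit-by-bit tree walk. The sketch below is an assumption-laden illustration rather than any particular decoder: the names build_tree and decode_bitwise, the nested-dict tree, and the decision counter are hypothetical, and Python only shows the shape of the control flow, not the hardware misprediction cost.

```python
def build_tree(codes):
    """Turn {value: bit string} into a binary trie of nested dicts.

    Assumes the codewords are prefix-free and that no value is itself a dict.
    """
    root = {}
    for value, bits in codes.items():
        node = root
        for b in bits[:-1]:
            node = node.setdefault(b, {})
        node[bits[-1]] = value
    return root

def decode_bitwise(bits, root):
    """Decode a bit string by walking the trie one bit at a time.

    Returns the decoded values and the number of data-dependent decisions
    taken, one per input bit.
    """
    out = []
    node = root
    decisions = 0
    for b in bits:
        decisions += 1
        nxt = node[b]              # direction depends on the data bit
        if isinstance(nxt, dict):
            node = nxt             # interior node: keep walking
        else:
            out.append(nxt)        # leaf: emit value and restart at the root
            node = root
    return out, decisions
```

Because each bit of a well-compressed stream is close to equally likely to be 0 or 1, the direction taken at each step is of the unpredictable kind described above.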
The entropy coding of data was one of the founding topics in information theory, and Huffman's seminal work (in the paper entitled, “A method for the construction of minimum redundancy codes”) on entropy coding broke the ground for all that followed. Huffman codes separately encode each value as a bit string that can be compared and manipulated independently from the rest of the data.
Most work on fixed size dictionaries has been done in the context of online adaptive dictionary maintenance as embodied in LZ compression (see, for example, the paper to Lempel et al. entitled, “On the complexity of finite sequences”), where the dictionary is constantly changing in response to the data that has been compressed so far. In LZ compression and most of its variants, the dictionary is built over a window, a fixed size sequence of previously scanned characters that are searched during compression for repeating strings and retained during decompression to be used by the dictionary. A primary topic of interest in this field is online item replacement algorithms, which determine when to replace an existing dictionary entry with a new entry or when to change the Huffman code associated with an entry. This problem has been shown to be difficult (for example, see the paper to Agostino et al. entitled, “Bounded size dictionary compression: SCk-completeness and NC algorithms”).
Length-limited coding is a generalization of fixed size dictionaries in which codes may have lengths between 1 and some integer L. One of the longest lasting contributions in the area is the paper to Larmore et al. entitled, “A fast algorithm for optimal length-limited Huffman Codes,” which introduces an O(nL) algorithm for computing optimal compression codes of bounded length L. In a more system-specific approach, the paper to Liddell et al. entitled, “Length-restricted coding using modified probability distributions,” optimally restricts the length of compression codes to a size that fits in a word of the underlying architecture. Like the papers to Milidiu et al. (entitled, “Efficient implementation of the WARM-UP algorithm for the construction of length-restricted prefix codes” and “In-place length-restricted prefix coding”), Liddell et al. also provide heuristics that run significantly faster than the algorithm described in the paper to Larmore et al., while suffering little loss in compression performance. An additional improvement to the fundamental algorithm is introduced by Turpin et al. in the paper entitled, “Practical Length-Limited Coding for large Alphabets,” which also teaches producing optimal results in the same running time using only O(n) space.
Online adaptive dictionary maintenance, albeit quite popular in some domains, is too expensive for many applications and has little gain in coding ability. Worst of all, it also interferes with parallelization as the adaptations have to be taken into account before the next item can be decoded.
In order to manage very large datasets practically, it is often necessary to sample the input to produce a representative summary of its content. Moreover, it is crucial to be able to do this efficiently in one pass, given the sometimes enormous size of current databases. Historically, two alternative sampling methods meet these requirements: reservoir sampling (see the paper to Vitter et al. entitled, “Random sampling with a reservoir”) and Bernoulli sampling (see the paper to Olken et al. entitled, “Simple random sampling from relational databases”), based on replacement and accept/reject policies, respectively. Recently, the paper to Gemulla et al. entitled, “A dip in the reservoir: maintaining sample synopses of evolving datasets,” introduced a modification of reservoir sampling to consistently manage online deletion of samples during sampling. In order to make an informed decision regarding the appropriate size of a dictionary, an estimation of the dataset cardinality is necessary. To compute a statistically reliable cardinality estimation, the paper to Markl et al. entitled, “Maxent: consistent cardinality estimation in action,” looks at the distribution of the uniform hash values of an input sample and extrapolates the measured hash value density to the whole range of values of the hash function. Related to the specific problem of compressed databases, the paper to Srivastava et al. entitled, “Consistent histogram construction using query feedback,” uses query feedback information to elaborate a dataset histogram based on the principle of maximum entropy, keeping its size always bounded.
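The two one-pass sampling policies mentioned above can be sketched in a few lines of Python. This is the basic textbook form of each method, offered as an assumption rather than as the optimized algorithms of the cited papers; the function names and the explicit rng parameter (a random.Random instance) are illustrative choices.

```python
import random

def reservoir_sample(stream, k, rng):
    """Replacement policy: keep a uniform sample of k items in one pass."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # uniform index in [0, i]
            if j < k:
                reservoir[j] = item      # replace an existing sample
    return reservoir

def bernoulli_sample(stream, p, rng):
    """Accept/reject policy: keep each item independently with probability p."""
    return [item for item in stream if rng.random() < p]
```

Reservoir sampling yields a fixed sample size regardless of input length, whereas the size of a Bernoulli sample varies with the input and with p.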
Recently, Raman et al. (in the paper entitled, “How to wring a table dry: entropy compression of relations and querying of compressed relations”) use a delta coding of consecutive values in a column-oriented database in order to minimize the computations required for decompression, even allowing some operations to be performed on the uncompressed codes.
In U.S. Pat. No. 6,518,895, compressed symbols are parsed to determine in which class they belong, in order to choose from a set of class-specific look-up tables, thus reducing the required number of branch instructions. The PNG image compression format (see for example, the article by Hammel entitled, “Png: The definitive guide”) groups codes with similar length categories that will be accessed via a length identification code. Then a small number of bits determine the actual code length, allowing an efficient decoding.
Whatever the precise merits, features, and advantages of the prior art methods, none of them achieves or fulfills the purposes of the present invention.