The present invention relates in general to data compression and data encoding. In particular, the present invention relates to generating occurrence information for data values in a data set to be encoded or compressed.
Data compression is an important aspect of various computing and storage systems. While data warehouses are discussed in some detail as an example of systems where data compression is relevant, it is appreciated that data compression and efficient handling of compressed data are relevant in many other systems where large amounts of data are stored. In general, data warehouses are repositories of an organization's electronically stored data, which are designed to facilitate reporting and analysis.
The effectiveness of data warehouses that employ table scans for fast processing of queries relies on efficient compression of the data. With an adequate data compression method, table scans can be applied directly to the compressed data, instead of having to decode each value first. Also, well-designed algorithms can scan over multiple compressed values that are packed into one machine word in each loop iteration. Therefore, a shorter code typically means a faster table scan. The following compression methods are well known. Dictionary-based compression replaces a value with a dictionary code, which is effective when the value space is large but the set of actual values (cardinality) is relatively much smaller. Offset-based compression compresses data by subtracting a common base value from each of the original values and uses the remaining offset to represent the original value. Prefix-offset compression encodes a value by splitting its binary representation into prefix bits and offset bits, and concatenating the dictionary code of the prefix bits with the offset bits to form the code.
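By way of illustration only, the three well-known methods described above may be sketched in Python as follows. The function names, the choice of the minimum value as the common base value, the restriction to non-negative integers in the prefix-offset case, and the assignment of dictionary codes in sorted order are assumptions made for this sketch and are not taken from any particular system.

```python
def dictionary_encode(values):
    """Dictionary-based compression: replace each value with a small integer code.

    Assumption: codes are assigned in sorted order of the distinct values.
    """
    dictionary = {v: code for code, v in enumerate(sorted(set(values)))}
    return [dictionary[v] for v in values], dictionary


def offset_encode(values):
    """Offset-based compression: subtract a common base value from each value.

    Assumption: the minimum value is used as the common base.
    """
    base = min(values)
    return base, [v - base for v in values]


def prefix_offset_encode(values, offset_bits):
    """Prefix-offset compression for non-negative integers.

    The low `offset_bits` bits of each value are kept as the offset; the
    remaining high bits (the prefix) are replaced by a dictionary code, and
    the code is the prefix code concatenated with the offset bits.
    """
    mask = (1 << offset_bits) - 1
    prefixes = sorted({v >> offset_bits for v in values})
    prefix_dict = {p: code for code, p in enumerate(prefixes)}
    codes = [(prefix_dict[v >> offset_bits] << offset_bits) | (v & mask)
             for v in values]
    return codes, prefix_dict


# Example usage with small, made-up inputs.
dict_codes, dictionary = dictionary_encode([1000, 1003, 1000, 1010])
base, offsets = offset_encode([1000, 1003, 1000, 1010])
po_codes, prefix_dict = prefix_offset_encode([0x1203, 0x1207, 0x5601], 8)
```

In each case the encoded values need fewer bits than the originals: the dictionary codes and offsets above fit in a few bits each, and the prefix-offset codes replace an 8-bit prefix with a 1-bit dictionary code while keeping the 8 offset bits unchanged.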
One of the most important criteria for compression efficiency is the average code length, which is the total size of the compressed data divided by the number of values in it. One way of achieving better compression efficiency, i.e. a smaller average code length, is to encode values that occur with a higher probability using shorter codes.
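As a hedged numerical illustration of this criterion, the following Python sketch computes the average code length for a small, made-up data set under two hypothetical code assignments. The function name `average_code_length`, the sample values, and the code lengths are assumptions chosen for the example only.

```python
def average_code_length(values, code_length_bits):
    """Total compressed size in bits divided by the number of values."""
    total_bits = sum(code_length_bits[v] for v in values)
    return total_bits / len(values)


# Made-up data set: "a" occurs far more often than "b" or "c".
values = ["a"] * 6 + ["b"] + ["c"]

# Fixed-length assignment: every value gets a 2-bit code.
fixed = {"a": 2, "b": 2, "c": 2}

# Variable-length assignment: the most frequent value gets the shortest code
# (for example the prefix-free codes a=0, b=10, c=11).
variable = {"a": 1, "b": 2, "c": 2}

fixed_avg = average_code_length(values, fixed)
variable_avg = average_code_length(values, variable)
```

Under these assumptions the fixed-length assignment yields 2 bits per value, while giving the frequent value a 1-bit code lowers the average to 1.25 bits per value, which is why occurrence information for the data values matters when choosing codes.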