Enterprise data is growing rapidly, and enterprises are looking for ways to use big data to gain a competitive advantage. Databases have traditionally used data compression models to make data easily consumable. In general, the major barrier in data compression is that complexity grows with the compression ratio: a higher compression ratio often means that the data is more difficult to decompress. An optimal compression technique would provide a high compression ratio without consuming excessive computing resources when the data is queried for analytics.
Many conventional compression algorithms, such as null suppression, Huffman coding, and the like, may provide compression rates of 50 to 85%, but may be complex, and decompressing the data may be a time-intensive process. Nonetheless, these techniques may be efficient in reducing I/O (input/output) overhead and hence may be suitable for I/O-bound applications. In contrast, for more lightweight compression techniques, such as dictionary compression, run-length encoding, delta encoding, and the like, the I/O benefits of compression may substantially outweigh the associated processing costs.
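By way of illustration only, two of the lightweight techniques named above, run-length encoding and delta encoding, can be sketched as follows. The function names and sample values are hypothetical and are not taken from any particular database system; the sketch is meant only to show why decoding such formats tends to be cheap.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal adjacent values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, run_length in pairs:
        out.extend([value] * run_length)
    return out


def delta_encode(values):
    """Keep the first value, then store each later value as its difference
    from the value before it."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original sequence with a running sum over the deltas."""
    out = []
    total = 0
    for delta in deltas:
        total += delta
        out.append(total)
    return out


# Hypothetical sample data: a column with repeated values and a slowly
# increasing numeric column.
countries = ["DE", "DE", "DE", "US", "US", "FR"]
order_ids = [1000, 1002, 1003, 1007]

encoded_countries = rle_encode(countries)
encoded_order_ids = delta_encode(order_ids)
```

In this sketch, decoding is a single linear pass with no lookup tables or bit-level work, which is consistent with the low processing cost attributed to such techniques above.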
Compression techniques have evolved from row-based approaches to column-based schemes, because column stores have natural redundancy due to recurring data values or patterns. Such techniques make decompression or consumption of data relatively easy. Column stores can achieve a higher compression ratio than traditional row-oriented database systems. This may result in reduced storage needs, improved performance (for I/O-intensive applications), and a higher buffer pool/cache hit rate.
In the particular case of dictionary encoding, the values of a column are encoded as integers. Thus, a check for equality during scan or join operations can be executed on integers, which may be much faster than comparing string values. Furthermore, the dictionary-encoded attribute vectors may be compressed using various techniques, such as prefix encoding, run-length encoding, cluster encoding, and the like. In addition, the dictionaries themselves may be compressed through methods such as delta encoding.
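As a minimal sketch of the dictionary encoding described above, the following assumes a sorted dictionary of distinct values and an attribute vector of integer codes. The names `dictionary_encode` and `scan_equals` and the sample column are hypothetical; an actual column store would typically bit-pack the codes and apply further compression.

```python
from bisect import bisect_left


def dictionary_encode(column):
    """Return (dictionary, attribute_vector) for a column of values.

    The dictionary is the sorted list of distinct values. The attribute
    vector holds, for each row, the integer position of that row's value
    in the dictionary.
    """
    dictionary = sorted(set(column))
    code_of = {value: code for code, value in enumerate(dictionary)}
    attribute_vector = [code_of[value] for value in column]
    return dictionary, attribute_vector


def scan_equals(dictionary, attribute_vector, needle):
    """Return the row positions whose value equals `needle`.

    The search value is translated to its integer code once, using a
    binary search on the sorted dictionary; the scan itself then compares
    integers only.
    """
    code = bisect_left(dictionary, needle)
    if code == len(dictionary) or dictionary[code] != needle:
        return []  # the value does not occur in the column
    return [row for row, c in enumerate(attribute_vector) if c == code]


# Hypothetical sample column.
cities = ["Berlin", "Paris", "Berlin", "Rome", "Paris", "Berlin"]
dictionary, attribute_vector = dictionary_encode(cities)
berlin_rows = scan_equals(dictionary, attribute_vector, "Berlin")
```

Under these assumptions, the string comparison cost is paid once per query during the dictionary lookup rather than once per row, and the resulting integer attribute vector is the structure to which prefix, run-length, or cluster encoding could then be applied.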