Embodiments of the present invention are concerned with providing a computer-implemented method, a computer program product and a computer system that facilitate efficient data compression and subsequent data storage in a columnar database.
In modern relational database management systems, data is typically stored in compressed form in order to optimize the use of the available storage space, i.e. to maximize the volume of data that can be stored in the database. To this end, well-known compression or encoding algorithms, such as Huffman coding and the Lempel-Ziv family of algorithms (e.g. LZ77 and LZ78), are used to compress (encode) the data to be stored in the database.
The factor by which the uncompressed data is compressed is sometimes referred to as the compression ratio of the data. A higher compression ratio corresponds to a more effective compression of the data. Therefore, it is desirable to maximize the compression ratio of the data when compressing the data for storage into the database.
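By way of non-limiting illustration, the compression ratio may be computed as in the following sketch. The sketch assumes the common definition of the ratio as uncompressed size divided by compressed size, and uses Python's standard zlib module (a DEFLATE implementation combining LZ77 and Huffman coding) purely as a stand-in compressor. The function name compression_ratio and the sample data are hypothetical and are not taken from any particular database system.

```python
import zlib


def compression_ratio(raw: bytes, compressed: bytes) -> float:
    """Assumed definition: uncompressed size divided by compressed size."""
    return len(raw) / len(compressed)


# Hypothetical column content with many recurring entries.
raw = b"Berlin,Paris,Berlin,Berlin,Rome," * 200

# zlib is used here only as an example of a well-known compressor.
compressed = zlib.compress(raw)

ratio = compression_ratio(raw, compressed)
print(f"{len(raw)} bytes -> {len(compressed)} bytes, ratio {ratio:.1f}")
```

Under this definition, a ratio greater than 1 indicates that the compressed data occupies less storage space than the uncompressed data, and a larger value indicates a more effective compression.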
Data compression is typically achieved by building a compression dictionary for the data, in which particular data strings are represented by particular bit patterns. In order to achieve a high compression ratio, short bit patterns are typically assigned to frequently recurring data strings in the data, with longer bit patterns used for less frequently recurring data strings. This commonly requires evaluation of the full dataset to be stored in the database in order to determine the recurrence frequency of the various data strings, e.g. data entries, in the dataset. This can be a time-consuming exercise, which can account for up to 40% of the overall process of loading the data into the database. Such a large overhead can be undesirable, for example from a performance perspective.
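The frequency-based dictionary construction described above may be sketched, by way of example only, using classical Huffman coding. The sketch is an illustrative assumption and does not represent the dictionary format of any particular database system. The function name build_dictionary and the sample column are hypothetical. It should be noted that the frequency count requires a pass over the full set of entries before any entry can be encoded, which is the overhead referred to above.

```python
import heapq
from collections import Counter
from itertools import count


def build_dictionary(entries):
    """Map each distinct entry to a bit pattern (Huffman coding).

    Frequently recurring entries receive shorter bit patterns.
    """
    freq = Counter(entries)  # requires a full pass over the dataset
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct entry still needs a one-bit pattern.
        return {next(iter(freq)): "0"}

    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(n, next(tie), {entry: ""}) for entry, n in freq.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        n1, _, codes1 = heapq.heappop(heap)
        n2, _, codes2 = heapq.heappop(heap)
        merged = {e: "0" + c for e, c in codes1.items()}
        merged.update({e: "1" + c for e, c in codes2.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))

    return heap[0][2]


# Hypothetical column: "Berlin" recurs most often, "Rome" least often.
column = ["Berlin"] * 6 + ["Paris"] * 3 + ["Rome"] * 1
dictionary = build_dictionary(column)
encoded = "".join(dictionary[entry] for entry in column)
print(dictionary, len(encoded), "bits")
```

In this sketch the most frequent entry is assigned a one-bit pattern and the two less frequent entries are assigned two-bit patterns, so that the ten entries of the sample column are encoded in 14 bits.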