Data compression, commonly used in database systems, provides multiple benefits, including reduced disk and memory requirements and reduced data transfer bandwidth. Compression also exposes properties of the data that can be used to execute queries more efficiently by processing compressed data without decompressing it. Different solutions for such compressed execution have been proposed, but most focused on different forms of data encoding that reduce the processing time for individual records. As such, they did not exploit the opportunities provided by databases that work on multiple records in one processing stage, processing data at a vector or block granularity, as discussed in Marcin Zukowski, Balancing Vectorized Query Execution with Bandwidth-Optimized Storage, PhD thesis, Universiteit van Amsterdam, 2009 (hereinafter “Zuk09”). The only well-known method that operates on multiple compressed records with identical values at the same time is described in an article by Daniel J. Abadi, Samuel Madden, and Miguel Ferreira, “Integrating Compression and Execution in Column-Oriented Database Systems,” in Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, 2006. This article focused on processing data that was compressed with many different techniques. The article described the use of special objects, each representing a portion of compressed data (for example, in the case of run-length encoding, a single run, that is, a sequence of identical consecutive values; in the case of dictionary compression, a single encoded value and a dictionary handle). The article then described re-implementing the database kernel so that it would use methods of these objects instead of accessing raw data. One of the compression methods used was run-length encoding (RLE), which, in some situations, allowed operating on an entire run of consecutive identical values at the same time.
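By way of illustration only, the idea of operating on an entire run at once can be sketched as follows. This is a minimal sketch and not the implementation of the cited article; the function names `rle_encode`, `sum_raw`, and `sum_rle` and the sample column are hypothetical and are introduced here solely for the example.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive identical values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def sum_raw(values):
    """Aggregate record by record: one addition per record."""
    total = 0
    for value in values:
        total += value
    return total


def sum_rle(runs):
    """Aggregate run by run: one multiply-add per run, whatever its length."""
    total = 0
    for value, run_length in runs:
        total += value * run_length
    return total


# Hypothetical sample column with three runs.
column = [7, 7, 7, 7, 3, 3, 9, 9, 9]
runs = rle_encode(column)
```

In this sketch, `sum_rle(runs)` visits three runs where `sum_raw(column)` visits nine records, and both return the same total. The saving grows with the run length and disappears when every run has length one.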
The solution proposed by Abadi et al. was mostly intended for systems that process data tuple-at-a-time. In order to process RLE-compressed data in a similar fashion in a vector-at-a-time system, the system would need to represent the data in vectors as collections of runs of RLE-compressed data. Converting all of the operators to work on such collections would be highly cumbersome and would require major adaptation of the database engine. Moreover, in the case of RLE-compressed data, the solution of Abadi et al. can be inefficient for short runs, because the per-run overhead is amortized over only a small number of tuples. This significantly increases the per-record cost (compared to operating directly on “raw” data) and leaves little opportunity for performance improvement. Hence, if RLE-compressed data has a large variation in run lengths, any speedup achieved for long runs might be offset by the slowdown for short runs. These overheads are especially visible in block-oriented processing systems, where most other per-record overheads are eliminated. This means that the overheads of RLE-compressed execution might reduce or eliminate the benefits of block- or vector-oriented processing.
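The amortization argument above can be made concrete with a simple cost model. The following is a sketch under stated assumptions, not a measurement: the function name `rle_cost_per_record` and the constants `RUN_OVERHEAD` and `RAW_COST` are hypothetical, expressed in arbitrary cost units, and chosen only to show the shape of the effect.

```python
def rle_cost_per_record(run_lengths, per_run_overhead):
    """Average per-record cost when each run incurs a fixed overhead.

    Every run pays per_run_overhead once, so the total cost is spread
    over the number of records that the runs cover.
    """
    total_records = sum(run_lengths)
    return per_run_overhead * len(run_lengths) / total_records


# Hypothetical cost units, chosen for illustration only.
RUN_OVERHEAD = 5.0  # assumed fixed cost of handling one run object
RAW_COST = 1.0      # assumed cost of processing one uncompressed record

long_runs = rle_cost_per_record([1000, 1000], RUN_OVERHEAD)
short_runs = rle_cost_per_record([1, 1, 1, 1], RUN_OVERHEAD)
mixed_runs = rle_cost_per_record([1000] + [1] * 1000, RUN_OVERHEAD)
```

Under these assumed constants, long runs cost far less per record than `RAW_COST`, runs of length one cost `RUN_OVERHEAD` per record, and a column that mixes one long run with many single-value runs ends up more expensive per record than raw processing, which is the situation described above.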
Thus, it is desirable to provide a system and method that allow exploitation of identical-value runs, as in RLE compression, without incurring significant overhead during data processing, and it is to this end that the disclosure is directed.