Dictionary encoding is a technique used to compress data and reduce memory accesses within data storage and management systems. According to this technique, strings or other data items are replaced with smaller tokens, referred to as dictionary codes. Each dictionary code maps to a corresponding entry in a data structure (the “dictionary”), where the replaced data item is stored. For example, a database table may have multiple recurring string values for a particular attribute/column. Instead of storing a separate instance of a string value each time it appears in a column, a single instance of the string value may be stored in the dictionary. Each occurrence of the string value in the column may then be replaced with a dictionary code identifying the position of the string value within the dictionary. Accordingly, the overhead associated with storing and accessing the table may be significantly reduced, especially where large string values recur frequently.
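By way of illustration only, the replacement of recurring values with dictionary codes may be sketched as follows. The function names `dictionary_encode` and `dictionary_decode` and the sample column values are hypothetical and are not drawn from any particular system; the sketch assumes that codes are assigned in order of first appearance.

```python
def dictionary_encode(values):
    """Return (dictionary, codes), where codes[i] is a position in dictionary."""
    dictionary = []   # holds a single instance of each distinct value
    positions = {}    # maps a value to its position in the dictionary
    codes = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        codes.append(positions[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Recover the original values by using each code as an index."""
    return [dictionary[code] for code in codes]


# Hypothetical column with a recurring string value.
column = ["California", "Nevada", "California", "Oregon", "California"]
dictionary, codes = dictionary_encode(column)
```

In this sketch, the string "California" is stored once in the dictionary, and each of its three occurrences in the column is represented by the same small integer code.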
Dictionary-encoded columns in columnar databases may be stored horizontally such that codes are contiguously and tightly packed together in memory. For example, if each code consumes five bits, then the first five bytes (forty bits) of a column hold exactly eight codes. In this format, there is no bit spacing or padding between the codes, and the codes are stored in contiguous order. This format allows for relatively efficient storage and retrieval of dictionary codes within the column.
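The horizontal format may be illustrated with the following minimal sketch, which packs five-bit codes back to back. The names `pack_codes` and `unpack_code`, the big-endian bit order, and the sample code values are assumptions made for illustration; an actual system may order bits differently.

```python
def pack_codes(codes, width):
    """Pack fixed-width codes contiguously, with no padding between codes."""
    bits = 0
    for code in codes:
        if not 0 <= code < (1 << width):
            raise ValueError("code does not fit in the given width")
        bits = (bits << width) | code
    total_bits = len(codes) * width
    pad = (-total_bits) % 8  # only the final byte may be padded
    return (bits << pad).to_bytes((total_bits + pad) // 8, "big")


def unpack_code(packed, index, width):
    """Extract the code at position `index` from the packed bytes."""
    shift = len(packed) * 8 - (index + 1) * width
    return (int.from_bytes(packed, "big") >> shift) & ((1 << width) - 1)


# Eight hypothetical five-bit codes occupy exactly five bytes.
codes = [3, 17, 0, 31, 8, 1, 22, 5]
packed = pack_codes(codes, 5)
```

Because a code may straddle a byte boundary, extracting a single code in this sketch requires a shift and a mask rather than a simple byte read.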
Another format for storing dictionary-encoded columns is to lay the codes out vertically in slices, where each slice is a bit-vector including bits corresponding to the same position in all codes. Vertically storing dictionary-encoded columns may allow for faster comparison scans than the horizontal format. However, certain operations may require one or more codes to be fully reconstructed so that they may be used as indexes into the dictionary to obtain the decoded value. Reconstructing dictionary codes may be an expensive operation that negates the benefits of having dictionary-encoded columns stored in this format.
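The vertical format, and the trade-off between scanning and reconstruction, may be sketched as follows. The names `to_slices`, `scan_equal`, and `reconstruct` are hypothetical, Python integers stand in for bit-vectors, and the equality scan shown is only one possible comparison; the sketch assumes that the target value fits within the code width.

```python
def to_slices(codes, width):
    """Build slices such that bit i of slices[j] is bit j of codes[i]."""
    slices = [0] * width
    for i, code in enumerate(codes):
        for j in range(width):
            if (code >> j) & 1:
                slices[j] |= 1 << i
    return slices


def scan_equal(slices, n, target):
    """Return a bit-vector with bit i set where codes[i] equals target."""
    all_rows = (1 << n) - 1
    match = all_rows
    for j, s in enumerate(slices):
        # Keep rows whose bit j agrees with bit j of the target.
        match &= s if (target >> j) & 1 else (~s & all_rows)
    return match


def reconstruct(slices, i):
    """Rebuild code i by gathering one bit from every slice."""
    code = 0
    for j, s in enumerate(slices):
        code |= ((s >> i) & 1) << j
    return code


codes = [3, 17, 0, 17, 8]  # hypothetical five-bit codes
slices = to_slices(codes, 5)
match = scan_equal(slices, len(codes), 17)
```

In this sketch, the scan touches each slice once and evaluates all rows together, whereas reconstructing even a single code requires a read from every slice.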
To overcome the drawbacks of the vertical storage format, one approach is to maintain two separate copies of the column: one with vertically-stored dictionary codes and another with horizontally-stored codes. During scan operations, the vertically-stored dictionary codes are used, and during index-based operations, the horizontally-stored codes are accessed. While this approach achieves faster scans without the need to reconstruct the codes for index-based operations, maintaining two copies of a column significantly increases storage and maintenance requirements for the column.
In other approaches, a single instruction, multiple data (SIMD) architecture is used to evaluate expressions on dictionary codes. According to such approaches, SIMD instructions are used to extract multiple dictionary codes into separate partitions (also referred to as elements) within SIMD registers. With a single instruction, all values within a SIMD register may be compared with a target value. Thus, data-level parallelism may be exploited to significantly decrease the processing time associated with performing scans and other operations involving codes from the dictionary. However, current SIMD instruction sets do not operate easily at the bit level, as SIMD registers are partitioned at the byte level. Separating codes whose widths are not a whole number of bytes into SIMD register elements may result in added expense and wasted resources. For example, each five-bit code may be deposited into a sixteen-bit element in a SIMD register, resulting in eleven of the sixteen bits of each element being “wasted” during processing. Thus, the SIMD registers are not fully utilized during operations on dictionary codes, such as scan and index-based operations.
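The under-utilization described above may be quantified with the following short sketch. The function name `lane_usage` and the 128-bit register width are assumptions chosen for illustration; the five-bit code width and sixteen-bit element width are taken from the example above.

```python
def lane_usage(code_bits, lane_bits, register_bits):
    """Summarize how much of a SIMD register carries code bits."""
    if code_bits > lane_bits:
        raise ValueError("code does not fit in a single element")
    lanes = register_bits // lane_bits
    return {
        "codes_per_register": lanes,                    # one code per element
        "wasted_bits_per_lane": lane_bits - code_bits,  # unused bits per element
        "used_bits": lanes * code_bits,                 # bits that carry code data
        "packed_capacity": register_bits // code_bits,  # codes if tightly packed
    }


# Five-bit codes in sixteen-bit elements of an assumed 128-bit register.
usage = lane_usage(5, 16, 128)
```

Under these assumptions, the register carries eight codes occupying forty of its 128 bits, whereas the same register could hold twenty-five tightly packed five-bit codes.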
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.