Database systems typically store database objects (e.g. tables, indexes, etc.) on disk, and load data items from those database objects into volatile memory on an as-needed basis. Once loaded into volatile memory, the data items may remain cached in volatile memory so that subsequent accesses to the same data items will not incur the overhead of accessing a disk. Those data items may be replaced in cache, for example, to make room in volatile memory to store other data items that have been requested.
Rather than load individual data items on a per-item basis, entire database objects, or portions thereof, may be pre-loaded into volatile memory. Various approaches for loading entire database objects, or selected portions thereof, into volatile memory to speed up database operation processing are described in U.S. Pat. No. 9,292,564, entitled “Mirroring, In Memory, Data From Disk To Improve Database operation Performance”, the application for which was filed Jul. 21, 2014, referred to herein as the “Mirroring” application, the contents of which are incorporated herein in their entirety.
According to the approaches described in the Mirroring application, database objects, or portions thereof, are stored in volatile memory in a different format than the format that those same objects have on disk. For example, the in-memory copies of the objects may be stored in a column-major format, while the on-disk copies are stored in a row-major format. The format used by the on-disk copies is referred to as the “persistent format”, while the format used by the in-memory copies is referred to as the “mirror format”. An in-memory version or copy of an object (or selected portions thereof), is referred to herein as an In-Memory Columnar Unit (IMCU).
To reduce the amount of volatile memory consumed by IMCUs, the mirror format is often compressed. One way to compress an IMCU is through the use of dictionary encoding. For example, assume that an IMCU stores a contiguous array of values from a column c1 of a table T1. Such an array is referred to herein as a “column vector”. To reduce the volatile memory required to store such a column vector, the actual values from the table may, within the column vector, be replaced with values that take less space (“codes”). For example, a character string of 30 characters may be replaced by a four-byte code. The column vector produced by making such replacements is significantly smaller than a column vector that contains the original values.
While dictionary encoding significantly reduces the size of column vectors, it is necessary, when processing database commands based on information from the column vectors, to be able to reproduce the actual values of the column. Therefore, a dictionary is maintained in conjunction with a column vector. The dictionary associated with a column vector contains a replacement-value-to-actual-value mapping for each unique value in the original set of values. For example, assume that column c1 stores the names of states, and that table T1 has one million rows. The dictionary for the compressed column vector for c1 would have one entry for each unique state name that is present in c1. Since there are only 50 states, the dictionary for the compressed column vector for c1 would have at most 50 entries.
The process of encoding a column vector typically involves (a) reading the actual values from the target column, (b) eliminating duplicates to create a list of unique values, (c) and (d) assigning a code to each unique value. Once a code has been assigned to each unique value, the values within a column vector are replaced with the corresponding codes. Because the codes are smaller than the values the codes replace, the column vectors that contain the codes are effectively “compressed”.
Decompressing a compressed column is referred to as “materializing” the column. Decompressing a column vector that was compressed using dictionary encoding involves performing code-to-value look-ups using the dictionary by which the column vector was encoded. To speed up code-to-value look-up operations, the position of the unique value's entry in the dictionary may be treated as the code for that unique value. For example, the value associated with the first dictionary entry may be assigned a code of “1”, the value associated with the second dictionary entry may be assigned a code of “2”, etc. To illustrate how assigning codes that correspond to the entry positions within the dictionary reduces the time required for code-to-value look-up operations, assume that the database server needs to determine the value that corresponds to the code “7”. To determine the value, the database server can simply use “7” as an index into the dictionary to locate the seventh dictionary entry. The 7th dictionary entry contains the actual value, from c1, that corresponds to the code “7”.
To speed up value-to-code look-up operations, the values are stored in the dictionary in sorted order. By storing the values in the dictionary in sorted order, the code for a target value may be ascertained by performing a binary search on the dictionary. Upon finding the target value within an entry in the dictionary, the database server simply uses the position, within the dictionary, of the entry that has the target value, as the code for the target value. Thus, if the binary search finds target value in the 7th entry of the dictionary, then the code for the target value is “7”.
Unfortunately, it is often impractical to have a single column vector for a column. For example, if column c1 has several million rows, even the compressed column vector for the column may be too large to be efficiently stored in a single IMCU. Therefore, it is common for one column of a table to be divided up into multiple column vectors, each of which covers a range of rows from the actual table. For example, assume that column c1 has 3000 rows. Column c1 may be mirrored in volatile memory using three column vectors CV1, CV2 and CV3, where CV1 covers the values of c1 from row 1 to row 1000, CV2 covers values of c1 from row 1001 to row 2000, and CV3 covers values of c1 from row 2001 to row 3000. Each of these column vectors may be stored, within the volatile memory of the same node, in separate IMCUs. For the purpose of explanation, it shall be assumed that the IMCUs to which column vectors CV1, CV2 and CV3 belong are IMCU1, IMCU2, and IMCU3, respectively.
In the scenario described above, because each of column vectors CV1, CV2 and CV3 is in a separate IMCU, each of column vectors would be encoded using a different dictionary. For the purpose of explanation, the dictionary used to encode column vectors CV1, CV2 and CV3 shall be referred to as DCT1, DCT2 and DCT3, respectively. DCT1, DCT2, and DCT3 are IMCU-specific. Specifically, the dictionaries are not interchangeable because, for example, column c1 of rows 1-1000 (CV1) may have some values that column c1 of rows 1001-2000 (CV2) does not have, and visa-versa. Thus, DCT1 would have entries for values for which DCT2 would not have entries, and visa-versa. Further, even for the values that are common to both CV1 and CV2, the codes to which those common values are mapped may be different. For example, DCT1 may map “California” to code “2”, while DCT2 maps “California” to code “4”.
Often, database commands require processing all rows of a table based on values in a specified column, not simply those rows that belong to a single column vector. For example, a query may have the predicate “where c1=‘California’ or ‘Maine’”. When processing such commands, the database server creates various temporary data structures based on the mappings in DCT1 to apply the predicate “where c1=‘California’ or ‘Maine’” to the values in CV1. Then, those temporary data structures are discarded, and new temporary data structures are created based on the mappings in DCT2 to apply the predicate “where c1=‘California’ or ‘Maine’” to values in CV2. Finally, those temporary data structures are discarded, and new temporary data structures are created based on the mappings in DCT3 to apply the predicate “where c1=‘California’ or ‘Maine’” to values in CV3.
Not only is the rebuilding of those temporary data structures time consuming, but the dictionaries themselves may be highly redundant. For example, DCT1, DCT2 and DCT3 may each have entries for all or nearly all of the 50 states. The redundant storage of the same values (which can be quite large) in multiple dictionaries can consume a significant amount of volatile memory. For example, a significant amount of memory would be consumed by five IMCU-specific dictionaries, where each of the five IMCU-specific dictionaries established a different mapping for nearly the same set of one million large unique values.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.