Database tables include several values for each database record. Storing these values typically consumes large amounts of memory (e.g., disk-based storage and/or Random Access Memory). The memory required to store the values may be reduced by storing smaller value IDs instead of the values themselves. To facilitate such storage, a dictionary is used which maps values to value IDs. Each unique value in the dictionary is associated with one unique value ID. Therefore, when a particular value is to be stored in a database record, the value ID for the value is determined from the dictionary and the value ID is stored in the record instead.
The dictionary can be represented as a vector or radix tree of values, where each vector element/radix tree leaf entry at position i contains the value corresponding to value ID i. Before adding a new value to the dictionary, it must be ensured that the new value is not already present in the dictionary. However, linearly scanning all values in the dictionary will scale poorly as the dictionary grows. A secondary structure, or dictionary index, may be used to check for duplicates. The dictionary index may be, for example, a hash map or tree-based map from value to value ID.
For single-threaded encoding, the dictionary index is checked for the existence of the value and, if found, its value ID is returned. If the value is not found in the dictionary index, the value is appended to the dictionary vector, and a mapping from the value to its new position in the dictionary vector is inserted into the dictionary index. That position serves as the new value ID, which is returned.
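The single-threaded procedure described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes a Python list for the dictionary vector and a built-in hash map (`dict`) for the dictionary index, and the names `Dictionary`, `encode`, and `decode` are hypothetical.

```python
class Dictionary:
    """Illustrative dictionary: a vector of values plus a value -> ID index."""

    def __init__(self):
        self.values = []  # dictionary vector: values[i] is the value for value ID i
        self.index = {}   # dictionary index: maps each value to its value ID

    def encode(self, value):
        """Return the value ID for value, adding the value if it is new."""
        value_id = self.index.get(value)
        if value_id is not None:
            return value_id              # value already present
        value_id = len(self.values)      # next free position is the new value ID
        self.values.append(value)
        self.index[value] = value_id
        return value_id

    def decode(self, value_id):
        """Return the value stored at position value_id."""
        return self.values[value_id]


# Encode a small column of values into value IDs.
dictionary = Dictionary()
column = ["Berlin", "Paris", "Berlin", "Rome", "Paris"]
encoded = [dictionary.encode(v) for v in column]
```

In this sketch each distinct value is stored once in `dictionary.values`, and the column itself holds only the integer value IDs in `encoded`.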
For parallel encoding, a lock can be taken to protect the dictionary during dictionary encoding. This lock becomes computationally expensive when several threads try to access the same dictionary, because the threads contend for the lock and their accesses are serialized. Improved lock-free parallel dictionary encoding systems are desired.
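The lock-based approach can be sketched as below, to make the cost concrete. This is a minimal sketch under stated assumptions: a single mutex (`threading.Lock`) guards both the dictionary vector and the dictionary index, and the names `LockedDictionary` and `encode_chunk`, as well as the sample data, are hypothetical. It illustrates the conventional locked scheme, not the lock-free improvement that is desired.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedDictionary:
    """Illustrative dictionary whose lookup-and-insert is guarded by one lock."""

    def __init__(self):
        self.values = []               # dictionary vector
        self.index = {}                # dictionary index: value -> value ID
        self._lock = threading.Lock()  # protects both structures

    def encode(self, value):
        # Every call takes the lock, even for values that are already present,
        # so concurrent encoders run this section one at a time.
        with self._lock:
            value_id = self.index.get(value)
            if value_id is None:
                value_id = len(self.values)
                self.values.append(value)
                self.index[value] = value_id
            return value_id


def encode_chunk(dictionary, chunk):
    """Encode one chunk of values against the shared dictionary."""
    return [dictionary.encode(v) for v in chunk]


# Four worker threads encode overlapping values into the same dictionary.
shared = LockedDictionary()
chunks = [[f"v{(i * 7 + t) % 50}" for i in range(1000)] for t in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda c: encode_chunk(shared, c), chunks))
```

The lock keeps the check for an existing value and the insertion of a new value atomic, so no value receives two IDs. The price is that all threads queue on the same lock for every value they encode.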