A data storage system may store many billions of data objects. One challenge in creating such a system is to provide a way to locate and retrieve data objects efficiently. Typically each data object has a unique identifier that can be used to locate and retrieve the data object using some mapping from the unique identifier to the data object's location on the data storage system. One particularly good data structure for creating and maintaining such a mapping is a hash table.
A hash table is a data structure, such as an array, that is made up of a number of indexed elements (also known as slots or buckets). A data storage system using a hash table maps “keys” to corresponding “values” by applying a mathematical algorithm (a hash function) to the keys to calculate an “index” of the element in the hash table where the corresponding value should be located. Exemplary hash functions include the Secure Hash Algorithms (e.g., SHA-256). A key and its corresponding value are referred to as a “key/value pair.” In a data storage system, the key may be the unique identifier of a particular data object stored in the data storage system, and the value may be the location of the particular data object on the data storage system or information, such as a pointer, used to derive that location. In such systems, performing a hash function on the unique identifier of a data object calculates the index in the hash table where the system may find the location (e.g., a memory address) of that data object so that it can be retrieved.
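The key-to-index calculation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name slot_index, the parameter num_slots, the choice to keep the first 8 bytes of the digest, and the reduction by modulo are all hypothetical and are not taken from the text, which does not say how a hash result is reduced to an index.

```python
import hashlib


def slot_index(key: bytes, num_slots: int) -> int:
    """Map a key to an index in the range [0, num_slots).

    Assumption: the index is derived from the first 8 bytes of the
    SHA-256 digest, reduced modulo the number of elements in the table.
    """
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_slots


# Hypothetical unique identifier of a stored data object.
index = slot_index(b"object-0001", 1024)
print(index)  # always the same index for the same key and table size
```

Because the hash function is deterministic, the same identifier always yields the same index, which is what allows the system to find the element again at look up time.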
An element in a hash table may be empty or may contain some value indicating that it is empty (e.g., null or 0), meaning that there is no key/value pair using that location in the hash table. Otherwise, an element in a hash table may contain an entry or reference an entry. In “closed hashing” or “open addressing” systems, each element provides information about a single key/value pair. For example, in a closed hashing system, the element may contain a value. Adding entries to the elements of a hash table is referred to herein as “inserting” entries.
Locating and evaluating an element in the hash table may be referred to as “probing.” Systems generally evaluate whether another entry has been previously inserted into an element before inserting a new entry into that element because collisions may occur. Collisions may occur because the hash function can return the same index for two keys, and thus the system may attempt to insert a second entry into the same element in a hash table. Ideally, hash functions map each possible key to a different element in the hash table, but this is not always achievable. Although better hash functions minimize the number of collisions, when collisions do occur, the system may perform a collision resolution technique (or collision policy) to accommodate collisions.
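A small sketch can show why collisions cannot always be avoided: whenever there are more keys than elements, at least two keys must share an index. The helper names (slot_index, count_per_slot), the key format, and the table size of 8 are assumptions chosen purely for illustration.

```python
import hashlib
from collections import Counter


def slot_index(key: bytes, num_slots: int) -> int:
    # Assumed reduction: first 8 bytes of the SHA-256 digest, modulo table size.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_slots


def count_per_slot(keys, num_slots: int) -> Counter:
    """Count how many keys hash to each index."""
    return Counter(slot_index(k, num_slots) for k in keys)


# 20 hypothetical keys mapped onto only 8 elements: some index must repeat.
keys = [f"object-{i}".encode() for i in range(20)]
counts = count_per_slot(keys, 8)
collided = {slot: n for slot, n in counts.items() if n > 1}
print(collided)  # indices that received more than one key
```

Collisions can also occur in a table with far more elements than keys; the oversubscribed table here simply guarantees that one is observed.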
Using a collision policy, when an element of the hash table corresponding to the calculated index has a previously inserted entry (or when the element contains an entry that is non-zero, non-null, or not empty depending on the implementation), the system probes the hash table to find an element without a previously inserted entry or with a zero, null, or empty entry (an “unused element”). When using linear probing, the system probes the elements in the hash table sequentially so that if the next sequential element is unused, the new entry is inserted into the next sequential element. Other collision resolution techniques include quadratic probing and double hashing.
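Insertion with linear probing, as described above, might look like the following minimal sketch. Several details are assumptions rather than anything stated in the text: the names EMPTY, slot_index, and insert are hypothetical, each element stores the key alongside the value, probing wraps around to the start of the table, and re-inserting an existing key overwrites its entry.

```python
import hashlib

EMPTY = None  # assumed marker for an unused element


def slot_index(key: bytes, num_slots: int) -> int:
    # Assumed reduction: first 8 bytes of the SHA-256 digest, modulo table size.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_slots


def insert(table: list, key: bytes, value) -> int:
    """Insert a key/value pair using linear probing; return the index used."""
    n = len(table)
    start = slot_index(key, n)
    for step in range(n):
        i = (start + step) % n  # next sequential element, wrapping around
        if table[i] is EMPTY or table[i][0] == key:
            table[i] = (key, value)
            return i
    raise RuntimeError("hash table is full")


table = [EMPTY] * 8
for name in (b"object-A", b"object-B", b"object-C"):
    used = insert(table, name, "location-of-" + name.decode())
    print(name, "->", used)
```

Quadratic probing and double hashing would differ only in how the next index is derived from start and step.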
One benefit to using a hash table data structure is that the look up time (the time to locate the value when the key is known) is generally constant regardless of the number of elements in the hash table. Therefore, a hash table implementation remains efficient even when the number of elements grows extremely large, making hash tables especially useful in large data storage systems for looking up the locations of the individual stored data objects. However, because a collision may have occurred during the insertion of entries, the system may have inserted an entry into an element in accordance with the collision policy (rather than into the element corresponding to the index resulting from the hash function). Thus, when performing look up operations based on an index calculated from a hash function on a key, the entry at the calculated index may not belong to the key. The system must therefore perform additional steps to ensure that it has found the correct entry.
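One possible form of these additional steps is to store the key with each entry and compare it during look up, continuing to probe until the keys match or an unused element is reached. The sketch below assumes linear probing with wrap-around and no deleted entries; the names slot_index and lookup, and the sample keys and locations, are hypothetical.

```python
import hashlib


def slot_index(key: bytes, num_slots: int) -> int:
    # Assumed reduction: first 8 bytes of the SHA-256 digest, modulo table size.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_slots


def lookup(table: list, key: bytes):
    """Return the value stored for key, or None if the key is absent."""
    n = len(table)
    start = slot_index(key, n)
    for step in range(n):
        entry = table[(start + step) % n]
        if entry is None:      # unused element: the key was never inserted
            return None
        if entry[0] == key:    # stored key matches: this is the correct entry
            return entry[1]
    return None                # every element probed without a match


# Simulate a collision by hand: an unrelated entry occupies the element at
# the calculated index, so the wanted entry sits in the next element.
table = [None] * 8
home = slot_index(b"object-A", 8)
table[home] = (b"other-key", "location-of-other")
table[(home + 1) % 8] = (b"object-A", "location-of-A")

print(lookup(table, b"object-A"))  # location-of-A
print(lookup(table, b"missing"))   # None
```

Without the key comparison, the look up for object-A would have returned the location belonging to the other key.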
Hash tables may be stored in-memory (e.g., in RAM), but larger hash tables may require storage on-disk because on-disk storage costs less per unit of capacity. However, storing hash tables on-disk can be undesirable when considering throughput because disk I/O operations are more costly than memory I/O operations.