In various computer applications, particularly those requiring processing of a large number of data elements (e.g., hundreds, thousands, or even millions of records), hash tables are typically used to store and/or access the data elements (collectively called data). In general, hash tables provide faster, more efficient access to the data than storing and searching the data in simple structures such as arrays. The origins of hash tables date back to the 1960s, when authors of compilers used the concept of hashing as a mechanism for maintaining lists of user-defined names and associated properties. These initial developments led to the formalization of several classic techniques used today to implement hash tables, including separate chaining and open addressing (linear probing, quadratic probing, and double hashing).
A hash table is usually implemented using a hash function, which receives a key, i.e., an attribute of the data element to be stored or accessed. The hash function generates an index (e.g., a number) into the hash table corresponding to the received key. The data element is then stored in a “bucket” (e.g., a set of memory locations) identified by the index. The hash function may generate the same index for more than one key, each key corresponding to a different data element (a collision); as a result, more than one data element may be stored in one bucket. The time required to store a data element in a bucket, or to access the data element from a bucket, depends in part on the number of data elements in that bucket: in general, the more elements the bucket holds, the longer the storage and/or access time.
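The bucket-and-collision mechanism can be illustrated with a minimal separate-chaining sketch. It assumes Python's built-in `hash()` reduced modulo the bucket count as the hash function; the class and method names are hypothetical. Each bucket is a list, so a lookup scans that list, and its cost grows with the number of colliding elements.

```python
from typing import Any, Hashable, List, Tuple


class ChainedHashTable:
    """Minimal separate-chaining hash table (illustrative sketch)."""

    def __init__(self, num_buckets: int = 8) -> None:
        # Each bucket holds a list of (key, value) pairs.
        self.buckets: List[List[Tuple[Hashable, Any]]] = [[] for _ in range(num_buckets)]

    def index(self, key: Hashable) -> int:
        # Hash function: map the key to a bucket index in [0, num_buckets).
        return hash(key) % len(self.buckets)

    def put(self, key: Hashable, value: Any) -> None:
        bucket = self.buckets[self.index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:               # key already present: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))    # new key: chain it onto the bucket

    def get(self, key: Hashable) -> Any:
        # Linear scan of one bucket: cost grows with the bucket's length.
        for k, v in self.buckets[self.index(key)]:
            if k == key:
                return v
        raise KeyError(key)

    def remove(self, key: Hashable) -> None:
        bucket = self.buckets[self.index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return
        raise KeyError(key)


t = ChainedHashTable(8)
t.put(5, "a")
t.put(13, "b")        # hash(5) % 8 == hash(13) % 8 == 5: a collision
print(t.buckets[5])   # both elements share bucket 5
```

Because CPython hashes small integers to themselves, keys 5 and 13 both land in bucket 5 of an 8-bucket table, so a `get` for either key scans a two-element chain.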
Typically, it is desirable to configure hash tables and hash functions such that, for the keys received, the hash function generates indices that are substantially uniformly distributed over the hash table. Each bucket then holds a similar number of data elements. As a result, once the hash function identifies the bucket corresponding to a data element, that element can be stored in or accessed from the bucket relatively efficiently, compared to storing or accessing it in a bucket that contains significantly more elements than the other buckets.
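The effect of the hash function on distribution can be made concrete by counting bucket loads. The sketch below compares two illustrative hash functions, a naive modulo and a Knuth-style multiplicative hash, on structured keys (all multiples of 8) in an assumed table of 8 buckets. Neither function is taken from the source.

```python
from typing import Callable, Iterable, List

NUM_BITS = 3                  # 2**3 = 8 buckets (illustrative size)
NUM_BUCKETS = 1 << NUM_BITS


def modulo_hash(key: int) -> int:
    # Naive: keeps only the low-order bits of the key.
    return key % NUM_BUCKETS


def multiplicative_hash(key: int) -> int:
    # Knuth-style multiplicative hashing: multiply by a large odd constant
    # (close to 2**32 divided by the golden ratio) and keep the top NUM_BITS
    # bits of the 32-bit product.
    return ((key * 2654435761) & 0xFFFFFFFF) >> (32 - NUM_BITS)


def bucket_loads(keys: Iterable[int], h: Callable[[int], int]) -> List[int]:
    # Count how many keys the hash function maps to each bucket.
    loads = [0] * NUM_BUCKETS
    for k in keys:
        loads[h(k)] += 1
    return loads


keys = range(0, 8 * 1000, 8)               # 1000 keys, all multiples of 8
print(bucket_loads(keys, modulo_hash))     # every key lands in bucket 0
print(bucket_loads(keys, multiplicative_hash))
```

The naive function places all 1000 keys in one bucket, so every lookup degenerates into a 1000-element scan. The multiplicative function uses the high-order bits of the product, which depend on all bits of the key, and so spreads the same keys across several buckets.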
Various additional techniques, such as Bloom filters, d-left hashing, cache awareness, and lock-free algorithms, have also been developed recently to improve the performance (e.g., access time, memory footprint, etc.) of hash tables. Bloom filters, for example, can reduce the number of memory accesses a hash table requires when locating a data element stored therein. Two-way chaining, in which each key is hashed by two hash functions and the data element is placed in the less loaded of the two candidate buckets, can, with high probability, reduce the size of the fullest bucket in a chaining-based hash table. In doing so, it also reduces, with high probability, the maximum access time to the hash table. The “always-go-left” algorithm may further improve the performance of two-way chaining by introducing asymmetry: the table is split into two halves, and ties between equally loaded candidate buckets are broken in favor of the left half. Two-way chaining has been generalized as d-way chaining, and another hashing mechanism, called d-left hashing, was developed by combining d-way chaining with the always-go-left algorithm.
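The placement rule behind d-left hashing can be sketched as follows. The sketch assumes d subtables with one candidate bucket per subtable; a new key goes to the least-loaded candidate, and ties go to the leftmost subtable. It also assumes that salting Python's `hash()` with the subtable index is an acceptable stand-in for d independent hash functions. The class name `DLeftHashTable` is hypothetical. With d = 2, the sketch reduces to two-way chaining with always-go-left tie-breaking.

```python
from typing import Any, Hashable, List, Tuple

Bucket = List[Tuple[Hashable, Any]]


class DLeftHashTable:
    """Sketch of d-left hashing: d subtables, one candidate bucket in each."""

    def __init__(self, d: int = 2, buckets_per_subtable: int = 16) -> None:
        self.d = d
        self.n = buckets_per_subtable
        # subtables[i][j] is bucket j of subtable i.
        self.subtables: List[List[Bucket]] = [
            [[] for _ in range(self.n)] for _ in range(d)
        ]

    def _candidates(self, key: Hashable) -> List[Bucket]:
        # Assumption: hash((i, key)) stands in for d independent hash functions.
        return [self.subtables[i][hash((i, key)) % self.n] for i in range(self.d)]

    def put(self, key: Hashable, value: Any) -> None:
        candidates = self._candidates(key)
        for bucket in candidates:           # update in place if the key exists
            for idx, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[idx] = (key, value)
                    return
        # min() returns the first of equal minima, so ties go to the
        # leftmost subtable ("always-go-left").
        min(candidates, key=len).append((key, value))

    def get(self, key: Hashable) -> Any:
        for bucket in self._candidates(key):  # probe at most d buckets
            for k, v in bucket:
                if k == key:
                    return v
        raise KeyError(key)

    def max_load(self) -> int:
        # Size of the fullest bucket, the quantity d-left hashing keeps small.
        return max(len(b) for sub in self.subtables for b in sub)


table = DLeftHashTable(d=2, buckets_per_subtable=16)
for k in range(100):
    table.put(k, k * k)
print(table.get(7), table.max_load())
```

A lookup never examines more than d buckets, and balancing on insert keeps `max_load()` low. These two properties together bound the worst-case access time.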
While theoretical work on hash tables generally focuses on the algorithmic aspects of the problem (e.g., a substantially even distribution of keys over the hash table), the specific details of a hash table's implementation can play a crucial role in determining the bottom-line performance of a computer system. One important aspect of the implementation is the interaction between software and hardware resources and, specifically, the memory access patterns that arise when the hash table is heavily exercised. To minimize expensive memory accesses, it is desirable that the system be designed or tuned to maximize the cache hit ratio, an important system-performance parameter.
To this end, hash data structures that fit in hardware cache lines have been proposed. For instance, in separate chaining (e.g., d-way chaining), a hash table can be tuned so that the maximum number of collisions per entry (i.e., distinct keys that the hash function may map to the same bucket) does not exceed the number of elements that fit into a cache line. If the bucket to which a key is mapped is not in the cache, one cache line is fetched from main memory, requiring one memory access. Because the bucket is no larger than a cache line (and assuming the bucket is aligned to a cache-line boundary), the entire bucket containing the required data element (or its designated memory location in the bucket) is then available in the cache. As a result, hash-table operations such as put (i.e., store), get (i.e., access, read, etc.), and remove may require only one memory access. The d-left hashing technique has been used to bound the maximum number of collisions per hash-table entry so that a bucket fits into a single cache line. Bloom filters can additionally avoid memory accesses for get and remove operations invoked on elements that are not in the hash table.
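The two ideas above, buckets sized to one cache line and a Bloom filter checked before the bucket, can be modeled as follows. Python cannot control memory layout, so cache lines are only simulated here. The 64-byte line and 16-byte entry sizes are assumptions, and the names `CacheLineHashTable` and `BloomFilter` are hypothetical. The sketch's `put()` simply reports failure when a bucket is full, whereas a real design would rely on d-left placement or a fallback to make overflows rare.

```python
import hashlib
from typing import Any, Hashable, Iterator, List, Optional, Tuple

CACHE_LINE_BYTES = 64  # assumption: a common cache-line size
ENTRY_BYTES = 16       # assumption: 8-byte key plus 8-byte value or pointer
SLOTS_PER_BUCKET = CACHE_LINE_BYTES // ENTRY_BYTES  # 4 entries per bucket


class BloomFilter:
    """Bit array with k positions per key: false positives, no false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: Hashable) -> Iterator[int]:
        data = repr(key).encode()
        for i in range(self.k):
            # Derive k hash functions by prefixing a distinct byte per function.
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def add(self, key: Hashable) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: Hashable) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


class CacheLineHashTable:
    def __init__(self, num_buckets: int = 64) -> None:
        self.buckets: List[List[Tuple[Hashable, Any]]] = [[] for _ in range(num_buckets)]
        self.filter = BloomFilter()
        self.bucket_reads = 0  # simulated cache-line fetches performed by get()

    def put(self, key: Hashable, value: Any) -> bool:
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return True
        if len(bucket) >= SLOTS_PER_BUCKET:
            return False  # would spill past one cache line; rejected in this sketch
        bucket.append((key, value))
        self.filter.add(key)
        return True

    def get(self, key: Hashable) -> Optional[Any]:
        if not self.filter.might_contain(key):
            return None  # definitely absent: the bucket is never touched
        self.bucket_reads += 1  # one bucket == one (simulated) cache line
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return None
```

A standard Bloom filter cannot delete bits. A `remove` built on this sketch would therefore leave stale bits behind, which costs an occasional unnecessary bucket read but never produces a wrong answer. A counting Bloom filter is the usual alternative when deletions are frequent.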
Many modern computer systems include more than one processing unit (e.g., processors, cores, etc.). In such systems, two processing units may simultaneously access a hash table, or a location therein, and attempt to store or modify data at that location at the same time, which can cause data corruption. Two common hardware approaches to enabling concurrent access to a hash table while preventing data corruption are locking and atomic operations such as compare-and-swap (CAS).
In locking, the entire hash table or small subcomponents thereof (e.g., buckets in a separate chaining-based table) are locked. That is, when one processor is granted access to the hash table or a portion thereof, all other processors that may otherwise access it are temporarily denied access until the processor holding the access completes its operation. Though locking may prevent data corruption, it can impose significant performance penalties on a system that includes several processors, because each processor, before accessing a location in the hash table, must first determine whether that location is accessible at that time. The performance of systems employing locking can degrade further as the number of processors or cores increases, because locking tends to scale poorly to large numbers of cores due to excessive lock contention.
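Bucket-level locking, as described above, can be sketched with one `threading.Lock` per bucket of a separate-chaining table. Threads that touch different buckets proceed independently, while threads that hit the same bucket wait for one another. The class name and the `update` helper are hypothetical. Because of CPython's global interpreter lock, the sketch demonstrates only the locking discipline, not multi-core speedups or contention costs.

```python
import threading
from typing import Any, Callable, Hashable, List, Tuple


class BucketLockedHashTable:
    """Separate chaining with one lock per bucket (fine-grained locking)."""

    def __init__(self, num_buckets: int = 16) -> None:
        self.buckets: List[List[Tuple[Hashable, Any]]] = [[] for _ in range(num_buckets)]
        # One lock per bucket: threads contend only when they hit the same bucket.
        self.locks = [threading.Lock() for _ in range(num_buckets)]

    def _index(self, key: Hashable) -> int:
        return hash(key) % len(self.buckets)

    def update(self, key: Hashable, fn: Callable[[Any], Any], default: Any) -> None:
        """Replace key's value with fn(old value), or insert fn(default)."""
        i = self._index(key)
        with self.locks[i]:  # other threads needing bucket i block here
            bucket = self.buckets[i]
            for j, (k, v) in enumerate(bucket):
                if k == key:
                    bucket[j] = (key, fn(v))
                    return
            bucket.append((key, fn(default)))

    def get(self, key: Hashable) -> Any:
        i = self._index(key)
        with self.locks[i]:
            for k, v in self.buckets[i]:
                if k == key:
                    return v
        raise KeyError(key)


def worker(table: BucketLockedHashTable) -> None:
    # Each worker increments every counter 1000 times.
    for _ in range(1000):
        for key in range(10):
            table.update(key, lambda v: v + 1, 0)


table = BucketLockedHashTable()
threads = [threading.Thread(target=worker, args=(table,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print([table.get(k) for k in range(10)])  # each counter: 4 threads x 1000 = 4000
```

Because each read-modify-write runs entirely under its bucket's lock, no increment is lost. Replacing the per-bucket locks with a single table-wide lock would keep the result correct but serialize every operation, which is the scaling problem described above.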
CAS-based methods are sometimes also called “lock-free” methods because they do not require locking a hash table or portions thereof. As one of ordinary skill in the art would appreciate, however, CAS-based methods are in fact memory-based locking methods that require a lock to be maintained in memory. Therefore, before accessing a location in the hash table, each processor or core in a multi-processor or multi-core system must check the lock in memory. As such, typical CAS-based methods still incur some degree of memory contention, though that contention usually occurs at the table-entry level rather than at the bucket or table level, as is the case with lock-based methods. Though the check can be performed using a fast atomic operation such as CAS, the checking requirement can still impose performance penalties and may limit the scalability of the multi-core or multi-processor system.