This invention relates to memory systems and hash tables used in the memory systems. More particularly, this invention relates to shared-memory multiprocessor systems that utilize hash tables to facilitate data access.
Hashing is an efficient and popular technique for fast lookup of items based on a key value. Most research on hash tables has focused on programs with a single thread of control. However, many modern applications are multithreaded and run on multiprocessor systems. Server applications such as web servers, database servers, and directory servers are typical examples. Server applications often make use of one or more software caches to speed up access to frequently used items. The number of items in a cache may vary greatly, both over time and among installations.
Hashing is often used to provide fast lookup of items in a cache. In this context, the hash table becomes a shared, global data structure that should grow and shrink automatically with the number of items and that must be able to handle a high rate of concurrent operations (insert, lookup, and delete), all without wasting memory.
The inventors have developed a hashing scheme designed to meet these requirements.
Some programmers still believe that hash tables have to be of fixed size. That is, the table size has to be determined in advance and stays fixed thereafter. In the late 1970s, several researchers proposed schemes that allowed hash files to grow and shrink gradually in concert with the number of records in the file. Two methods, linear hashing and spiral storage, were subsequently adapted for main-memory hash tables. The system developed by the inventors uses hashing techniques that are based on linear hashing. Accordingly, spiral storage is not addressed in this disclosure.
A higher load on a hash table increases the cost of all basic operations: insertion, retrieval, and deletion. If performance is to remain acceptable when the number of records increases, additional storage must somehow be allocated to the table. The traditional solution is to create a new, larger hash table and rehash all the records into the new table. Typically, the new hash table is twice the size of the old one.
In contrast to the traditional solution, linear hashing allows a smooth growth in the table size. The table grows gradually, one bucket at a time, rather than doubling in size. When a new bucket is added to the address space, a limited local reorganization is performed. Linear hashing was developed by W. Litwin for external files (see, Linear Hashing: A new tool for file and table addressing, Proceedings of the 6th Conference on Very Large Databases (VLDB '81), 1981, pgs. 212-223) and adapted to in-memory hash tables by P.-Å. Larson (see, Dynamic Hash Tables, Communications of the ACM, Vol. 31, No. 4, 1988, pgs. 446-457).
To briefly describe linear hashing, consider a hash table consisting of N buckets with addresses 0, 1, ..., N-1. Linear hashing increases the address space gradually by splitting the buckets in a fixed order: first bucket 0, then bucket 1, and so on, up to and including bucket N-1. When a bucket is split, about half of its records are moved to a new bucket at the end of the table.
FIG. 1 illustrates the splitting process in linear hashing for an example table 20 with five buckets (N=5). A pointer p keeps track of the next bucket to be split. FIG. 1 shows the table 20 at four different growth stages: A, B, C, and D.
At stage A, the first bucket 0 is split, with part of its records being transferred to new bucket 5. At stage B, the second bucket 1 is split and some of the records are moved to new bucket 6. Stage C shows the split of the fifth and last of the original buckets, bucket 4 (i.e., N-1), with some of its records migrating to new bucket 9.
When all N buckets have been split and the table size has doubled to 2N, the pointer p is reset to zero and the splitting process starts over again, as shown in stage D. This time, the pointer travels from 0 to 2N-1, doubling the table size to 4N. This expansion process can continue as long as is required.
FIG. 2 illustrates how each bucket is split. In this example, bucket 0 of the exemplary five-bucket hash table 20 is split. The hash table is illustrated both before and after expansion to a sixth bucket.
An entry in the hash table 20 contains a single pointer 24, which is the head of a linked list 26 connecting all records that hashed to that address. When the table is of size five (i.e., five buckets), all records are hashed by the function h0(K)=K mod 5. Once the table size has doubled to ten (i.e., ten buckets), all records will be addressed by the function h1(K)=K mod 10. However, as illustrated in FIG. 1, linear hashing allows the table to expand one bucket at a time rather than doubling the table size immediately.
For this example, consider keys that hash to bucket 0. Under the hashing function h0(K)=K mod 5, the last digit of the key must be either 0 or 5. The linked list 26 shows records that end in either 0 or 5. Under the hashing function h1(K)=K mod 10, keys with the last digit equal to 0 still hash to bucket 0, while those with the last digit equal to 5 hash to bucket 5. None of the keys hashing to buckets 1, 2, 3, or 4 under function h0 can possibly hash to bucket 5 under function h1.
To expand the table, a new bucket (with address 5) is allocated at the end of the table and the pointer p is incremented by one. The process scans through the records of bucket 0 and relocates to the new bucket 5 all records that hash to 5 under h1(K)=K mod 10. In this example, records "345" and "605" are transferred from bucket 0 to new bucket 5. The records in buckets 1-4 remain unaffected.
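The split step described above can be sketched in a few lines of Python. This is an illustration only, not the inventors' implementation: it assumes the identity hash g(K)=K used in the example, represents each bucket as a plain Python list instead of a linked list, and the function name `split_next_bucket` and the keys other than 345 and 605 are hypothetical.

```python
def split_next_bucket(table, p, n_min, level):
    """Split bucket p of a linear hash table and return the new (p, level).

    table  : list of buckets, each bucket a list of integer keys
    p      : index of the next bucket to be split
    n_min  : minimum table size N
    level  : number of times the table has doubled so far (L)
    Assumes g(K) = K, as in the K mod 5 / K mod 10 example.
    """
    new_modulus = n_min * 2 ** (level + 1)   # modulus used by h_(L+1)
    old_records = table[p]

    # The new bucket lands at address p + N * 2^L, the end of the table.
    table.append([k for k in old_records if k % new_modulus != p])
    table[p] = [k for k in old_records if k % new_modulus == p]

    p += 1
    if p == n_min * 2 ** level:              # every bucket of this round is split
        p = 0
        level += 1
    return p, level


# Five-bucket table; bucket 0 holds keys ending in 0 or 5.
table = [[340, 345, 10, 605], [1], [2], [3], [4]]
p, level = split_next_bucket(table, p=0, n_min=5, level=0)
```

After the call, bucket 0 keeps the keys ending in 0, the new bucket 5 receives 345 and 605, and buckets 1-4 are untouched.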
As the table size changes, record addresses for various records affected by the expansion (or contraction) also change. The current address of a record can, however, be computed quickly. Given a key K, a value for h0(K) is computed. If h0(K) is less than the current value of p, the corresponding bucket has already been split; otherwise, the corresponding bucket has not been split. If the bucket has been split, the correct address of the record is given by h1(K). When all original buckets 0-4 have been split and the table size has increased to ten, all records are addressed by h1(K), which becomes the new h0(K) for N=10.
The address computation can be implemented in several ways, but the following solution appears to be the simplest. Let g be a normal hashing function producing addresses in some interval [0, M], where M is sufficiently large, say, M > 2^20. To compute the address of a record, the following sequence of hashing functions is employed:
h_i(K) = g(K) mod (N × 2^i), i = 0, 1, ...
where N is the minimum size of the hash table. (It is noted that if N is a power of two, the modulo operation reduces to extracting the last bits of g(K).) The hashing function g(K) can be implemented in several ways. Functions of the type g(K)=(cK) mod M, where c is a constant and M is a large prime, have experimentally been found to perform well. Different hashing functions are easily obtained by choosing different values for c and M.
The current state of the hash table is tracked by two variables:
L = number of times the table size has doubled (from its minimum size, N); and
p = pointer to the next bucket to be split, p < N × 2^L.
Given a key K, the current address of the corresponding record can be computed as follows:
addr = h_L(K)    (1)
if (addr < p) then addr = h_(L+1)(K)    (2)
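The two-step address computation can be written out as a short Python sketch. The names `multiplicative_hash` and `current_address`, and the constants c = 314159 and M = 2^31 - 1, are assumptions chosen for illustration; the text only requires that c be a constant and M a large prime.

```python
def multiplicative_hash(key, c=314159, m=2**31 - 1):
    """g(K) = (cK) mod M. The values of c and M are illustrative assumptions."""
    return (c * key) % m


def current_address(key, n_min, level, p, g=lambda k: k):
    """Return the current bucket address of `key`.

    n_min : minimum table size N
    level : number of doublings so far (L)
    p     : next bucket to be split
    g     : base hashing function; defaults to the identity so that the
            K mod 5 / K mod 10 example can be reproduced directly
    """
    addr = g(key) % (n_min * 2 ** level)            # step 1: h_L(K)
    if addr < p:                                    # bucket already split
        addr = g(key) % (n_min * 2 ** (level + 1))  # step 2: h_(L+1)(K)
    return addr


# With N = 5, L = 0 and bucket 0 already split (p = 1):
a345 = current_address(345, n_min=5, level=0, p=1)   # moved to the new bucket
a340 = current_address(340, n_min=5, level=0, p=1)   # stays in bucket 0
a127 = current_address(127, n_min=5, level=0, p=1)   # bucket 2 not yet split
```

Passing `g=multiplicative_hash` swaps in the multiplicative function; the result always falls within the N × 2^L + p buckets that currently exist.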
Contracting the table by one bucket is exactly the inverse of expanding it by one bucket. First, the state variables are updated and then all records of the last bucket are moved to the bucket referenced by pointer p, and the last bucket is freed.
The discussion thus far has focused on how to expand or contract the hash table, but not when to do so. One way to determine when a hash table should undergo a size change is to bound the "overall load factor" of the table and to change the table size when the overall load factor crosses the bounds. The "overall load factor" is defined as the number of records in the table divided by the (current) number of buckets; i.e., the average chain length. A lower bound and an upper bound are established for the overall load factor, and the table is expanded (contracted) whenever the overall load factor goes above (below) the upper (lower) bound. To support this decision mechanism, the hash table must track the current number of records in the table, in addition to the state variables L and p.
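The load-factor rule can be illustrated with a small Python sketch. The function name `resize_decision` and the default bounds of 1.0 and 3.0 are hypothetical; the text does not specify particular bound values.

```python
def resize_decision(num_records, num_buckets, lower_bound=1.0, upper_bound=3.0):
    """Decide whether the table should grow or shrink by one bucket.

    The overall load factor is the average chain length:
    number of records divided by the current number of buckets.
    The default bounds are illustrative assumptions.
    """
    load_factor = num_records / num_buckets
    if load_factor > upper_bound:
        return "expand"
    if load_factor < lower_bound:
        return "contract"
    return "keep"


grow = resize_decision(num_records=40, num_buckets=10)    # load factor 4.0
shrink = resize_decision(num_records=5, num_buckets=10)   # load factor 0.5
steady = resize_decision(num_records=20, num_buckets=10)  # load factor 2.0
```

A practical table would also refuse to contract below its minimum size N; that check is omitted here for brevity.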
As noted above, many modern applications are multithreaded and run on shared-memory multiprocessor (SMP) systems. There are many challenges in constructing a scaleable hashing mechanism that accommodates the needs of this type of application. Among the main challenges are reducing lock contention, so that many threads can access the same hash table concurrently, and reducing cache misses, to improve overall access speed and performance.
Lock Contention
In multithreaded applications, many threads need access to the same hash table concurrently. Problems arise in that concurrently accessing threads can disrupt one another. One thread may cause the hash table to change or scale in size, while another thread is in the process of using the table for its own purpose.
One conventional approach to avoiding this problem is to use a single, global lock that protects all access to the table. When a thread gains access to the table, it locks the table so that no other thread can use the table until the first thread is finished. The single lock serializes all operations on the table so that they cannot possibly interfere with each other.
For multithreaded applications with many concurrent threads, serialized operation on the hash table easily becomes a bottleneck resulting in poor scalability. This bottleneck restricts or even negatively impacts an application's ability to scale. That is, adding more processors may not only fail to increase throughput but may sometimes even decrease throughput.
The inventors have developed a scaleable hash table that permits many operations on the hash table to proceed concurrently, resulting in excellent scalability.
Cache Miss Problems
All modern CPUs rely on multilevel processor caches to bridge the latency gap between memory and processors. The cost of a complete cache miss is substantial in the cycles wasted while the processor is stalled waiting for data to arrive from memory. Today, it is already as high as 100 cycles on some processors and it is expected to get worse.
Accordingly, there is a need to improve performance by reducing the number of cache misses. The inventors have devised a hash table that utilizes a cache-friendly data structure and hashing algorithm to reduce cache misses and page faults.
This invention concerns a scaleable hash table that supports very high rates of concurrent operations on the hash table (e.g., insert, delete, and lookup), while simultaneously reducing cache misses. The hash table is designed to meet the requirements of multithreaded applications running on shared-memory multiprocessors (SMP), but may be used by any application.
According to one implementation, the hash table is used by an application running on a shared-memory multiprocessor system with a memory subsystem and a processor subsystem interconnected via a bus structure. The processor subsystem has multiple microprocessors that are coupled to share the data resources on the memory subsystem. Each microprocessor has a central processing unit and cache memory (e.g., multilevel L1 and L2 caches).
The memory subsystem is a hierarchical memory system having a nonvolatile main memory (e.g., nonvolatile RAM) and persistent stable storage (e.g., disks, tape, RAID, etc.). A hash table is stored in the main memory to facilitate access to data items kept in the main memory or stable storage. The hash table consists of multiple buckets, where each bucket comprises a linked list of bucket nodes that hold references to data items whose keys hash to a common value or address. A suitable hashing function can be selected to provide an approximately even distribution of data items across the buckets.
Individual bucket nodes contain multiple signature-pointer pairs that reference corresponding data items. Each signature-pointer pair has a hash signature computed from a key of the data item and a pointer to the data item. The number of signature-pointer pairs in one bucket node is selected so that they fill one or more cache lines exactly when stored in the processor's cache. In one implementation, each bucket node holds seven signature-pointer pairs, which can be loaded conveniently into two 32-byte cache lines. In another implementation, three signature-pointer pairs were found to be effective.
When using two cache lines, the pairs are laid out in an arrangement that places the signatures in the first cache line and the pointers in the second cache line. The first cache line also holds a (spin) lock and the second cache line also stores a pointer to the next bucket node on the list.
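The arithmetic behind the seven-pair layout can be checked with a short Python sketch built on the standard `ctypes` module. It assumes 4-byte signatures, a 4-byte spin lock, and 4-byte (32-bit) pointers, modeled here as `c_uint32`; these sizes, the field names, and the position of the lock within the first cache line are assumptions for illustration, not details stated in the text.

```python
import ctypes

PAIRS_PER_NODE = 7      # signature-pointer pairs per bucket node
CACHE_LINE_BYTES = 32   # cache line size used in the example


class BucketNode(ctypes.Structure):
    """Two-cache-line bucket node, assuming 4-byte fields throughout."""
    _fields_ = [
        # First cache line: spin lock plus the seven signatures.
        ("lock", ctypes.c_uint32),
        ("signatures", ctypes.c_uint32 * PAIRS_PER_NODE),
        # Second cache line: the seven item pointers plus the link to the
        # next bucket node (32-bit addresses modeled as integers).
        ("pointers", ctypes.c_uint32 * PAIRS_PER_NODE),
        ("next_node", ctypes.c_uint32),
    ]


node_size = ctypes.sizeof(BucketNode)          # total bytes per node
second_line_start = BucketNode.pointers.offset  # where the pointers begin
```

Under these assumptions each line holds 7 × 4 + 4 = 32 bytes, so a lookup that fails on every signature touches only the first cache line. With 8-byte pointers the same layout would no longer fit in two 32-byte lines.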
The hash table is configured to store the first bucket node in the linked list for each of the buckets. Thus, the multiple signature-pointer pairs for the bucket are kept in the hash table, rather than just a pointer to the first node. This helps reduce potential cache misses.
To enable high rates of concurrency, while serializing access to sections of the table, the hash table utilizes two levels of locks: a higher level table lock and multiple lower level bucket locks. The table lock is held just long enough for a thread to set the bucket lock of a particular bucket. Once the table lock is released, another thread can access the hash table and any one of the non-locked buckets. In this manner, multiple threads can be conducting concurrent operations on the hash table (e.g., insert, delete, and lookup).
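The two-level locking discipline can be sketched in Python with `threading.Lock`. This is a simplified illustration: the class name `TwoLevelLockedTable` is hypothetical, the table has a fixed number of buckets (no splitting or contraction), keys are integers hashed by a plain modulo, and ordinary mutexes stand in for spin locks.

```python
import threading


class TwoLevelLockedTable:
    """Fixed-size table guarded by one table lock and per-bucket locks."""

    def __init__(self, num_buckets=8):
        self.table_lock = threading.Lock()
        self.bucket_locks = [threading.Lock() for _ in range(num_buckets)]
        self.buckets = [[] for _ in range(num_buckets)]

    def _acquire_bucket(self, key):
        # Hold the table lock only long enough to locate the bucket
        # and set its lock, then release the table lock.
        with self.table_lock:
            index = key % len(self.buckets)
            self.bucket_locks[index].acquire()
        return index

    def insert(self, key):
        index = self._acquire_bucket(key)
        try:
            self.buckets[index].append(key)
        finally:
            self.bucket_locks[index].release()

    def lookup(self, key):
        index = self._acquire_bucket(key)
        try:
            return key in self.buckets[index]
        finally:
            self.bucket_locks[index].release()

    def delete(self, key):
        index = self._acquire_bucket(key)
        try:
            if key in self.buckets[index]:
                self.buckets[index].remove(key)
                return True
            return False
        finally:
            self.bucket_locks[index].release()
```

Because the table lock is released before the bucket operation begins, threads working on different buckets proceed in parallel; only threads that hash to the same bucket wait on each other.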
In another implementation, the hash table is further partitioned into multiple separate subtables, where each subtable is itself a linear hash table as described above. In this implementation, multiple subtable locks are used to govern access to the subtables, with underlying bucket locks governing access to individual buckets.