This invention relates to information storage and retrieval systems, and, more particularly, to the use of hashing techniques in caching systems.
Techniques for caching frequently-used data have been used for many decades, and provide fast access to information that would otherwise require long retrieval times or lengthy computation. A cache is a storage mechanism that holds a desired subset of data that is stored in its entirety elsewhere, or data that results from a lengthy computation. Its purpose is to make future accesses to a stored data item faster. A cache is usually dynamic in nature: items stored in it may not reside there permanently, and frequently those items whose future value is questionable are replaced by items predicted to be more valuable. Typically, but not exclusively, older items are replaced by newer ones. Successful application of caching can be found in the routing caches used by Internet servers to provide quick real-time access to network routing information, for example, though its general usefulness makes it a popular technique used in many other areas of computing as well. Since particular data items stored in the cache may become less valuable over time, based on the state of the system or for reasons, a mechanism that allows for the effective, efficient, and flexible removal of less valuable data is beneficial.
Records stored in a computer-controlled storage mechanism such as a cache are retrieved by searching for a particular key value among stored records, a key being a distinguished field (or collection of fields) in a record, which is defined to be a logical unit of information. The stored record with a key matching the search key value is then retrieved. Though data caching can be done using a variety of techniques, the use of hashing has become a popular way of building a cache because of its speed advantage over other information retrieval methods. Hashing is fast compared to other information storage and retrieval methods because it requires very few key comparisons to locate a requested record.
Hashing methods use a hashing function that operates on—technical term is maps—a key to produce a storage address in the storage space, called the hash table, which is a large one-dimensional array of record locations. This storage address is then accessed directly for the desired record. Hashing techniques are described in the classic text by D. E. Knuth entitled The Art of Computer Programming, Volume 3, Sorting and Searching, Addison-Wesley, Reading, Mass., 1973, pp. 506-549, in Data Structures and Program Design, Second Edition, by R. L. Kruse, Prentice-Hall, Incorporated, Englewood Cliffs, N.J., 1987, Section 6.5, “Hashing,” and Section 6.6, “Analysis of Hashing,” pp. 198-215, and in Data Structures with Abstract Data Types and Pascal, by D. F. Stubbs and N. W. Webre, Brooks/Cole Publishing Company, Monterey, Calif., 1985, Section 7.4, “Hashed Implementations,” pp. 310-336.
Hashing functions are designed to translate the universe of keys into addresses uniformly distributed throughout the hash table. Typical hashing functions include truncation, folding, transposition, and modulo arithmetic. Cyclic redundancy check (CRC) functions are occasionally used as hashing functions.
A disadvantage of hashing is that more than one key will inevitably translate to the same storage address, causing collisions in storage. Some form of collision resolution must therefore be provided. Resolving collisions within the hash table itself by probing other elements of the table is called open addressing. For example, the simple open-addressing strategy called linear probing views the storage space as logically circular and consists of sequentially scanning in a forward direction from the initial storage address to the first empty storage location.
Another method for resolving collisions is called external chaining. In this technique, each hash table location is a pointer to the head of a linked list of records, all of whose keys map under the hashing function to that very hash table address. The linked list is itself searched sequentially when retrieving, inserting, or deleting a record, and insertion and deletion are done by adjusting pointers in the linked list. External chaining can make better use of memory than open addressing because it doesn't require initial pre-allocation of maximum storage and easily supports concurrency with the ability to lock individual linked lists.