The present invention relates to a memory architecture optimized for hashing, and to a method of organizing data which is optimized for hashing, and more particularly optimized for carrying out the Lempel-Ziv data compression algorithm.
The problem of organizing and storing data in a manner to allow quick retrieval is constantly encountered. A large variety of techniques known as hashing have been used to solve the data organization and retrieval problem.
Frequently, a data element will have a value range that makes retrieval by simple look-up impracticable. For example, a small five character ASCII word can have over a trillion possible values, and would require an equally large space of memory addresses if every possible value were allocated a unique address in the memory address space.
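By way of illustration only, the size of the value range in the example above can be checked with a short calculation. The sketch assumes that each character occupies an eight-bit byte; the constant names are hypothetical and do not appear in the source.

```python
# Assumption: each character of the word is stored as an 8-bit byte.
BITS_PER_CHAR = 8
WORD_LENGTH = 5  # the five character word of the example

# Every distinct bit pattern is a distinct possible value of the word.
possible_values = 2 ** (BITS_PER_CHAR * WORD_LENGTH)

print(possible_values)  # 1099511627776, i.e. just over a trillion
```

A direct look-up table would thus need roughly 1.1 trillion addresses, one for each 40-bit pattern.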
Hashing involves the application of a many-to-one function to the data element to map it from a larger to a smaller address space. The smaller address space will be the space of memory addresses in which the data is to be stored. A data element, called a key, is used as the argument of the many-to-one function, called the hash function. Storage of information associated with the key is accomplished by computing the hash function and storing the key, and associated data, in a memory at an address, called the hash address, corresponding to the hash function value. Because of the many-to-one property of the hash function, a smaller address space is required than if each of the possible data elements were to be assigned a unique address in memory. Retrieval of stored data associated with a particular key then involves simply computing the hash function value for the key and reading out the stored data from the memory.
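The storage and retrieval steps just described may be sketched as follows. This is a minimal illustration, not a definitive implementation: the table size, the particular hash function, and the names `hash_address`, `store` and `retrieve` are all assumptions introduced here, and collisions are deliberately left unresolved.

```python
# Assumption: a small address space of 4096 hash addresses.
TABLE_SIZE = 4096

# One slot per hash address; None marks an empty slot.
table = [None] * TABLE_SIZE


def hash_address(key: bytes) -> int:
    """Many-to-one function mapping a key to an address in 0..TABLE_SIZE-1.

    The multiply-and-add scheme is an arbitrary illustrative choice.
    """
    h = 0
    for byte in key:
        h = (h * 31 + byte) % TABLE_SIZE
    return h


def store(key: bytes, data) -> None:
    """Store the key and its associated data at the key's hash address."""
    table[hash_address(key)] = (key, data)


def retrieve(key: bytes):
    """Recompute the hash address and read out the stored data, if any."""
    slot = table[hash_address(key)]
    if slot is not None and slot[0] == key:
        return slot[1]
    return None


store(b"hello", 42)
print(retrieve(b"hello"))  # 42
```

Because the sketch keeps only one record per address, a second key hashing to the same address would overwrite the first.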
Because of the many-to-one mapping by the hash function, more than one key can be hashed to the same memory address. This condition is called a collision, and a large number of techniques have been developed for resolving collisions. Good summaries of hashing techniques including collision resolution can be found in D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching (1973) and J. S. Vitter, Analysis of Coalesced Hashing, Stanford University Department of Computer Science, Report No. STAN-CS-80-817, August 1980, reprinted in part at J. S. Vitter, "Tuning the Coalesced Hashing Method to Obtain Optimum Performance", Proceedings of Twenty-first Annual Symposium on Foundations of Computer Science, October 1980.
Many of the collision resolution methods utilize a technique called chaining. In the chaining method a dynamic linked list is maintained at each memory address corresponding to a hash function value or hash address. The hash addresses are stored in a table. The key and associated data are organized as a record which includes a key field, fields for data associated with the key, and a link field. After hashing a key, if there is a collision, a sequential search is performed in the list to locate the particular record containing the key value being searched. If the list terminates without the key being found, and if it is desired to add the key and any associated data to the stored data, a new list element is added in the form of a record with that key as its content.
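A minimal sketch of chaining as described above is given below. The record layout (key field, data field, link field) follows the text; the table size, the deliberately weak hash function, and the names `Record`, `heads`, `find` and `insert` are hypothetical choices made so that collisions are easy to provoke.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a tiny table so that collisions occur readily.
TABLE_SIZE = 8


@dataclass
class Record:
    key: bytes                        # key field
    data: int                         # data associated with the key
    link: Optional["Record"] = None   # link field: next record in the list


# One list head per hash address; None marks an empty list.
heads = [None] * TABLE_SIZE


def hash_address(key: bytes) -> int:
    """Illustrative hash: sum of the key bytes modulo the table size."""
    return sum(key) % TABLE_SIZE


def find(key: bytes) -> Optional[Record]:
    """Search the list at the key's hash address sequentially."""
    rec = heads[hash_address(key)]
    while rec is not None:
        if rec.key == key:
            return rec
        rec = rec.link
    return None  # list terminated without the key being found


def insert(key: bytes, data: int) -> Record:
    """Return the existing record, or append a new one to the list."""
    existing = find(key)
    if existing is not None:
        return existing
    new = Record(key, data)
    addr = hash_address(key)
    if heads[addr] is None:
        heads[addr] = new
    else:
        tail = heads[addr]
        while tail.link is not None:
            tail = tail.link
        tail.link = new
    return new


insert(b"ab", 1)
insert(b"ba", 2)           # same byte sum as b"ab", so a collision
print(find(b"ba").data)    # 2, found by following the link from b"ab"
```

Both keys hash to the same address, so the second record is reached only by following the link field of the first.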
The chaining method has two principal advantages and two drawbacks. Because on the average the linked lists associated with each of the hash addresses are small, the search time for data retrieval is short. This remains true even as the hash address table begins to fill up. Knuth (cited above) has shown that even with the table completely full the chaining method requires only about 1.5 to 1.8 probes in order to locate the key. In addition, the hash address of the key is uniquely associated with it, that is, a key always hashes to the same hash value even though other keys may hash to that same value, and does not depend on the order in which the key is inserted in the hash address table. This uniqueness allows the use of abbreviated keys and a resultant reduction in storage requirements and in search time.
One possible problem with the chaining method is that the overhead of the link field could create severe memory requirements. A second problem is that an additional block of memory, used as overflow memory, is required in order to ensure adequate storage of the linked lists associated with each of the hash addresses. These two problems are intertwined.
If, because of the nature of the data, or because of poor hash function properties, many keys are hashed to the same hash address, the overflow memory will have to be big enough to store a substantial part of the entire key space. Even for well behaved hash functions, Vitter (cited above) has shown that the overflow memory can increase the overall memory requirements from about 37% to 100%.
Hashing efficiency can be increased by expanding the memory available for hash addresses. The larger the hash table that can be used for a given number of entries, the less the likelihood of a collision. However, because each hash address requires an associated link field, increasing the memory available to the hash table will expand the memory requirements by more than just the additional memory allocated to the hash table.
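The trade-off described above can be illustrated numerically. The following sketch assumes an idealized hash function that distributes keys uniformly and independently over the table, which is not stated in the source; the function names and the figures used (1000 keys, a 12-bit link field) are hypothetical.

```python
def expected_collisions(num_keys: int, table_slots: int) -> float:
    """Expected number of keys landing on an already occupied address.

    Assumption: uniform, independent hashing. The expected number of
    occupied addresses is m * (1 - (1 - 1/m) ** n); every key beyond
    that count has collided with an earlier one.
    """
    occupied = table_slots * (1 - (1 - 1 / table_slots) ** num_keys)
    return num_keys - occupied


def link_overhead_bits(table_slots: int, link_bits: int) -> int:
    """Bits spent on link fields alone, one link field per hash address."""
    return table_slots * link_bits


# Doubling the table lowers the expected collisions for 1000 keys,
# but the link-field overhead doubles along with the table.
for slots in (1024, 2048, 4096):
    print(slots,
          round(expected_collisions(1000, slots), 1),
          link_overhead_bits(slots, link_bits=12))
```

Under these assumptions each enlargement of the table reduces collisions, while the memory consumed by link fields grows in proportion to the number of hash addresses.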
It is accordingly an object of the invention to provide a memory architecture and a method of data organization optimized for hashing which permits collision resolution by chaining and which also minimizes the amount of memory which must be allocated.