This invention is directed to a novel hashing technique, i.e., a technique for mapping from a larger domain to a smaller range. The invention is more particularly directed to a hashing technique which requires a relatively small number of hashing functions to map data from a domain A to a range B while minimizing the probability of two different data points in A being mapped to the same point in B.
The concept of hashing is well-known and is employed in a variety of applications where it is necessary to map a large set of data points into a smaller set of data points. The invention will be described primarily in the context of data storage and retrieval, although it should be remembered that hashing in general, and the invention in particular, have many other useful applications.
In one example of the use of hashing, there may be a large memory and a smaller cache, and it may be desirable to selectively store the contents of certain memory addresses in the cache. When storing data in the cache, a hashing function may be used to generate a cache address at which the data is to be stored. For example, if the cache has a storage capacity of 256 words and the larger memory stores 4,096 words, the addresses in the cache can be specified with 8 bits of address while the addresses in the larger memory can be specified with 12 bits of address. One simple type of hashing function would be to use the first 8 bits of the larger memory address as the address in the cache. In this case, there will be sixteen addresses in the larger memory capable of being mapped to any given address in the cache. If an address from the memory is hashed to an address which is already occupied, some additional steps must be taken in an attempt to find a vacant storage location in the cache. It would be desirable to have a hashing function which would minimize such "collisions".
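The simple hashing function in this example can be sketched as follows. This is only an illustration: the names `high_bits_hash` and `addresses_mapping_to` are hypothetical, and "the first 8 bits" is assumed to mean the 8 most significant bits of the 12-bit address.

```python
CACHE_BITS = 8    # 256-word cache -> 8-bit cache addresses
MEMORY_BITS = 12  # 4,096-word memory -> 12-bit memory addresses


def high_bits_hash(address):
    """Keep the 8 most significant bits of a 12-bit memory address (assumed reading of "first 8 bits")."""
    if not 0 <= address < (1 << MEMORY_BITS):
        raise ValueError("address must fit in 12 bits")
    return address >> (MEMORY_BITS - CACHE_BITS)


def addresses_mapping_to(slot):
    """List every memory address that this hashing function sends to the given cache address."""
    return [a for a in range(1 << MEMORY_BITS) if high_bits_hash(a) == slot]


if __name__ == "__main__":
    # Each cache address is shared by 4096 / 256 = 16 memory addresses.
    print(len(addresses_mapping_to(0)))
```

Under this assumed reading, the sixteen memory addresses sharing a cache address form one contiguous block, which is what makes runs of consecutive addresses troublesome.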
If the selection of data from the larger memory is random, and the number of elements already placed in the cache is significantly smaller than the size of the cache, then the probability of a new memory address colliding with a previous address is relatively small. In many cases, however, data selection is not random; instead, consecutive addresses from the larger memory are often selected. Since there is a relatively high probability that consecutive addresses in the larger memory will have the same hashing address in the above example, the frequency of collisions would be unacceptable.
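The effect of consecutive addresses can be illustrated with a small experiment. The sketch below assumes the same 12-bit-to-8-bit high-order-bits hashing function as the example above; the helper `count_collisions`, the starting address `0x400`, and the seeded random sample are all hypothetical choices made only for illustration.

```python
import random


def high_bits_hash(address, memory_bits=12, cache_bits=8):
    """Keep the most significant cache_bits of the address (assumed hashing function)."""
    return address >> (memory_bits - cache_bits)


def count_collisions(addresses, hash_fn):
    """Count how many addresses hash to a cache address that is already occupied."""
    occupied = set()
    collisions = 0
    for address in addresses:
        slot = hash_fn(address)
        if slot in occupied:
            collisions += 1
        else:
            occupied.add(slot)
    return collisions


# 32 consecutive memory addresses (the starting point is arbitrary).
consecutive = range(0x400, 0x400 + 32)
# 32 distinct addresses drawn at random; the seed only makes the run repeatable.
scattered = random.Random(0).sample(range(1 << 12), 32)

consecutive_collisions = count_collisions(consecutive, high_bits_hash)
scattered_collisions = count_collisions(scattered, high_bits_hash)

if __name__ == "__main__":
    print("consecutive:", consecutive_collisions)
    print("random:     ", scattered_collisions)
```

The 32 consecutive addresses occupy only two cache addresses, so 30 of them collide, whereas a random sample of the same size would be expected to collide far less often.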
Substantial efforts have been made to provide hashing techniques which will maintain a relatively low probability of collision regardless of the input distribution. In our paper entitled "Universal Classes of Hash Functions", Journal of Computer and System Sciences, Vol. 18, No. 2, April 1979, pages 143-154, we have proposed an approach wherein a set of hash functions is provided instead of just one. A hash function is selected from the hash function set each time an application is run, or less frequently if desired. If the set of hash functions is carefully chosen to be a universal₂ class, good performance can be obtained regardless of the input distribution. The term "universal₂" is explained in detail in the above-referenced paper. Briefly, a set of hash functions will be considered universal₂ if, for a hash function chosen at random from the set, the probability of any two distinct data points colliding is less than or equal to 1/b, where b is the size of range B. In other words, no pair of distinct data points will collide under more than 1/b of the hash functions in the set.
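The 1/b condition can be checked exhaustively for a small class. The text above does not say which class is meant, so the sketch below assumes the widely known family h(x) = ((a·x + c) mod p) mod b with p prime; the parameters `P = 17` and `B = 4` and all function names are hypothetical and chosen only to keep the check small.

```python
import random

P = 17  # a prime at least as large as the domain A = {0, ..., 16} (assumed)
B = 4   # size b of the range B = {0, ..., 3} (assumed)


def make_hash(a, c):
    """Return the member h(x) = ((a*x + c) mod P) mod B of the assumed family."""
    def h(x):
        return ((a * x + c) % P) % B
    return h


def all_parameters():
    """Every (a, c) pair that names one hash function in the family; a must be nonzero."""
    return [(a, c) for a in range(1, P) for c in range(P)]


def worst_pair_collisions():
    """Return (largest number of functions under which any one pair collides, family size)."""
    family = [make_hash(a, c) for a, c in all_parameters()]
    worst = 0
    for x in range(P):
        for y in range(x + 1, P):
            colliding = sum(1 for h in family if h(x) == h(y))
            worst = max(worst, colliding)
    return worst, len(family)


def pick_hash(rng):
    """Select one hash function at random, as would be done each time an application is run."""
    return make_hash(rng.randrange(1, P), rng.randrange(P))


if __name__ == "__main__":
    worst, size = worst_pair_collisions()
    # The class satisfies the condition if worst / size <= 1 / B.
    print(worst, size, worst * B <= size)
```

Selecting the function at random through `pick_hash` is what removes the dependence on the input distribution: no fixed pair of data points is unlucky under more than 1/b of the possible selections.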
A problem with the hashing technique described above is that known universal₂ classes typically contain a fairly large number of hash functions. For instance, a universal₂ class capable of hashing n-bit long names will typically contain 2^O(n) functions, and O(n) bits are therefore required to specify the selected hashing function, where O(n) in general designates a linear function of n. In some instances, e.g., when using hashing to generate authentication tags, the system becomes impractical if a large number of bits are required to specify the hashing function.
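The linear growth in the number of bits can be made concrete with a small calculation. The sketch below again assumes the family h(x) = ((a·x + c) mod p) mod b, with p taken as the smallest prime not less than 2^n; this family and the helper names are assumptions made for illustration, not something stated in the text above.

```python
def is_prime(k):
    """Trial-division primality test; adequate for the small values used here."""
    if k < 2:
        return False
    d = 2
    while d * d <= k:
        if k % d == 0:
            return False
        d += 1
    return True


def next_prime(k):
    """Smallest prime greater than or equal to k."""
    while not is_prime(k):
        k += 1
    return k


def bits_to_specify(n):
    """Bits needed to name one function of the assumed family for n-bit names."""
    p = next_prime(1 << n)
    family_size = p * (p - 1)  # choices of a in 1..p-1 times choices of c in 0..p-1
    return (family_size - 1).bit_length()  # ceiling of log2(family_size)


if __name__ == "__main__":
    for n in (8, 12, 16):
        print(n, bits_to_specify(n))
```

For this assumed family the count comes to roughly 2n bits, which is modest for short names but grows in direct proportion to the length of the names being hashed.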
In our subsequent paper entitled "New Hash Functions and Their Use in Authentication and Set Equality", Journal of Computer and System Sciences, Vol. 22, No. 3, June 1981, pages 265-279, we have proposed a set of hashing functions which is "almost" strongly universal₂ and is much smaller, such that only m·log(n) bits are required to specify the chosen hashing function, where m is the length of the output. While this improvement can make some applications of hashing practical, it would be desirable to provide an even smaller class of hashing functions which will achieve satisfactory results. It is sometimes necessary to hash data strings which are long enough that even m·log(n) can be an excessively large number.