The generation of hash values is a technique used in many areas of data processing, such as database addressing and data encryption. The use of hashing techniques will now be introduced with reference to the particular example of address management in databases.
Data are commonly stored in an organized manner known as a database under control of a data processor. By means of this data processor a user is able to browse or search through the data, or to insert, amend or delete data as required. Although the user has no need to be aware of the exact manner in which the data are stored, it is important for speed of operation that the data processor is able to locate and retrieve a particular data item as accurately and quickly as possible. In practice this requires that the data processor is able to locate a desired data item directly, without having to search through the entire store of data.
In general each data item may comprise two sections: a short key identifying the item, and the remainder of the information associated with that key. For example, a particular database may use a person's name as a key to identify a larger portion of data containing further details about that person. For small data items the entire item may be used as the key. From the information in the key alone, the data processor must be able to find the physical location in the stored data in order to access the data item corresponding to that key. For example, if the data are stored on a magnetic or optical disk, the track and sector address is required before a particular data item can be accessed. In the case of hardware storage in random access memory (RAM) the storage address is required.
In order to deduce the storage location for a particular data item, that item's key may be converted directly into a numerical value, which is in turn scaled or otherwise manipulated to produce a storage address lying within the range of addresses available in the storage means in use. When the data item is originally stored, it is stored at the address calculated in this way from the item's key. Likewise, if subsequent access to the item is required, the item is searched for at the address calculated from the item's key at the time at which the access is required.
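The direct conversion described above can be sketched as follows. This is a minimal illustration only, assuming that the key is ASCII text padded to a fixed width and that storage addresses are simply the integers from 0 to num_addresses - 1. The function names and the base-256 reading of the key are hypothetical choices made for the example and are not taken from the description.

```python
def key_to_number(key: str, key_length: int) -> int:
    """Convert a key directly into a numerical value.

    Assumption: the key is ASCII text, padded with spaces (or truncated)
    to key_length characters and read as one base-256 integer.
    """
    padded = key.ljust(key_length)[:key_length]
    return int.from_bytes(padded.encode("ascii"), "big")


def direct_address(key: str, key_length: int, num_addresses: int) -> int:
    """Scale the key's numerical value into the range 0..num_addresses-1."""
    largest = 256 ** key_length  # one more than the largest possible key value
    return key_to_number(key, key_length) * num_addresses // largest


# The same calculation is used both when storing and when retrieving,
# so a given key always yields the same address.
address_when_stored = direct_address("SMITH", 8, 1000)
address_when_read = direct_address("SMITH", 8, 1000)
```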
The simple addressing system described above may be fast in operation but suffers from the fact that the key values are not usually uniformly distributed across the range of possible key values. For example, if a subject's personal name is used as the key, it will be clear that certain strings of characters such as "SMITH" will appear many times as a key, while other strings such as "ABCDEFG" are extremely unlikely to appear. In the direct key-to-address conversion system described above, this will lead to bunching of the data items around certain storage addresses, while other addresses will remain unused. This can lead to very inefficient use of the available storage space. To overcome this problem, a technique known as "hashing" is commonly used to calculate a storage address from an item's key.
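The bunching effect can be illustrated with a short sketch. It assumes the same hypothetical direct conversion as before (ASCII keys padded to eight characters, read as a base-256 integer and scaled linearly to 1000 addresses); the sample names are invented for the example. Because the leading characters dominate the numerical value, keys sharing a prefix fall on the same address, and keys made up of capital letters can only ever reach a narrow band of the address range.

```python
from collections import Counter


def direct_address(key: str, key_length: int = 8, num_addresses: int = 1000) -> int:
    """Direct key-to-address conversion (illustrative assumption, not a hash)."""
    padded = key.ljust(key_length)[:key_length].encode("ascii")
    return int.from_bytes(padded, "big") * num_addresses // 256 ** key_length


# Hypothetical sample keys: several similar surnames plus one unlikely string.
names = ["SMITH", "SMITHE", "SMYTH", "SMALL", "JONES", "JOHNSON", "ABCDEFG"]
addresses = {name: direct_address(name) for name in names}

# Count how many items fall on each address; similar names pile up together.
usage = Counter(addresses.values())
for address, count in sorted(usage.items()):
    print(f"address {address}: {count} item(s)")
```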
In the hashing technique a data item's key is converted into a near-random number, which is then scaled as above to provide the storage address for that item. Typically the storage address refers to an area (known as a "bucket") in which a group of items can be stored. The procedure for storing an individual data item is therefore as follows:
a) The item's key is hashed to produce a bucket identifier n, where n is a near-random integer in the range from 1 to N, and N is the total number of buckets available.
b) The physical storage address of bucket n is obtained (for example, by reference to a look-up table of bucket addresses).
c) If bucket n is not already completely filled with data items, the item is stored in this bucket.
d) If however bucket n is full, the item is stored in the first available space in an overflow area. A pointer could be associated with bucket n to identify the overflow address in which the item is stored.
Similarly, when access to the item is subsequently required, the following steps are performed:
a) The item's key is again hashed to produce the same bucket identifier n.
b) The physical storage address for bucket n is again obtained.
c) The item is searched for in bucket n.
d) If the item is not found in bucket n, the overflow area is searched. This additional searching may considerably slow down the retrieval of the data item.
Several hashing techniques are well known in the data processing art. It is generally necessary first to express the file key as a series of digits. In the case of a key containing alphabetical characters, these digits might for example be the ASCII codes for the characters. Once the key has been expressed numerically, a number of algorithms may be used to obtain the near-random number n from the digits corresponding to the key. In the "mid square method" the numerical value corresponding to the key is squared; the central digits of the result are then scaled to become the number n. In another method, the polynomial division method, each digit in the key is treated as a polynomial coefficient (for example, the key 7562 would become 7x^3 + 5x^2 + 6x^1 + 2x^0). This polynomial is then divided by another, fixed, polynomial, and the remainder from this division is scaled to become the number n. These and several other hashing techniques are described in the book "Computer Data-Base Organization" (James Martin, Prentice Hall 1975).
EP 0268373 describes a hashing method and apparatus in which a database address is obtained by first replacing each character in the key with a number obtained from an unchanging table of random numbers. The locations used in this table of random numbers are calculated directly from each character in the key. The selected numbers are then combined by interactive bit reordering and exclusive-OR operations to form the near-random number n.
A further prior art hashing technique is described in the article "Fast Uniform String Hashing Algorithm" (IBM Technical Disclosure Bulletin No. 10, 1989, p118).
It will be clear from the above description that there are two principal requirements of a good hashing algorithm:
a) The algorithm must be fast. In other words, only a small amount of processing must be required to obtain the bucket address from a particular key; and
b) The algorithm must provide a near-random distribution of bucket addresses from the file keys, even if the keys are systematically related to one another. This feature is necessary to avoid the time-wasting use of overflow areas.
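The storage and retrieval steps (a) to (d) described above, together with the mid square method, can be sketched as follows. This is an illustrative in-memory model only. The class BucketStore, the function names, the choice of six central digits and the use of Python lists in place of physical bucket addresses are all assumptions made for the example; they are not taken from the cited references.

```python
def key_to_number(key: str) -> int:
    """Express the key numerically from the ASCII codes of its characters."""
    return int.from_bytes(key.encode("ascii"), "big")


def mid_square_hash(key: str, num_buckets: int) -> int:
    """Mid square method: square the key value, scale its central digits.

    Assumption: six central decimal digits are used. Returns n in 1..N.
    """
    squared = str(key_to_number(key) ** 2)
    mid = len(squared) // 2
    central = int(squared[max(0, mid - 3): mid + 3])  # at most six digits
    return central * num_buckets // 10 ** 6 + 1


class BucketStore:
    """Hypothetical model of N fixed-size buckets with a shared overflow area."""

    def __init__(self, num_buckets, bucket_capacity, hash_fn=mid_square_hash):
        self.num_buckets = num_buckets
        self.bucket_capacity = bucket_capacity
        self.hash_fn = hash_fn
        self.buckets = [[] for _ in range(num_buckets)]  # stands in for the look-up table
        self.overflow = []

    def _bucket_for(self, key):
        # Steps (a) and (b): hash the key to n, then locate bucket n.
        n = self.hash_fn(key, self.num_buckets)
        return self.buckets[n - 1]

    def store(self, key, item):
        bucket = self._bucket_for(key)
        if len(bucket) < self.bucket_capacity:
            bucket.append((key, item))          # step (c): room in bucket n
        else:
            self.overflow.append((key, item))   # step (d): bucket n is full

    def retrieve(self, key):
        for stored_key, item in self._bucket_for(key):  # step (c): search bucket n
            if stored_key == key:
                return item
        for stored_key, item in self.overflow:          # step (d): slower fallback
            if stored_key == key:
                return item
        return None


store = BucketStore(num_buckets=8, bucket_capacity=2)
store.store("SMITH", "details for Smith")
print(store.retrieve("SMITH"))
```

The linear scan of the overflow list in retrieve reflects the second requirement: the more evenly the hash function spreads keys across buckets, the less often this slower path is taken.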