Data are commonly stored in a database in a relation or table, where rows in the table are called tuples and columns are attributes having unique names. A named attribute in a specific tuple is referred to as an item. A key is a subset of attributes whose values are used to uniquely identify a tuple.
One method for storing and retrieving tuples involves ordering the tuples sequentially based on their keys, for example, in alphabetical or numerical order. However, even the best technique for searching for a tuple in a sorted table requires an average of at least log₂ M probes, where M is the size of the table. For example, an average of at least 10 probes is required to find a tuple in a table having 1024 entries. One can do much better than this using a table referred to as a scatter storage table.
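The logarithmic cost of searching a sorted table can be checked with a minimal sketch. The function name binary_search_probes and the convention of counting each comparison against a table entry as one probe are assumptions made for illustration; the exact average depends on how probes are counted, but it stays close to log₂ M.

```python
import math

def binary_search_probes(table, key):
    """Search a sorted table; return (index, probes), or (-1, probes) if absent."""
    lo, hi = 0, len(table) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1                  # one probe = one table entry examined
        if table[mid] == key:
            return mid, probes
        if table[mid] < key:
            lo = mid + 1             # key lies in the upper half
        else:
            hi = mid - 1             # key lies in the lower half
    return -1, probes

# A sorted table of M = 1024 numeric keys (illustrative data).
table = list(range(1024))
probe_counts = [binary_search_probes(table, k)[1] for k in table]
average_probes = sum(probe_counts) / len(probe_counts)
max_probes = max(probe_counts)
log2_m = math.log2(len(table))       # 10.0 for M = 1024
```

Comparing average_probes and max_probes with log2_m shows that the search cost grows with the logarithm of the table size, whereas a scatter storage table aims for a cost that is largely independent of M.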
The fundamental idea behind scatter storage is that the key associated with the desired tuple is used to locate the tuple in storage. Some transformation is performed on the key to produce the database address where the tuple is stored. Such transformations are referred to as "hash functions" and the data storage and retrieval procedures associated with scatter storage are known as data hashing. A good hash function is one that spreads the calculated addresses uniformly across the available addresses. If a calculated address is already filled with another tuple because two keys happen to be transformed into the same address, a method of resolving key collisions is used to determine where the second tuple is stored. For example, with a known method referred to as linear probing, the second tuple is stored at the next available address.
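Scatter storage with linear probing can be sketched as follows. The class name ScatterTable, the table size of 8, and the additive hash function are all assumptions chosen to keep the example small and to force a collision; they are not drawn from any particular system.

```python
class ScatterTable:
    """Fixed-size scatter storage table that resolves collisions by linear probing."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size   # each slot holds (key, row) or None

    def address(self, key):
        # Illustrative hash function (an assumption): sum of character codes,
        # reduced to a table address by taking the remainder.
        return sum(ord(ch) for ch in key) % self.size

    def store(self, key, row):
        """Store row under key; return the address actually used."""
        start = self.address(key)
        for step in range(self.size):
            probe = (start + step) % self.size
            slot = self.slots[probe]
            if slot is None or slot[0] == key:
                self.slots[probe] = (key, row)
                return probe
        raise RuntimeError("table is full")

    def retrieve(self, key):
        """Return the row stored under key, or None if it is absent."""
        start = self.address(key)
        for step in range(self.size):
            probe = (start + step) % self.size
            slot = self.slots[probe]
            if slot is None:         # an empty slot ends the probe sequence
                return None
            if slot[0] == key:
                return slot[1]
        return None

table = ScatterTable(8)
first_address = table.store("ab", ("ab", 1))
# "ba" has the same character sum as "ab", so the two keys collide and
# the second tuple is placed at the next available address.
second_address = table.store("ba", ("ba", 2))
found = table.retrieve("ba")
```

Here both keys hash to the same address, so the second tuple lands one slot further on, and retrieval follows the same probe sequence to find it.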
A number of hash functions have been developed for transforming keys comprising character strings. Such functions are useful, for example, in hashing symbol tables for computer program compilers and typically form various numerical or logical combinations of bits representing the string characters to determine a positive integer. The database address is then derived as a function of the positive integer. For example, the database address may be obtained by dividing the positive integer by a constant and taking the remainder. However, with known hash functions, patterns in the data frequently result in a significant clustering of data rather than a uniform distribution. When the data are stored in pages on a disk, the clustering can result in many tuples being stored on overflow pages other than those defined by the calculated addresses. This leads to slow data access since retrieving particular tuples will frequently require more than one disk access operation. Increasing the total available storage such that the data are more sparsely packed can improve data access times but only at substantial expense.
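The clustering effect can be illustrated with a minimal sketch. The additive hash, the 64 page addresses, the page capacity of 4 tuples, and the symbol names sym000 through sym099 are all assumptions chosen for illustration; they stand in for the kind of patterned keys found in a compiler symbol table.

```python
from collections import Counter

TABLE_SIZE = 64      # assumed number of page addresses
PAGE_CAPACITY = 4    # assumed number of tuples that fit on one page

def additive_hash(key):
    # Combine the string characters into a positive integer (illustrative choice).
    return sum(ord(ch) for ch in key)

def page_address(key):
    # Derive the address by dividing by a constant and taking the remainder.
    return additive_hash(key) % TABLE_SIZE

# Patterned keys: a common prefix followed by a three-digit number.
keys = [f"sym{i:03d}" for i in range(100)]

load = Counter(page_address(k) for k in keys)
pages_used = len(load)                # distinct addresses actually hit
fullest_page = max(load.values())     # tuples mapped to the busiest address
overflow_tuples = sum(max(0, n - PAGE_CAPACITY) for n in load.values())
```

Because the keys differ only in their digits, their character sums fall in a narrow range: the 100 keys land on only 19 of the 64 addresses, and 36 of them exceed the assumed page capacity and would spill to overflow pages.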
In view of the foregoing, a recognized problem in the art is the clustering of data that occurs when character strings are hashed by performing manipulations and combinations of the individual characters.