With any database, it is necessary to have a file structure and access method that enable efficient access to data records. Hashing techniques, which access records by transforming keys to addresses, are well known in the art. A hash table comprises an indexed data file held in storage, the index being used to associate data keys with the addresses of particular data storage blocks or "bins" within the file. An input hash key is transformed by a hashing algorithm (which may be a simple numerical division or a more complex transformation), and the result is then looked up in the index of the hash table held in system memory to obtain the address of the relevant bin within a data file held on a peripheral storage device. A determination is then made of which data elements in this bin are the relevant ones for the key. This determination is known as "disambiguation" or "collision handling", and may be as simple as comparing the hashed key with the stored keys of elements in the file.
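By way of illustration only, the lookup sequence described above (transform the key, consult the in-memory index, fetch the bin, then disambiguate) may be sketched as follows. The names `HashFile`, `hash_key` and `NUM_BINS`, the byte-sum hash function, and the use of Python lists as a stand-in for a data file on peripheral storage are all assumptions made for this sketch and are not taken from any particular system.

```python
from typing import Any, List, Optional, Tuple

NUM_BINS = 8  # assumed fixed number of bins for this sketch


def hash_key(key: str) -> int:
    """Simple division-style hash: sum of the key's bytes modulo the table size."""
    return sum(key.encode("utf-8")) % NUM_BINS


class HashFile:
    """Hypothetical hash file: an in-memory index plus bins standing in for disk blocks."""

    def __init__(self) -> None:
        # Index held in memory: hashed value -> address of a bin in the data file.
        self.index = {h: h for h in range(NUM_BINS)}
        # Stand-in for the data file held on a peripheral storage device.
        self.bins: List[List[Tuple[str, Any]]] = [[] for _ in range(NUM_BINS)]

    def put(self, key: str, value: Any) -> None:
        bin_ = self.bins[self.index[hash_key(key)]]
        for i, (stored_key, _) in enumerate(bin_):
            if stored_key == key:  # key already present: overwrite its value
                bin_[i] = (key, value)
                return
        bin_.append((key, value))

    def get(self, key: str) -> Optional[Any]:
        bin_ = self.bins[self.index[hash_key(key)]]
        # Disambiguation: several keys may share a bin, so compare stored keys.
        for stored_key, value in bin_:
            if stored_key == key:
                return value
        return None
```

In this sketch the keys "ab" and "ba" have the same byte sum and therefore fall into the same bin; the comparison loop in `get` is what tells them apart.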
There are problems with pure hashing techniques in relation to large databases. If bins are too large (i.e. there are too many data elements in each separate data file bin), then the collision handling part of the data access process is too slow. If bins are too small, then the number of bins grows and the hash table itself becomes too large and takes up too much disk space. For a very large database, a conventional hash table will take up too much disk space for any reasonable bin size, and its index may be too large to load into memory. However, hashing is still a very effective access technique for smaller databases.
Hashing was originally used with static data structures ("static" in the sense that the extent and structure of the data remain unchanged during processing and only data values are updated). The first adaptations of such static structures to allow for insertions and deletions required deletion flags and pointers to "overflow bins" which were separate from the main data structure. Frequent, expensive restructuring of the data structure was required (typically when the number of holes left by deletions, and of overflow areas created by insertions, became sufficient to degrade performance significantly). Adaptations of hashing techniques for use with dynamic databases required costly "rehashing" whenever a bin within the hash table became over-full (i.e. when the keys pointing to an individual disk file had too many data items associated with them). Rehashing involves the choice of a new fixed size for the hash table and for the bins within the table, a new hash function, and relocation of all records within the table. Opting for a hash table size and bin size based on a high estimate of the number of records to be placed therein would minimize rehashing frequency but would also result in valuable disk space being wasted. Underestimating hash file storage requirements results in a large number of overflow records (slowing down searching and updating) and in frequent rehashing. More efficient and adaptable file organisations and access techniques are required for dynamic databases.
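The cost of rehashing may be illustrated by the following minimal sketch, in which a fixed-size table is rebuilt whenever a bin becomes over-full. The class name `StaticHashTable`, the modulo hash function, the size-doubling policy and the restriction to distinct integer keys are assumptions chosen for brevity; the point shown is only that every record in the table is relocated on each rehash.

```python
from typing import Any, List, Optional, Tuple


class StaticHashTable:
    """Hypothetical fixed-size hash table that rehashes when a bin is over-full."""

    def __init__(self, num_bins: int = 4, bin_capacity: int = 2) -> None:
        self.num_bins = num_bins
        self.bin_capacity = bin_capacity
        self.bins: List[List[Tuple[int, Any]]] = [[] for _ in range(num_bins)]
        self.rehash_count = 0   # number of rehash operations performed
        self.records_moved = 0  # total records relocated by rehashing

    def _address(self, key: int) -> int:
        # The hash function depends on the table size, so it changes on rehash.
        return key % self.num_bins

    def get(self, key: int) -> Optional[Any]:
        for stored_key, value in self.bins[self._address(key)]:
            if stored_key == key:
                return value
        return None

    def insert(self, key: int, value: Any) -> None:
        bin_ = self.bins[self._address(key)]
        for i, (stored_key, _) in enumerate(bin_):
            if stored_key == key:
                bin_[i] = (key, value)
                return
        bin_.append((key, value))
        if len(bin_) > self.bin_capacity:
            self._rehash()

    def _rehash(self) -> None:
        # Collect every record in the table: all of them must be relocated.
        records = [record for bin_ in self.bins for record in bin_]
        while True:
            # Assumed policy: double the table size until no bin is over-full.
            self.num_bins *= 2
            self.bins = [[] for _ in range(self.num_bins)]
            for key, value in records:
                self.bins[self._address(key)].append((key, value))
            self.records_moved += len(records)
            if all(len(b) <= self.bin_capacity for b in self.bins):
                break
        self.rehash_count += 1
```

With four bins of capacity two, inserting the keys 0, 4 and 8 places all three in bin 0, which triggers a rehash to eight bins and relocates all three records.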
An access method dubbed "extendible hashing" was described by R. Fagin et al. in "Extendible Hashing--A Fast Access Method for Dynamic Files", ACM Transactions on Database Systems, Vol. 4, No. 3, September 1979, pages 315-344. This paper describes a particular adaptation of hashing which makes hash tables extendible by separating the hash address space from a directory address space. The hash table, which includes a directory with each entry pointing to a disk file or bin within the table, is extendible since an additional bin is added if a bin overflows, and the directory is extended to use an additional bit of information from each input hash key to distinguish between the increased number of bins. The most significant bits of the hash address are used as the index to the address space, and the number of significant bits used is increased when an overflowing bin cannot be split using the bits already in use. This extension is done without rehashing the whole hash table. The capacity of each bin is kept constant, such that the hashing algorithm is unchanged. There is a problem with this solution since the directory doubles in size each time an additional bit is brought into use, and so the hash table, which exists to enable efficient access to data elements, may itself take up too much of the available disk space. This is a particular problem for very large databases to which data records are added randomly and for which the hash table is too large to be held in memory. Furthermore, certain operating systems impose a restriction on the maximum size of a hash table, such that the table is not extensible beyond that maximum.
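A minimal in-memory sketch of an extendible hashing directory of this general kind is given below for illustration. As the scheme is commonly presented, each bin records how many leading bits it has been split on (its "local depth"), and the directory doubles only when an overflowing bin is already distinguished by every bit the directory currently uses. The names `ExtendibleHash`, `Bin`, `HASH_BITS` and `BIN_CAPACITY`, the 8-bit hash address and the use of integer keys are assumptions of this sketch rather than details of the cited paper.

```python
from typing import Any, Dict, List, Optional

HASH_BITS = 8      # assumed width of the hash address
BIN_CAPACITY = 2   # assumed constant bin capacity


def hash_address(key: int) -> int:
    """Assumed hash function: the low 8 bits of an integer key."""
    return key & ((1 << HASH_BITS) - 1)


class Bin:
    def __init__(self, local_depth: int) -> None:
        self.local_depth = local_depth      # leading bits shared by all keys here
        self.records: Dict[int, Any] = {}


class ExtendibleHash:
    def __init__(self) -> None:
        self.global_depth = 0               # leading bits used by the directory
        self.directory: List[Bin] = [Bin(0)]

    def _slot(self, key: int) -> int:
        # Index the directory by the most significant global_depth bits.
        return hash_address(key) >> (HASH_BITS - self.global_depth)

    def get(self, key: int) -> Optional[Any]:
        return self.directory[self._slot(key)].records.get(key)

    def put(self, key: int, value: Any) -> None:
        while True:
            bin_ = self.directory[self._slot(key)]
            if key in bin_.records or len(bin_.records) < BIN_CAPACITY:
                bin_.records[key] = value
                return
            self._split(bin_)               # overflow: split, then retry

    def _split(self, old: Bin) -> None:
        if old.local_depth == HASH_BITS:
            raise OverflowError("bin is full and no hash bits remain to split on")
        if old.local_depth == self.global_depth:
            # Double the directory; each old entry now appears twice in a row.
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.global_depth += 1
        depth = old.local_depth + 1
        zero, one = Bin(depth), Bin(depth)
        # Redistribute only the overflowing bin's records on the next bit.
        for k, v in old.records.items():
            bit = (hash_address(k) >> (HASH_BITS - depth)) & 1
            (one if bit else zero).records[k] = v
        # Repoint the directory entries that referred to the old bin.
        shift = self.global_depth - depth
        for i, b in enumerate(self.directory):
            if b is old:
                self.directory[i] = one if (i >> shift) & 1 else zero
```

Only the records of the overflowing bin are moved on a split; the rest of the table is untouched, which is the advantage over rehashing. The directory, however, grows as a power of two in the number of bits in use, which is the memory cost discussed above.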
The Fagin et al. paper also describes the same solution from a different perspective, referred to as a "balancing" of radix search trees. Radix search trees examine an input key one digit or letter at a time, and the search focuses on a particular branch of the tree at each step through the tree hierarchy. Radix trees can provide faster access than search trees which compare whole keys, but they typically use more memory space. Fagin et al. propose a "flattened" directory structure which seeks to maximize access speed by flattening the directory structure such that a single pointer from the directory accesses a required file. The depth of the directory (how many levels are flattened into one, and so how many significant bits of a key are used in a search) can be varied as necessary to guarantee access to a required file in a single probe. Each time a "leaf" (a page of memory at the bottom of the directory hierarchy) overflows, requiring a new directory level, the directory is doubled in size to extend it while keeping a flattened structure. The radix tree has thus degenerated into a one-step access mechanism for maximum speed, but at considerable cost in memory unless the keys are very uniformly spread over the key space.
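The flattening of a binary radix tree into a single-level directory may be sketched as follows, purely for illustration. The representation of the tree as nested two-element tuples with leaf names at the bottom, and the function names `flatten` and `lookup`, are assumptions made for this sketch.

```python
from typing import Any, List


def flatten(node: Any, depth: int) -> List[Any]:
    """Return a directory of 2**depth leaf references.

    Entry i is the leaf reached by following the bits of i, most
    significant first. A node is either a leaf (any non-tuple value) or a
    pair (zero_branch, one_branch).
    """
    if not isinstance(node, tuple):
        # A leaf above the bottom level is repeated in every slot it covers.
        return [node] * (1 << depth)
    if depth == 0:
        raise ValueError("depth is smaller than the height of the tree")
    zero_branch, one_branch = node
    return flatten(zero_branch, depth - 1) + flatten(one_branch, depth - 1)


def lookup(directory: List[Any], depth: int, key_bits: str) -> Any:
    """Single-probe access using the leading `depth` bits of the key."""
    return directory[int(key_bits[:depth], 2)]


# An unbalanced tree with four leaves, the deepest at level 3.
tree = ("A", ("B", ("C", "D")))
directory = flatten(tree, 3)
```

Here the four leaves require a directory of eight entries, of which four point to the same leaf "A". This is the memory cost referred to above: the directory size is set by the deepest leaf, so keys that are not uniformly spread inflate the directory.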
P. A. Larson's paper "Dynamic Hashing", BIT, Vol. 18, No. 2 (1978), pages 184-201, describes a further file organisation based on hashing in which the allocated storage space can be increased and decreased without rehashing the whole file. This is achieved by providing an index to a data file, the index being organised as a series of binary hash trees whose nodes include pointers to particular buckets within the data file. The problem which arises with this file organisation is that the data file can become difficult to manage as a contiguous file as data is added and the file grows, and so this solution is not well suited to large databases.
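An index organised as a series of binary trees, in which an overflowing bucket is split by turning its leaf into an internal node with two children, may be sketched loosely as follows. This is an illustrative sketch under stated assumptions and not a reproduction of the method of the cited paper: the names `DynamicHashIndex`, `Node`, `NUM_TREES` and `BUCKET_CAPACITY` are hypothetical, keys are assumed to be distinct non-negative integers, the branch bits are taken directly from the key for determinism, and each leaf holds its bucket in memory rather than a pointer into a data file.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple

NUM_TREES = 2        # assumed number of binary trees in the index
BUCKET_CAPACITY = 2  # assumed bucket capacity


class Node:
    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.bucket: Optional[Dict[int, Any]] = {}   # set to None once split
        self.children: Optional[Tuple["Node", "Node"]] = None


class DynamicHashIndex:
    def __init__(self) -> None:
        self.roots: List[Node] = [Node(0) for _ in range(NUM_TREES)]

    def _leaf(self, key: int) -> Node:
        # First choose a tree, then follow one bit of the key per level.
        node = self.roots[key % NUM_TREES]
        bits = key // NUM_TREES
        while node.children is not None:
            node = node.children[bits & 1]
            bits >>= 1
        return node

    def get(self, key: int) -> Optional[Any]:
        return self._leaf(key).bucket.get(key)

    def put(self, key: int, value: Any) -> None:
        leaf = self._leaf(key)
        leaf.bucket[key] = value
        # Split until the bucket holding the new key is within capacity.
        while len(leaf.bucket) > BUCKET_CAPACITY:
            self._split(leaf)
            leaf = self._leaf(key)

    def _split(self, leaf: Node) -> None:
        # The leaf becomes an internal node; only its records are moved.
        leaf.children = (Node(leaf.depth + 1), Node(leaf.depth + 1))
        for k, v in leaf.bucket.items():
            bit = ((k // NUM_TREES) >> leaf.depth) & 1
            leaf.children[bit].bucket[k] = v
        leaf.bucket = None

    def leaves(self) -> Iterator[Node]:
        stack = list(self.roots)
        while stack:
            node = stack.pop()
            if node.children is None:
                yield node
            else:
                stack.extend(node.children)
```

As in the organisation described above, growth is local: a split touches one bucket and adds two index nodes, and no other records are relocated.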