The present invention relates generally to dynamically managing hash pool data structures. More particularly, the present invention relates to dynamically adding index and contention records to hash data pools in response to data load and data distribution.
Hashing is a technique for storing and retrieving data records. Hash functions provide quick access to individual data cells in a group, via a hash table, without linear searching. Hashing is therefore especially beneficial for computational problems that grow to a very large scale in terms of data volume. A variety of hash functions can be implemented in order to minimize collisions between key values and to minimize processing time for a particular group of data key values. The decision to use a particular hash function is implementation dependent. A collision between key values occurs when the output of the hash function is the same for two or more records. When a collision occurs, one of the key values might, depending on bucket size, be stored in a location other than the one indicated by the hash function. Collision resolution management refers to the process of determining where that other location is and tying the key value into the hash pool data structure in a manner that keeps it accessible at a later point in time.
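By way of illustration only, a collision and the role of bucket size can be sketched as follows. The hash function (a byte sum modulo the table size), the function names, and the list-of-lists table layout are assumptions chosen for brevity; none of them is taken from the invention.

```python
def slot_for(key: str, table_size: int) -> int:
    """Toy hash (assumed for illustration): byte sum modulo the table size."""
    return sum(key.encode("utf-8")) % table_size


def insert(table: list, key: str, bucket_size: int) -> bool:
    """Try to store key in its home bucket.

    Returns True if the key fits in the bucket indicated by the hash
    function, or False if the bucket is full, in which case collision
    resolution would have to place the key elsewhere.
    """
    home = slot_for(key, len(table))
    bucket = table[home]
    if len(bucket) < bucket_size:
        bucket.append(key)
        return True
    return False


# "ab" and "ba" have the same byte sum, so they hash to the same slot.
table = [[] for _ in range(8)]
first_fits = insert(table, "ab", bucket_size=1)
second_fits = insert(table, "ba", bucket_size=1)
```

With a bucket size of one, the second key cannot be stored at its home location; with a bucket size of two, both keys would share the bucket without overflow.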
If the hash table is very large, it may not fit into available memory. A large hash table may be kept on a disk storage device such as DASD (Direct Access Storage Device). When the hash table is kept on disk storage, the hash function, or the operating system in response to a request from the hash function, reads a portion of the hash table and must manage the process of paging the hash table. When this happens, the speed of processing can be diminished because of the extra time for the Input/Output (I/O) operations and the extra processing overhead required to manage the hash table. When the hash records are stored on external storage devices such as DASD, it is important to minimize the number of accesses to the hash records. In that case the hash function may be compute intensive, with the goal of minimizing the number of I/O operations required to insert or retrieve a record from the hash pool data structure. Hash techniques exist that keep a small piece of information in memory to quickly find the hash table entry. If the hash table is very large, even these techniques exhaust main memory, so this associated information must also be placed on DASD, which increases the I/O required.
An index can be used to reduce the required hash table space. This option requires at least one extra I/O operation. The benefit is a reduction in the number of required hash table records that must be pre-allocated. The number of "slots" that can be contained in a hash table record determines how much the hash table space can be reduced. If a four thousand and ninety-six byte (4K) record size is used, and an allowance is made to use ninety-six bytes for system overhead, the 4K record could hold five hundred eight-byte storage addresses. In this scenario, the use of a one level index could reduce the required number of pre-allocated hash table records by a factor of five hundred. A second level index could reduce the number of pre-allocated hash table records by another factor of five hundred, for a total reduction factor of two hundred and fifty thousand. Similarly, a third level index reduces the requirement by a factor of one hundred and twenty-five million but adds three required I/O operations in retrieving a data record if the index records are stored on DASD.
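The arithmetic in this scenario can be checked with a short sketch. The constant and function names are illustrative assumptions; the record size, overhead allowance, and address size are the figures given above.

```python
RECORD_BYTES = 4096    # 4K record size from the scenario above
OVERHEAD_BYTES = 96    # allowance for system overhead
ADDRESS_BYTES = 8      # size of one storage address


def slots_per_record(record_bytes: int = RECORD_BYTES,
                     overhead_bytes: int = OVERHEAD_BYTES,
                     address_bytes: int = ADDRESS_BYTES) -> int:
    """Number of storage addresses that fit in one index record."""
    return (record_bytes - overhead_bytes) // address_bytes


def reduction_factor(index_levels: int) -> int:
    """Factor by which pre-allocated hash table records are reduced.

    Each index level multiplies the reduction by the slots per record,
    and also costs one extra I/O operation when index records are on DASD.
    """
    return slots_per_record() ** index_levels


factors = {levels: reduction_factor(levels) for levels in (1, 2, 3)}
```

Under these figures the reduction grows geometrically with the number of index levels, while the added I/O cost grows only linearly, which is the trade-off the scenario describes.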
Contention of keys to the same hash slot can be handled with record chaining. A look-up routine can detect that there is more than one record and can examine the contents of the record for an exact match. The average number of I/O operations required in chaining can be expressed by the equation "(k+1)/2" where "k" is the length of the chain. The chain can be implemented using a linked list that connects the records together. After a hash function is performed on a key value, a sequential search is performed on the linked list until either the key value is found or the spot to insert the key value is located. The linked list could be sorted in terms of frequency of access or alphabetically to expedite the search.
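A minimal sketch of record chaining follows, assuming that each record visited costs one read and that every key in the chain is looked up equally often. The class and function names are hypothetical. Under those assumptions, the average number of reads for a successful look-up matches (k+1)/2.

```python
class ChainRecord:
    """One record in a chain; 'next' links to the following record."""

    def __init__(self, key, value, next_record=None):
        self.key = key
        self.value = value
        self.next = next_record


def build_chain(items):
    """Build a linked list of records, preserving the order of items."""
    head = None
    for key, value in reversed(items):
        head = ChainRecord(key, value, head)
    return head


def lookup(head, key):
    """Sequentially search the chain.

    Returns (value, reads); value is None if the key is absent.
    Each record examined counts as one read.
    """
    reads = 0
    node = head
    while node is not None:
        reads += 1
        if node.key == key:
            return node.value, reads
        node = node.next
    return None, reads


def average_reads(head, keys):
    """Average reads over successful look-ups of the given keys."""
    return sum(lookup(head, key)[1] for key in keys) / len(keys)


items = [("a", "1"), ("b", "2"), ("c", "3"), ("d", "4"), ("e", "5")]
chain = build_chain(items)
avg = average_reads(chain, [key for key, _ in items])  # (5 + 1) / 2
```

An unsuccessful look-up in this sketch reads every record in the chain, so the (k+1)/2 figure applies to keys that are present.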
Contention of names can also be handled with contention records. A look-up routine can detect that the record is a contention record and that the record contains a table of key values and record addresses. The look-up routine can go through the record looking for a match of the key value, and if found, can use the record address that corresponds to the key value to locate a user data record. Any of the above methods (indexes, contention records, and chaining) can be utilized when contention is encountered in a hash pool data structure. Which one is selected depends on balancing factors that can include the number of I/O operations, the required storage space, the number of hash table entries, and the required speed of data access.
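The contention record look-up described above might be sketched as follows. The dictionary layout, the "type" tag, and the function names are assumptions made for illustration; an actual record format would be implementation dependent.

```python
CONTENTION = "contention"  # hypothetical tag marking a contention record


def make_contention_record(entries):
    """Build a contention record holding (key value, record address) pairs."""
    return {"type": CONTENTION, "entries": list(entries)}


def is_contention_record(record) -> bool:
    """Detect whether a record is a contention record."""
    return record.get("type") == CONTENTION


def find_address(record, key):
    """Scan the record's table for key.

    Returns the record address paired with the key, or None if no
    entry matches. Raises ValueError for other record types.
    """
    if not is_contention_record(record):
        raise ValueError("not a contention record")
    for entry_key, address in record["entries"]:
        if entry_key == key:
            return address
    return None


record = make_contention_record([("smith", 0x10), ("jones", 0x20)])
address = find_address(record, "jones")
```

Because all contending keys sit in one record, a look-up in this sketch needs a single read of the contention record plus one read of the user data record, regardless of how many keys contend.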
An exemplary embodiment of the present invention is a method for dynamically managing a hash pool data structure. A request to insert a new key value into a hash pool data structure that includes at least one index level is received. An insertion location is calculated for the new key value in response to the new key value and to existing key values in the hash pool data structure. The insertion location includes an index level. A new index level is added at the insertion location if the index level is not the maximum number of index levels in the hash pool data structure; if the insertion location contains a chain of existing key values with a length equal to the maximum chain length; and if the new index record locations of the new key value and the existing key values are dispersed. The insertion location is updated in response to adding a new index record and the new key value is inserted into the insertion location. An additional embodiment includes a storage medium for dynamically managing a hash pool data structure.
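The three conditions for adding a new index level can be sketched as a single predicate. This is an illustrative reading under stated assumptions, not the claimed method: the function and parameter names are hypothetical, and "dispersed" is interpreted here as meaning that the keys would occupy more than one slot in the new index record.

```python
def should_add_index_level(current_level, max_levels, chain_keys, new_key,
                           max_chain_length, slot_at_next_level):
    """Decide whether to add a new index level at the insertion location.

    slot_at_next_level is a caller-supplied function (an assumption)
    that maps a key value to its slot in the prospective index record.
    """
    # Condition 1: the maximum number of index levels is not yet reached.
    if current_level >= max_levels:
        return False
    # Condition 2: the chain at the insertion location is already full.
    if len(chain_keys) < max_chain_length:
        return False
    # Condition 3 (assumed meaning of "dispersed"): the new key and the
    # existing keys would not all land in the same slot of the new record.
    slots = {slot_at_next_level(key) for key in [*chain_keys, new_key]}
    return len(slots) > 1


# Toy slot function for demonstration only: key length modulo 4.
def toy_slot(key):
    return len(key) % 4


decision = should_add_index_level(1, 3, ["aa", "bbb"], "c", 2, toy_slot)
```

If the keys would all map to the same slot at the new level, adding the level would only lengthen the access path without shortening the chain, which is why the dispersion check is applied last.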