1. Statement of the Technical Field
The present invention relates to data storage and retrieval systems and more particularly to a method and system for relocating records that hash to the same location in a data table and storing said relocated records in an optimal and efficient manner.
2. Description of the Related Art
Storing records in a data table is a common task. Applications are designed to store and retrieve banking records, credit records, employee records, student records or any other type of record using various search algorithms. Many common techniques search through the data table in order to place records in empty slots or “buckets”. Serial search algorithms and hashing algorithms are two examples.
A serial or linear search algorithm searches through the data table one slot at a time until an available slot is discovered. Thus, starting at the beginning of the table, each slot is examined until an empty slot is found. Of course, this may be very time consuming if the next available slot for a 1,000-location data table is 600 slots away, since 599 slots will have to be checked before an available slot is found.
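The cost of the serial search described above can be illustrated with a minimal Python sketch. The function name `find_empty_slot_linear` and the use of `None` to mark an empty slot are assumptions made for illustration only.

```python
def find_empty_slot_linear(table):
    """Scan from the top of the table and return (index, occupied_checked).

    `index` is the position of the first empty slot (None if the table is
    full) and `occupied_checked` is how many occupied slots were examined
    before it was found.
    """
    for index, slot in enumerate(table):
        if slot is None:
            return index, index  # every earlier slot was occupied
    return None, len(table)


# A 1,000-location table whose first 599 slots are already occupied.
table = ["record"] * 599 + [None] * 401
slot, occupied_checked = find_empty_slot_linear(table)
print(slot, occupied_checked)
```

In this arrangement 599 occupied slots are examined before the available slot is reached, matching the example in the text.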
Hashing is a method that stores data in a data table such that storing, searching, retrieving, inserting and deleting data can be done much faster than by traditional linear search methods. Hashing is very useful in scenarios where data record keys do not map directly into data table locations. As an example, if 100 student ID numbers all fall within a particular value range, e.g., 0 to 99, it would be simple to map each data record into a corresponding slot in the data table. Student ID Number 1 would be mapped into data slot number 1, etc. Thus, each data record, identified by a “key” value, is mapped directly to a corresponding slot, so retrieving the record at a later date would be immediate. However, if the student ID numbers do not range from 0 to 99, but instead range from 0 through 9999, a different situation is presented. A data table comprised of 10,000 slots could be constructed, but this is wasteful since only a small fraction of the table (one-hundredth) would be used to store 100 data records.
In the above example, a hash function can be created to store the 100 records in an array of a much smaller size in order to efficiently store, and later retrieve, each of the records. For example, if the student ID numbers (each ID number is a “key” that identifies each record) were known to be multiples of 100, e.g., 0, 100, 200, . . . 9800, 9900, a hash function could be constructed to store each record in an array comprised of only 100 slots. Therefore, an array called data can store a record with a student ID number “x” at index data [x/100] (where only the quotient is used and not the remainder). Thus, information relating to a student with ID number 600 can be stored in array slot with index number 6, i.e. data [6].
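The quotient-based hash function described above can be sketched in Python as follows. The array name `data` comes from the text; the function name `home_index` and the dictionary used as a record are assumptions for illustration.

```python
TABLE_SIZE = 100


def home_index(student_id):
    """Hash a student ID to an array index using only the quotient of x/100."""
    return student_id // 100


# Store one record for each ID that is a multiple of 100 (0, 100, ..., 9900).
data = [None] * TABLE_SIZE
for student_id in range(0, 10000, 100):
    data[home_index(student_id)] = {"id": student_id}

print(home_index(600))  # index of the slot holding student 600
print(data[6])
```

Because every key in this example is a distinct multiple of 100, each of the 100 records lands in its own slot and the array is filled exactly.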
The result above represents the ideal situation where every key, when hashed, produces a unique index. This is known as perfect hashing and is very difficult to achieve unless the database designer has every record before them prior to creating the data table. The common scenario is that two or more records hash to the identical physical location (i.e., the record's “home address”) in the data table. This is known as a “collision”. In the above example, a collision would occur if one of the student ID numbers were 399 instead of 400. The record with student ID number 300 is stored at index number 3 (300/100=3), but the record corresponding to student ID number 399 also hashes to index number 3 (399/100=3 when only the quotient is used). Two or more records that hash to the identical home address represent what is known as a “chain”. A mechanism is needed to relocate records to available slots in the data table and to link pieces of the chain together.
There are a number of hashing methods that attempt to relocate records of a particular chain. However, each has its drawbacks. One way of measuring the effectiveness of these methods is to compare how many probes (a probe is a physical access of a location in the data table) are needed on average in order to retrieve each record once. For example, a chain of three records that are linked in a simple way so that each probe also identifies the exact location of the next record in the chain would require one probe for the first record, two probes for the second record (a “stop” at the first record before going to the second) and three probes for the third record. This provides an average of two probes (6 probes/3 records) to reach each record once.
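The probe count for a simply linked chain can be checked with a short Python sketch. The function name `average_probes` is an assumption for illustration; the sketch assumes, as in the example above, that reaching the i-th record of a chain costs i probes.

```python
def average_probes(chain_length):
    """Average probes needed to retrieve each record of a linked chain once.

    The i-th record (counting from 1) costs i probes, because every earlier
    record in the chain must be visited first.
    """
    total_probes = sum(range(1, chain_length + 1))
    return total_probes / chain_length


print(average_probes(3))  # 6 probes / 3 records
```

Under this assumption the average grows with the chain length as (n + 1) / 2, which is why long chains are costly to search.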
Another way to compare hashing methods is to examine the amount of extra storage that is required in the table in order to link the chains. As an example of storage for a link field, the table below has seven locations (0-6).
Location   Record   Link
0          47
1          23       3
2
3          52       0
4
5
6
In the table above, three records were inserted in the following order: 23, 52 and 47. All three records are assumed to hash to the same home address of “1”. The table shows that to get to record 47, you must first go to location 1, the target home address for record 47, find that it is not a match for record 47 and then follow the link field (indicated by “3”) to location 3. This process continues until a matching record is found or a blank for the link is found and a conclusion is made that this search was unsuccessful. In the case illustrated above, the extra storage for the link field is three bits since a link of “6” (i.e. binary digits 110) may have to be stored. For a larger table, many more bits would be needed for the link.
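The link-following search described above can be sketched in Python. The sketch assumes the table reads as follows: location 0 holds record 47 with a blank link, location 1 holds record 23 with link 3, location 3 holds record 52 with link 0, and the remaining locations are empty. The function name `lookup` and the tuple layout are assumptions for illustration, and the home address is passed in directly because the text does not specify the hash function.

```python
# Each location holds (key, link); None marks an empty slot or a blank link.
table = [
    (47, None),  # location 0
    (23, 3),     # location 1: home address of 23, 52 and 47
    None,        # location 2
    (52, 0),     # location 3
    None,        # location 4
    None,        # location 5
    None,        # location 6
]


def lookup(table, key, home):
    """Follow link fields from `home`; return (location, probes).

    `location` is None when a blank link or empty slot ends the search.
    """
    location = home
    probes = 0
    while location is not None and table[location] is not None:
        probes += 1
        stored_key, link = table[location]
        if stored_key == key:
            return location, probes
        location = link
    return None, probes


print(lookup(table, 47, home=1))  # visits locations 1, 3, then 0
print(lookup(table, 99, home=1))  # unsuccessful: the chain ends at a blank link

# Bits needed to store the largest possible link value in this table (6).
print((len(table) - 1).bit_length())
```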
A third way of comparing hashing methods is to examine the ease of insertion of new records into the data table. A method that relocates records in the chain away from their home addresses will cause those records to occupy positions in the table that can, in turn, be the home locations for other records. This can result in two or more chains being interlinked, which is referred to as “coalescing”. Coalescing can cause the number of probes to increase, since a search would not only have to traverse a chain of common “home” records, but also the records of another chain that are interspersed with the first chain. Methods for eliminating coalesced chains require that previously inserted records be moved every time two chains are about to coalesce.
Double hashing methods utilize two hash functions. The first hash function produces the home address of the record to be inserted into the data table. A typical algorithm used to determine the home address of the record to be inserted is: HOME=key mod P, where P is the number of positions in the data table and must be a prime number. The second hash function is used to create a variable increment, which is used to skip a number of positions in the data table in an attempt to find an empty slot.
One double hashing technique known as the Linear Quotient method can use the following algorithm to determine the variable increment: INC=1+key mod (P−2). If a new key collides with a key already at its home address, an increment is computed using this function applied to the new key that is to be inserted. This results in a “jump” of a number of positions corresponding to the increment as many times as necessary until an opening in the data table is found. The data table is considered to be circular so that once the bottom of the table is reached, the count wraps around to the top of the table. The Linear Quotient method does not require any link fields but does allow chains to coalesce. On the average, this method requires a high number of probes.
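A minimal Python sketch of the Linear Quotient probing described above is given below. The formulas HOME = key mod P and INC = 1 + key mod (P − 2) come from the text; the function names, the use of `None` for an empty slot, and the sample keys 8, 15 and 22 (which all have home address 1 in a seven-slot table) are assumptions for illustration.

```python
def home_address(key, p):
    """First hash function: HOME = key mod P."""
    return key % p


def increment(key, p):
    """Second hash function: INC = 1 + key mod (P - 2)."""
    return 1 + key % (p - 2)


def insert(table, key):
    """Place `key` in the first empty slot on its probe path.

    Returns (location, probes). The table length P is assumed to be prime,
    so the circular jumps eventually visit every slot.
    """
    p = len(table)
    location = home_address(key, p)
    step = increment(key, p)
    for probes in range(1, p + 1):
        if table[location] is None:
            table[location] = key
            return location, probes
        location = (location + step) % p  # wrap around the circular table
    raise RuntimeError("data table is full")


def search(table, key):
    """Retrace the probe path of `key`; return (location, probes)."""
    p = len(table)
    location = home_address(key, p)
    step = increment(key, p)
    for probes in range(1, p + 1):
        if table[location] is None:
            return None, probes  # an empty slot ends the search
        if table[location] == key:
            return location, probes
        location = (location + step) % p
    return None, p


table = [None] * 7  # P = 7, a prime number
for key in (8, 15, 22):
    print(key, insert(table, key))
print(search(table, 22))
```

No link fields are stored: each key recomputes its own increment during a search. The sketch omits deletion, and nothing prevents a relocated key from occupying the home address of another key, which is how chains come to coalesce.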
Another hashing technique commonly used is the Computed Chaining method. This method can use algorithm INC above, but applied to the key already stored at a location, and uses that increment to jump as many times as necessary to find an empty location in which to place the key. The multiplier of the increment is then stored in the table as a “number of offsets” field. That field normally requires six bits for a table size of approximately 1,000 records. However, this field can be limited to any number of bits by requiring more intermediate probes. Coalesced chains are resolved by moving keys that are in the way of the new key and that are not at their home addresses. However, this requires movement of all the records that followed the moved key in the chain.
Because of the obvious drawbacks of the two aforementioned techniques, it is desirable to have a double hashing data storage system and method that results in optimal data record retrieval performance by lowering the average number of probes and employing a much more efficient method of inserting new keys into the data table.