Storing records in a data table and retrieving the records are common tasks. Various data structures, table organizations, and access techniques have been utilized to determine a location for storing an element of data and to determine the location in which an element of data has been stored. In general, the data may be stored in a table of records or elements, where each element has a collection of fields associated with it. In the table, each field is associated with one of a number of attributes that, together, make up the element. One of the attributes is the “key” that refers to the element and on which the searching is based. Various techniques for organizing a table include lists, binary search trees, digital search trees and hash tables.
A serial or linear search algorithm searches through the data table one slot at a time until an available slot is discovered. Thus, starting at the beginning of the table, each slot is examined until an empty slot is found. Of course, this may be very time consuming if the next available slot for a 1,000-location data table is 600 slots away, since 599 slots will have to be checked before an available slot is found.
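The serial search described above may be sketched as follows. This is a minimal illustration only; the function name `find_empty_slot` and the use of `None` to mark an empty slot are assumptions made for the example and do not come from the text.

```python
def find_empty_slot(table, start=0):
    """Scan the table one slot at a time and return the index of the first
    empty slot at or after `start`, or None if no slot is available.
    An empty slot is represented by None (an assumption of this sketch)."""
    for index in range(start, len(table)):
        if table[index] is None:
            return index
    return None


# A 1,000-location table whose first 599 slots are occupied.
table = ["record"] * 599 + [None] * 401
index = find_empty_slot(table)
occupied_slots_checked = index  # every slot before the empty one was examined
print(index, occupied_slots_checked)
```

In this sketch the cost of an insertion grows with the number of occupied slots that precede the first empty one, which is the drawback the paragraph above points out.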
In hash tables, an element is stored in a table location that is computed directly from the key of the element. That is, the key is provided as an input to a hash function, h, which transforms the key into an index into the table. The table location at that index is known as the home address of the element. For example, a database may contain 50 records of people, with social security numbers as the key or ID number. A hash function that maps the keys onto a hash table of 100 elements is:
h(social_security_number) = social_security_number mod 100
That is, the hash function of a social security number is the rightmost two digits of the number. For example, h(123456789) = 89.
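The example hash function may be written directly in code. The sketch below assumes the key is held as an integer; the constant name `TABLE_SIZE` is chosen for illustration.

```python
TABLE_SIZE = 100  # number of locations in the example hash table


def h(social_security_number: int) -> int:
    """Return the home address of a key: its remainder modulo the table size,
    which for a table of 100 locations is the rightmost two digits."""
    return social_security_number % TABLE_SIZE


print(h(123456789))  # 89
```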
If the location of the table addressed by the index (represented here as h[key]) is empty, then the element may be stored there. In the ideal situation every key, when hashed, produces a unique index. This situation, known as perfect hashing, is very difficult to achieve unless a data table designer knows beforehand details of the records to be stored or the hash table size is large with respect to the number of data elements to be stored. Often, however, two or more records may hash to the identical physical location, the records' home address in the data table. This is known as a collision. In the above example, a collision would occur if a second social security number were 765432189. Both keys would hash to 89. When a collision occurs among a group of records, the records may be stored in a chain joined together by links. A first record may be stored at the home address, along with a link to the address of the second record. A link stored with the second record may point to a third record, and so on. These linked records represent what is known as a chain. A mechanism is needed to relocate colliding records to available slots in the data table and to link pieces of the chain together.
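One possible form of such a mechanism is sketched below, purely for illustration. The parallel lists `keys` and `links`, the function name `insert`, and in particular the policy of relocating a colliding record to the first empty slot found by scanning from location 0 are assumptions of this sketch; the text does not prescribe any of them.

```python
TABLE_SIZE = 100


def h(key):
    """Hash function from the example: rightmost two digits of the key."""
    return key % TABLE_SIZE


def insert(keys, links, key):
    """Store `key` at its home address if that location is empty. Otherwise
    walk to the end of the chain that starts there, place the record in an
    empty slot, and link it to the chain. Returns the slot that was used."""
    slot = h(key)
    if keys[slot] is None:
        keys[slot] = key
        return slot
    # Collision: follow the links to the last record of the chain.
    while links[slot] is not None:
        slot = links[slot]
    # Assumed relocation policy: first empty slot, scanning from location 0.
    free = next((i for i, k in enumerate(keys) if k is None), None)
    if free is None:
        raise OverflowError("hash table is full")
    keys[free] = key
    links[slot] = free
    return free


keys = [None] * TABLE_SIZE
links = [None] * TABLE_SIZE
first = insert(keys, links, 123456789)   # stored at home address 89
second = insert(keys, links, 765432189)  # collides at 89 and is relocated
print(first, second, links[89])
```

With the two example keys, the first record occupies home address 89 and the second is relocated, with the link at location 89 pointing to its new location.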
Although there are a number of methods that attempt to relocate records of a particular chain and link the records together, the methods have drawbacks. One method of measuring the effectiveness of these methods is to compare how many probes (a probe is a physical access of a location in the data table) are needed on average in order to retrieve each record once. For example, a chain of three records that are linked in a simple way so that each probe also identifies the exact location of the next record in the chain would require one probe for the first record, two probes for the second record (a stop at the first record before going to the second) and three probes for the third record. This provides an average of two probes (6 probes/3 records) to reach each record once. Current methods of hashing may produce long chains. When a collision is detected, these methods may examine only one other location for insertion of one of the colliding records, placing a link from the current location to the other location. If the hash table is relatively full, these methods may create a long chain before finding an empty location.
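The probe count for a simply linked chain can be checked with a short calculation. In the sketch below, the function name `average_probes` is chosen for illustration, and the chain is assumed to be linked so that reaching the k-th record costs k probes, as in the three-record example above.

```python
def average_probes(chain_length: int) -> float:
    """Average number of probes needed to retrieve each record of a simply
    linked chain once, where the k-th record costs k probes."""
    total_probes = sum(range(1, chain_length + 1))  # 1 + 2 + ... + n
    return total_probes / chain_length


print(average_probes(3))  # 2.0, that is, 6 probes / 3 records
```

Under this assumption the average is (n + 1) / 2 for a chain of n records, so the retrieval cost grows in proportion to the chain length.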
Another way to compare hashing methods is to examine the amount of extra storage that is required in a hash table in order to link the chains. Some methods may use the full address of a location in a table as a link. The links then require the same amount of storage as the addresses. As an example of storage for a link field, the table below has seven locations (0-6), and the links specify the entire address.
Address    Value    Link
000 (0)    47
001 (1)    23       011 (3)
010 (2)
011 (3)    52       000 (0)
100 (4)
101 (5)
110 (6)
In the table above, three records were inserted in the following order: 23, 52 and 47. All three records are assumed to hash to the same home address of 1. The table shows that to access record 47, one must first examine location 1, the home address for record 47, find that the record stored there is not a match for record 47, and then follow the link field (indicated by 3) to location 3. Again, the record there is not a match, and the link at location 3 is followed to location 0, where a match is found. In general, the process of following links continues until a matching record is found or a blank link is found, in which case the search is concluded to be unsuccessful. In the case illustrated above, the extra storage for the link field is three bits, since the link is the same size as the address. For a larger table in which the link is the same size as the address, many more bits would be needed for the link. Some tables may require 20 bits or more for addresses.
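The search through the example table may be sketched as follows. The sketch assumes the table holds record 47 at location 0, record 23 at location 1 with a link to location 3, and record 52 at location 3 with a link to location 0. The names `values`, `links`, `search` and `link_bits` are illustrative only, and a blank link is represented by `None`.

```python
# Seven-location example table; None marks an empty value or a blank link.
values = [47, 23, None, 52, None, None, None]
links = [None, 3, None, 0, None, None, None]
HOME = 1  # all three records are assumed to hash to home address 1


def search(key, home=HOME):
    """Follow links from the home address until the key is found or a blank
    link is reached. Returns (location or None, number of probes)."""
    slot, probes = home, 0
    while slot is not None:
        probes += 1
        if values[slot] == key:
            return slot, probes
        slot = links[slot]
    return None, probes


def link_bits(table_size):
    """Bits needed for a link that stores a full address in the table."""
    return (table_size - 1).bit_length()


print(search(47))          # (0, 3): locations 1, 3 and 0 are probed
print(search(99))          # (None, 3): blank link reached, unsuccessful
print(link_bits(7))        # 3
print(link_bits(2 ** 20))  # 20
```

The last two lines illustrate the storage cost: a full-address link needs three bits for seven locations and 20 bits for a table of about one million locations.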
A third way to compare hashing methods is to examine the ease of insertion of new records into the data table. A method that relocates records in a chain away from their home addresses will cause those records to occupy positions in the table that can, in turn, be the home addresses of other records. This can result in two or more chains being interlinked, which is referred to as coalescing. Coalescing can cause the number of probes to increase, since a search would have to traverse not only a chain of records with a common home address, but also the records of another chain that are interspersed with the first chain.
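Coalescing can be reproduced with a small experiment. The sketch below is an illustration under stated assumptions, not a method taken from the text: it assumes a seven-location table, a hash function of the key modulo 7, and a relocation policy that places a colliding record in the first empty slot and appends it to the chain found at its home address. All names are hypothetical.

```python
SIZE = 7
keys = [None] * SIZE
links = [None] * SIZE


def h(key):
    return key % SIZE  # assumed hash function for this sketch


def insert(key):
    """Store at the home address, or relocate to the first empty slot and
    link the record to the end of the chain found at the home address."""
    slot = h(key)
    if keys[slot] is None:
        keys[slot] = key
        return slot
    while links[slot] is not None:
        slot = links[slot]
    free = keys.index(None)  # raises ValueError if the table is full
    keys[free] = key
    links[slot] = free
    return free


def probes_to_find(key):
    """Count probes made while following links from the key's home address."""
    slot, probes = h(key), 0
    while slot is not None:
        probes += 1
        if keys[slot] == key:
            return probes
        slot = links[slot]
    return probes  # unsuccessful search


for k in (8, 15, 7):  # 8 and 15 have home address 1; 7 has home address 0
    insert(k)

print(keys[:3], links[:3])
print(probes_to_find(7), probes_to_find(22))
```

Here record 15 is relocated to location 0, which is the home address of record 7, so the two chains merge. Record 7 is the only record with home address 0, yet two probes are needed to reach it, and an unsuccessful search for key 22 (home address 1) also probes record 7, which belongs to the other chain.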