1. Field of the Invention
The invention relates to a system and method for locating and retrieving information from very large databases. Such a system and method are particularly useful, for example, with electronic mail systems, which require fast retrieval of user directory information for routing large volumes of messages.
2. Discussion of the Prior Art
Management of information has become critical to modern civilization. Information that is collected together in an organized fashion is referred to as a database. A database file conventionally consists of a group of records, with each record then being subdivided into one or more fields. For example, a database for routing e-mail might contain the database records:
        smith@company.com|Jane|Smith|password|/usr/js/mailbox
        jjones@company.com|John|Jones|secret|/usr/jj/mailbox
In this example, each database record consists of five fields: an e-mail address field, a first name field, a last name field, a mail password field, and a mailbox location field. The record terminator is a new line character and the field terminator is the | character. Each record may have a different location in memory. The record offset (i.e., the position of the record in the database memory relative to a reference point) for the record containing the term “smith@company.com” may be 0, for example, while the record offset for the record containing the term “jjones@company.com” may be 54.
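The record layout and offsets described above can be sketched as follows. This is only an illustration, assuming one byte per character and counting each record's new line terminator toward the next record's offset; the names DATA, record_offsets, and read_record are hypothetical and do not come from the text.

```python
# The two sample records from the text. "|" terminates a field and
# "\n" terminates a record.
DATA = (
    "smith@company.com|Jane|Smith|password|/usr/js/mailbox\n"
    "jjones@company.com|John|Jones|secret|/usr/jj/mailbox\n"
)


def record_offsets(data):
    """Return the starting offset of every record in data."""
    offsets = []
    pos = 0
    while pos < len(data):
        offsets.append(pos)
        end = data.find("\n", pos)
        if end == -1:  # last record has no terminator
            break
        pos = end + 1  # next record starts after the terminator
    return offsets


def read_record(data, offset):
    """Return the list of fields for the record starting at offset."""
    end = data.find("\n", offset)
    if end == -1:
        end = len(data)
    return data[offset:end].split("|")


offsets = record_offsets(DATA)
fields = read_record(DATA, offsets[1])
```

Under these assumptions, the first record occupies 53 characters plus its terminator, so the second record begins at offset 54.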
The information within one or more of the fields for a record may be used to locate and retrieve the record from the database. The field information used to retrieve a record is commonly called the key, and the field in which this key information is stored is called the key field. In the e-mail routing database described above, for example, the key might be the user's e-mail address. Thus, when someone wanted to retrieve a user's mailbox location, he or she could employ the user's e-mail address to locate and retrieve the database record containing the same e-mail address (i.e., a matching key) in its key field. Preferably, a database is optimized so that record retrieval based on each record's key is fast and efficient.
One method of locating and retrieving a record from a database is to sequentially access and search each record's key field until a matching key is located. This method of record retrieval is referred to as a linear search. However, as the number of records in a database increases, it is neither fast nor efficient to sequentially examine each record to find the one with a matching key. To improve record retrieval speed in a larger database, an index table is often built for the database.
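A linear search of this kind might look like the sketch below. The function name linear_search and the comparison counter are illustrative additions; the counter is included only to show that the work grows with the position of the matching record.

```python
RECORDS = [
    "smith@company.com|Jane|Smith|password|/usr/js/mailbox",
    "jjones@company.com|John|Jones|secret|/usr/jj/mailbox",
]


def linear_search(records, key, key_field=0):
    """Examine each record in turn until one has a matching key.

    Returns (record, comparisons). record is None when no key matches.
    """
    comparisons = 0
    for record in records:
        comparisons += 1
        if record.split("|")[key_field] == key:
            return record, comparisons
    return None, comparisons


record, comparisons = linear_search(RECORDS, "jjones@company.com")
```

A search for a key that is not present must examine every record, which is why this approach scales poorly as the database grows.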
The use of index tables to improve database record retrieval speed is well known in the art. One method of employing index tables is the indirect accessing method. With the indirect accessing method, only a pointer list is accessed directly. Each pointer in the list identifies the location of a record in memory, and the pointer's position in the list is defined by that record's key. Thus, a key can be used to quickly obtain the pointer, and thus the address, for the record with the matching key.
According to this method, the index table can directly index each record's pointer by that record's key. If the key information has a large number of possible values, however, then the index table will require a correspondingly large amount of memory. For this reason, index tables typically use a key-to-address transformation algorithm to index the pointers. That is, the pointer for each record is indexed by a mathematical transformation of the record's key, rather than by the key itself.
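The memory cost of indexing pointers directly by key can be seen in a small sketch. It assumes, purely for illustration, records keyed by a hypothetical four-digit numeric user ID rather than by e-mail address, so that the key itself can serve as a position in the pointer list. The names build_direct_index and fetch are likewise hypothetical.

```python
# Hypothetical records keyed by a four-digit numeric ID.
DATA = "1007|Jane|Smith\n0042|John|Jones\n"


def build_direct_index(data, key_space):
    """Build a pointer list with one slot per possible key value."""
    table = [None] * key_space
    offset = 0
    for line in data.splitlines(keepends=True):
        key = int(line.split("|")[0])
        table[key] = offset  # the key itself is the list position
        offset += len(line)
    return table


def fetch(data, table, key):
    """Follow the pointer for key and return the record text."""
    offset = table[key]
    if offset is None:
        return None
    end = data.index("\n", offset)
    return data[offset:end]


# Four digits allow 10,000 key values, so the list needs 10,000
# slots even though only two records exist.
index = build_direct_index(DATA, key_space=10_000)
used = sum(slot is not None for slot in index)
```

Lookup is a single list access, but the list is sized by the number of possible keys rather than the number of records. For a key such as an e-mail address, the space of possible values is far too large for this to be practical.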
The key-to-address transformation is often performed using a hash (or hashing) function. A hash function is any process that maps data to a numerical value. For example, one hash function may convert the characters of a key into their ASCII values, add the ASCII values together, and then divide the sum by a prime number to produce a remainder as the hash value. Because the hash function can be selected to limit the maximum possible hash value of a key, indexing the records against hash values reduces the amount of memory required for the index table.
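The example hash function described above can be written in a few lines. The function name ascii_sum_hash and the choice of 101 as the prime are assumptions made for this sketch, and the keys are assumed to contain only ASCII characters.

```python
def ascii_sum_hash(key, prime=101):
    """Sum the ASCII values of the key's characters and return the
    remainder after dividing the sum by a prime number."""
    total = sum(ord(ch) for ch in key)
    return total % prime


value = ascii_sum_hash("jjones@company.com")
```

Whatever the length of the key, the result always lies between 0 and prime - 1, so an index table of that many slots suffices.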
The use of a hash function presents an additional problem, however. A hash function may generate the same hash value for two different keys from two different records. This is referred to as a collision. When this occurs, the hash value cannot be used to uniquely identify the location of the record in the database.
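A collision is easy to produce with a hash that sums ASCII values, because any two keys containing the same characters in a different order have the same sum. The sketch below assumes such a hash with a hypothetical prime of 101; the keys and the helper find_collisions are made up for illustration.

```python
from collections import defaultdict


def ascii_sum_hash(key, prime=101):
    """Remainder of the sum of ASCII values divided by a prime."""
    return sum(ord(ch) for ch in key) % prime


def find_collisions(keys, prime=101):
    """Group keys by hash value and keep the groups with more than one key."""
    groups = defaultdict(list)
    for key in keys:
        groups[ascii_sum_hash(key, prime)].append(key)
    return {h: ks for h, ks in groups.items() if len(ks) > 1}


# "ab@..." and "ba@..." contain the same characters, so they collide.
collisions = find_collisions(["ab@company.com", "ba@company.com"])
```

Both keys map to the same index table position, so the hash value alone cannot say which record is wanted.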
Several methods for resolving such collisions are described in the prior art. The separate chaining method creates a linked list of records whose keys have the same hash value. Once a hash value is obtained from a key during a search, each linked record for that hash value can be reviewed until a matching key is found. With the linear probing method, the hash value identifies a specific location in the index table. If this location does not contain a matching key (or an address for a record with a matching key), each subsequent memory location is probed until a matching key (or an address for a record with a matching key) is found.
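Both methods can be sketched briefly. The classes below are illustrative only: the chain is held in a Python list rather than a true linked list, the probe sequence is assumed to wrap around at the end of the table, each entry stores the key together with a record offset, and all names are hypothetical.

```python
def ascii_sum_hash(key, size):
    """Remainder of the sum of ASCII values divided by the table size."""
    return sum(ord(ch) for ch in key) % size


class ChainedIndex:
    """Separate chaining: each slot holds a chain of (key, offset) pairs."""

    def __init__(self, size=7):
        self.buckets = [[] for _ in range(size)]

    def insert(self, key, offset):
        slot = ascii_sum_hash(key, len(self.buckets))
        self.buckets[slot].append((key, offset))

    def lookup(self, key):
        slot = ascii_sum_hash(key, len(self.buckets))
        for stored_key, offset in self.buckets[slot]:  # walk the chain
            if stored_key == key:
                return offset
        return None


class LinearProbingIndex:
    """Linear probing: on a collision, try each subsequent slot."""

    def __init__(self, size=7):
        self.slots = [None] * size

    def _probe(self, key):
        size = len(self.slots)
        start = ascii_sum_hash(key, size)
        for step in range(size):
            yield (start + step) % size  # wrap around at the end

    def insert(self, key, offset):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, offset)
                return
        raise RuntimeError("index table is full")

    def lookup(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:  # empty slot: key is absent
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None


chained = ChainedIndex()
probed = LinearProbingIndex()
for key, offset in [("ab@company.com", 0), ("ba@company.com", 54)]:
    chained.insert(key, offset)  # the two keys collide
    probed.insert(key, offset)
```

In both sketches the stored key must be compared with the search key to tell colliding entries apart.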
The double hashing method extends the linear probing method to avoid the problem of clustering that can make linear probing slow for tables that are nearly full. The double hashing method uses two different hash functions. The first hash function identifies a specific location in the index memory, and the second hash function identifies a further address offset from that initial location.
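A minimal sketch of double hashing follows. The two hash functions h1 and h2 are hypothetical choices, not ones given in the text: h1 sums ASCII values, and h2 weights each character by its position so that keys colliding under h1 usually receive different step sizes. The table size is assumed to be prime so that every step size eventually visits every slot.

```python
def h1(key, size):
    """First hash: selects the initial slot."""
    return sum(ord(ch) for ch in key) % size


def h2(key, size):
    """Second hash: selects a step between 1 and size - 1 (never zero)."""
    weighted = sum((i + 1) * ord(ch) for i, ch in enumerate(key))
    return 1 + weighted % (size - 1)


class DoubleHashIndex:
    def __init__(self, size=11):  # prime size, assumed for this sketch
        self.slots = [None] * size

    def _probe(self, key):
        size = len(self.slots)
        start, step = h1(key, size), h2(key, size)
        for n in range(size):
            yield (start + n * step) % size

    def insert(self, key, offset):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, offset)
                return
        raise RuntimeError("index table is full")

    def lookup(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None


index = DoubleHashIndex()
index.insert("ab", 0)
index.insert("ba", 54)  # same first hash as "ab", different step
```

Because colliding keys follow different probe sequences, they do not pile up in adjacent slots the way they can under linear probing.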
As the number of records in a database increases, however, these methods for collision resolution become less efficient. The number of collisions increases with the size of the database, causing the amount of memory required to implement the collision resolution methods to increase as well. Also, collision resolution requires access to the record's key for comparison with the search key. In the case of indirect access, retrieval of the actual record for key comparison degrades performance. While the key may be stored in the index table for ready comparison when collisions occur, this alternative significantly increases the size of the index table.
Further, the hashing function itself becomes more difficult to implement as the number of records in a database increases. For the open addressing methods, such as linear probing, the index table size must be greater than the number of records in the database. For the commonly used “remainder of division” hash function, the size of the hash table should be prime, and computing the hash value for long keys can be expensive in terms of processing time.