The present invention generally relates to the art of data storage and retrieval, and more specifically to determining a physical address on a storage medium during such storage and retrieval.
One well-known type of data organization and access method for rapidly accessing data stored in main memory or in a file is "Hashing". It is used heavily in database systems in particular to access records efficiently. It operates by extracting one or more fields, usually from the record, to form a "Hash Key". Then, a function ("Hash Function") is applied to the Hash Key to identify a "Hash Bucket". If "K" represents the Hash Key, "F" represents the Hash Function, and "B" represents a Hash Bucket, then B=F(K).
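By way of illustration only, the relation B=F(K) can be sketched as follows. The record layout, the choice of the "id" field as the Hash Key, and the modulo-based Hash Function are assumptions made for this example and are not taken from the disclosure.

```python
# Illustrative sketch of B = F(K). The field name "id" and the modulo
# function are assumptions for this example only.

def hash_key(record):
    """Extract the Hash Key K from a record (here, its 'id' field)."""
    return record["id"]

def hash_function(key, num_buckets):
    """F: map a Hash Key onto a Hash Bucket number in 0..num_buckets-1."""
    return key % num_buckets

record = {"id": 1234, "name": "example"}
K = hash_key(record)      # K = 1234
B = hash_function(K, 7)   # B = F(K) = 1234 mod 7
```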
The physical disk sectors of Database files are typically grouped together in Pages. Physical and logical references to these files are made by accessing these Pages. Databases which are accessed via a Hash Function have zero, one, or more Hash Buckets associated with each Page which holds data. The methods proposed by this invention can be extended to cover cases other than one Hash Bucket per Page.
At its simplest, especially when utilizing hashing in main memory, a contiguous series of Hash Buckets or Hash Entries form a Hash Array, with the Hash Bucket number or Hash Table Index computed by applying the Hash Function to the Hash Key being used to index into the Hash Array.
The problem is a little more complex when dealing with databases and hash files. In those instances, each Hash Bucket can typically contain multiple records, and Hash Buckets are organized into a Hash Table. In order to determine whether a record with a specified Hash Key exists in a database or hash structure, the corresponding Hash Bucket can be computed by applying the Hash Function to the Hash Key for the record. Then, the Hash Bucket is searched for the record containing that Hash Key.
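The lookup just described might be sketched as below. This is a minimal in-memory model in which each Hash Bucket is a list of records; the bucket count, the "key" field name, and the modulo Hash Function are hypothetical choices for the example.

```python
# Minimal in-memory model of a Hash Table whose Hash Buckets each hold
# multiple records. NUM_BUCKETS and the "key" field are assumptions.

NUM_BUCKETS = 5

def bucket_for(key):
    """Apply the (assumed) Hash Function to a Hash Key."""
    return key % NUM_BUCKETS

def insert(table, record):
    """Place the record in the Hash Bucket selected by its Hash Key."""
    table[bucket_for(record["key"])].append(record)

def find(table, key):
    """Compute the Hash Bucket, then search it for the matching record."""
    for record in table[bucket_for(key)]:
        if record["key"] == key:
            return record
    return None

table = [[] for _ in range(NUM_BUCKETS)]
for k in (3, 8, 13):          # all three keys hash to bucket 3
    insert(table, {"key": k})
```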
A problem arises, however, when an entry in a Hash Array or a Hash Bucket fills up. When a subsequent record hashes to the same Hash Entry or Hash Bucket, the result is what is termed a "Collision". A number of different algorithms have been developed to address this Collision problem. For example, overflow pages or blocks can be chained to the Hash Bucket. Alternatively, the record can be stored in the next available Hash Bucket that has room. Then, when searching for a record with a given Hash Key, the search starts at the Hash Entry or Hash Bucket addressed by the Hash Function applied to the relevant Hash Key. The search progresses through the Hash Table, and ends in failure when a record is not found in a Hash Entry or Hash Bucket that is not full.
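The second approach, storing the record in the next available Hash Bucket and ending the search at the first bucket that is not full, can be sketched as follows. The bucket capacity, the table size, and the wrap-around from the last bucket to the first are assumptions of this example rather than details given above.

```python
# Sketch of overflow into the next Hash Bucket that has room.
# CAPACITY, NUM_BUCKETS, and the wrap-around are assumed for illustration.

CAPACITY = 2       # records per Hash Bucket
NUM_BUCKETS = 4

def home(key):
    """Hash Bucket addressed by the (assumed) Hash Function."""
    return key % NUM_BUCKETS

def insert(table, key):
    """Store the key in its home bucket or the next bucket with room."""
    for step in range(NUM_BUCKETS):
        idx = (home(key) + step) % NUM_BUCKETS
        if len(table[idx]) < CAPACITY:
            table[idx].append(key)
            return idx
    raise RuntimeError("Hash Table is full")

def search(table, key):
    """Return the bucket holding the key, or None if the search fails."""
    for step in range(NUM_BUCKETS):
        idx = (home(key) + step) % NUM_BUCKETS
        if key in table[idx]:
            return idx
        if len(table[idx]) < CAPACITY:
            return None   # a non-full bucket without the key ends the search
    return None

table = [[] for _ in range(NUM_BUCKETS)]
placed = [insert(table, k) for k in (1, 5, 9)]   # all hash to bucket 1
```

Here the third key collides with a full bucket 1 and is stored in bucket 2, where a later search finds it.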
This latter method works well in situations, such as compiler symbol tables, where there are insertions into a Hash Table, but not deletions. It fails, however, when there are deletions, as in the typical database, since deletions create holes, which could prematurely terminate the search for a matching Hash Key in failure.
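The failure caused by deletions can be demonstrated with a small Hash Array of single-entry slots. The array size, the modulo Hash Function, and the naive deletion (simply emptying the slot) are assumptions chosen to make the effect visible.

```python
# Demonstration of a deletion "hole" ending a search prematurely.
# SIZE and the naive delete are assumed for illustration.

SIZE = 5
EMPTY = None

def insert(array, key):
    """Store the key in its home slot or the next empty slot."""
    for step in range(SIZE):
        i = (key + step) % SIZE
        if array[i] is EMPTY:
            array[i] = key
            return i
    raise RuntimeError("Hash Array is full")

def search(array, key):
    """Probe from the home slot; an empty slot ends the search in failure."""
    for step in range(SIZE):
        i = (key + step) % SIZE
        if array[i] == key:
            return i
        if array[i] is EMPTY:
            return None
    return None

array = [EMPTY] * SIZE
insert(array, 2)                  # stored in slot 2
insert(array, 7)                  # collides at slot 2, stored in slot 3
found_before = search(array, 7)   # found in slot 3
array[2] = EMPTY                  # naive deletion of key 2 leaves a hole
found_after = search(array, 7)    # search stops at the hole in slot 2
```

After the deletion, key 7 is still present in slot 3, but the search reports failure because it stops at the emptied slot 2.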
Another problem that arises in Hashing when dealing with databases and files is when a Hash Table contains discontiguous blocks of Hash Buckets. There may be some pages uniformly dispersed throughout the hash table that may not be capable of containing data records or Hash Buckets. This can happen when, for example, space control information for managing the file content is located on pages spread uniformly through the file. These pages will be referred to as Space Control Pages for the purposes of this disclosure. This application is related to our copending patent applications assigned to the assignee hereof.
There is a related problem when the Hash Buckets do not start at the beginning of the space used for hashing. The problem is that Hash Functions typically generate a continuous range of Hash Bucket numbers. For example, if the Hash Function involves dividing the Hash Key by a specified prime number, and using the remainder as the Hash Bucket number, then the resulting Hash Bucket numbers will typically comprise all of the integers between 0 and Prime−1. However, information containers such as files typically contain header information which takes pages at the beginning of the file.
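The mismatch between a continuous range of Hash Bucket numbers and a file that begins with header pages can be illustrated as follows. The prime 11 and the two header pages are arbitrary values assumed for this example.

```python
# Division-remainder hashing yields every integer in 0..PRIME-1, yet the
# first pages of the file may be header pages. Both constants are assumed.

PRIME = 11
HEADER_PAGES = 2   # pages 0 and 1 hold header information

def bucket_number(key):
    """Remainder of the Hash Key divided by the prime."""
    return key % PRIME

# Bucket numbers produced by a range of keys: all of 0..PRIME-1 appear.
numbers = sorted({bucket_number(k) for k in range(1000)})

# Bucket numbers that would land on header pages if used directly as
# page numbers within the file.
on_header = [b for b in numbers if b < HEADER_PAGES]
```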
The problem that arises here is that when the Hash Table contains holes (and thus discontiguous blocks of Hash Buckets), including a hole at the beginning, the situation must be handled when a Hash Key hashes to one of the holes. The typical overflow procedure of going to the next open Hash Bucket is suboptimal and results in reduced performance. One reason for this suboptimality is that the next Hash Bucket after each hole would become overloaded with hash entries, since that Hash Bucket must not only support and contain the records containing Hash Keys that hash to that Hash Bucket, but also all of those that hash to the preceding hole. The bigger the "hole" in the Hash Table, the worse the problem. For example, if a Space Control Page is the same size as a Hash Bucket, then the next Hash Bucket after such a Space Control Page would fill up twice as fast as the other Hash Buckets using such an overflow procedure. Similarly, if a Space Control Page is twice the size of a Hash Bucket, then the next Hash Bucket after such a Space Control Page would fill up three times as fast as the other Hash Buckets using such an overflow procedure.
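The overloading effect can be reproduced with a short simulation of the first case, in which a Space Control Page is the same size as a Hash Bucket. The page count, the spacing of Space Control Pages, one Hash Bucket per page, and uniformly distributed keys are all assumptions of this sketch.

```python
# Simulation of the overflow procedure when a Hash Key hashes to a hole.
# NUM_PAGES and CONTROL_INTERVAL are assumed; pages 0, 4, and 8 are
# treated as Space Control Pages that cannot hold records.

from collections import Counter

NUM_PAGES = 12
CONTROL_INTERVAL = 4

def is_control_page(page):
    return page % CONTROL_INTERVAL == 0

def place(key):
    """Hash to a page; if it is a hole, move to the next usable page."""
    page = key % NUM_PAGES
    while is_control_page(page):
        page = (page + 1) % NUM_PAGES
    return page

# Uniformly distributed keys: 100 keys hash to each of the 12 pages.
load = Counter(place(k) for k in range(NUM_PAGES * 100))
```

Under these assumptions, pages 1, 5, and 9 each receive 200 records while every other data page receives 100, matching the factor of two described above.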
One solution to these problems is found in U.S. Pat. No. 5,579,501, incorporated by reference herein. The solution embodied in that patent allows the direct computation of the table page which contains the Hash Bucket. It performs this computation using integer arithmetic and involves several computational steps. The present invention involves fewer computational steps in all cases.