1. Field of the Invention
The present invention relates to file organizations for computer systems. More particularly, it relates to the accessing of data in secondary storage systems, such as disks, in a minimum number of access attempts.
2. Description of the Prior Art
The secondary storage devices in large computer systems provide for the storing, updating and retrieving of data in large collections of data. The organization of such collections, termed files, is important for making access efficient. In addition, it is important to be able to insert new data elements into the files, and to delete existing ones from them, particularly on random access secondary storage devices, such as disks. Files that support such insertions and deletions are termed "dynamic" files.
As is well known to skilled computer designers and system programmers, many techniques for structuring such files have been proposed, with the B-Tree index structure being the present standard for commercial equipment. The article by D. Comer, "The Ubiquitous B-Tree", Computing Surveys, Vol. 11, No. 2, June 1979, pages 121-137, contains a good review of B-Trees.
Another, more recent type of file organization scheme suitable for dynamic files is extendible (also known as expandable or dynamic) hashing. A number of techniques have been developed that permit extendible hashing to be used as a fast method of accessing large files residing on external storage, both for files of fixed size and for files which increase in size. For example, the article by R. Fagin et al. entitled "Extendible Hashing--A Fast Access Method for Dynamic Files", ACM Trans. Data Base Syst., Vol. 4, No. 3, September 1979, pp. 315-344, describes the access technique of extendible hashing which, unlike conventional hashing, has a structure which grows and shrinks as the file does. The Fagin et al. method separates the hash address space from the address space of the data by employing an index between the hash function and the disk address where data is stored, and it generates more bits than are required initially to identify the index term. However, the Fagin et al. method requires close to two disk accesses per data access once the file is so large that only a small portion of the index fits in the main memory.
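By way of illustration only, the index-between-hash-and-data arrangement described above can be sketched as follows. This is a minimal in-memory sketch under stated assumptions, not the Fagin et al. implementation: the names `Bucket`, `ExtendibleHash`, `global_depth` and `capacity` are hypothetical, the low-order bits of Python's built-in `hash` stand in for the hash function, and lists and dictionaries stand in for the index and the disk pages.

```python
class Bucket:
    """Stands in for one data page on disk (hypothetical name)."""
    def __init__(self, depth, capacity):
        self.depth = depth        # number of hash bits this bucket distinguishes
        self.capacity = capacity  # records per page
        self.items = {}


class ExtendibleHash:
    """Directory (index) of 2**global_depth entries pointing at buckets."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.global_depth = 0
        self.directory = [Bucket(0, capacity)]

    def _slot(self, key):
        # Only the low-order global_depth bits of the hash are used for now;
        # the remaining bits are held in reserve for later growth.
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key):
        # On disk this would be one access for the index entry (if it is
        # not in main memory) plus one access for the data page.
        return self.directory[self._slot(key)].items.get(key)

    def put(self, key, value):
        while True:
            bucket = self.directory[self._slot(key)]
            if key in bucket.items or len(bucket.items) < bucket.capacity:
                bucket.items[key] = value
                return
            self._split(bucket)

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            # Double the index; each new entry shares its twin's bucket.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.depth += 1
        high_bit = 1 << (bucket.depth - 1)
        sibling = Bucket(bucket.depth, self.capacity)
        # Repoint the index entries whose newly significant bit is set.
        for i, b in enumerate(self.directory):
            if b is bucket and (i & high_bit):
                self.directory[i] = sibling
        # Redistribute only the records of the page that overflowed.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            target = sibling if hash(k) & high_bit else bucket
            target.items[k] = v
```

In this sketch only the overflowing page is rehashed on a split, and the index doubles only when that page already uses every bit the index distinguishes.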
Litwin, in his article entitled "Linear Virtual Hashing: A New Tool for Files and Tables Implementation", published in Proc. 6th Int'l. Conf. on Very Large Data Bases, Montreal, 1980, pages 212 to 223, describes a dynamic hashing function, called a linear hashing function, in which the hash addresses of the keys are changed in some predefined order instead of changing the hash address for the data whose page has overflowed. This has the advantage of causing the space allocated for the file to grow linearly by the addition of contiguous pages to the end of the current file. However, Litwin assumes the existence of a contiguous, continuous address space, which does not make efficient use of the disk space. While he describes a method of mapping his page numbers to disk addresses, his method of utilizing the disk space has the result that adding a primary page to his file typically costs three disk accesses. Further, the number of overflow pages used and the performance are not as favorable as those of my invention.
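The "predefined order" of splitting can be illustrated with a short sketch of linear hashing as it is commonly formulated. It is an assumption-laden illustration rather than Litwin's own code: the class name `LinearHashFile`, the parameters `page_capacity` and `max_load`, and the load-factor trigger for splitting are hypothetical, and each Python list stands in for a primary page together with its overflow chain.

```python
class LinearHashFile:
    """In-memory sketch: pages are split in order 0, 1, 2, ... regardless
    of which page actually overflowed (hypothetical names throughout)."""

    def __init__(self, page_capacity=4, max_load=0.8):
        self.level = 0            # file began this round with 2**level pages
        self.split_pointer = 0    # next page to be split
        self.page_capacity = page_capacity
        self.max_load = max_load  # assumed split trigger (load factor)
        self.pages = [[]]         # each list = primary page + its overflow
        self.count = 0

    def _address(self, h):
        n = 1 << self.level
        addr = h % n
        if addr < self.split_pointer:
            # This page was already split in the current round, so the
            # finer hash function decides between the page and its image.
            addr = h % (2 * n)
        return addr

    def insert(self, key):
        self.pages[self._address(hash(key))].append(key)
        self.count += 1
        if self.count > self.max_load * self.page_capacity * len(self.pages):
            self._split_next()

    def _split_next(self):
        n = 1 << self.level
        victim = self.split_pointer
        old, self.pages[victim] = self.pages[victim], []
        self.pages.append([])     # new page lands at index victim + n
        self.split_pointer += 1
        if self.split_pointer == n:
            # Every page of this round has been split; start a new round.
            self.level += 1
            self.split_pointer = 0
        for k in old:
            self.pages[self._address(hash(k))].append(k)

    def contains(self, key):
        return key in self.pages[self._address(hash(key))]
```

The file grows by exactly one contiguous page per split, which is the linear growth noted above; a page that overflows before its turn simply accumulates overflow records until the split pointer reaches it.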
The paper by G. Martin, entitled "Spiral Storage: Incrementally Augmentable Hash Addressed Storage", Theory of Computation, Report No. 27, U. of Warwick, Coventry, England, March 1979, describes a hashing technique in which the keys are mapped into the address space so that they tend to be more dense at one portion of the space than at another. During file growth, keys which used to occupy the more dense space are spread over the new, less dense space. Martin uses a hash function mapping the keys onto the space exponentially rather than uniformly. However, Martin's method of mapping the relative pages generated by his hash function into real disk addresses is complicated and expensive in disk accesses per primary page added. Further, his method of handling overflow records involves rehashing, which can result in adverse performance, particularly on unsuccessful searches.
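The exponential mapping can be illustrated numerically. The sketch below follows a formulation commonly given in later descriptions of spiral storage and may differ in detail from Martin's report; the function names, the growth factor `d = 2`, and the position parameter `s` are assumptions made for illustration. It covers only the logical page computation, not the mapping of logical pages to real disk addresses.

```python
import math
from collections import Counter


def spiral_page(x, s, d=2.0):
    """Logical page for a hash value x in [0, 1) when the file currently
    occupies logical pages floor(d**s) up to floor(d**(s + 1)) - 1."""
    g = math.ceil(s - x) + x   # the unique g in [s, s + 1) whose fraction is x
    return math.floor(d ** g)


def page_histogram(s, samples=1000, d=2.0):
    """Count how many evenly spaced hash values land on each logical page."""
    return Counter(spiral_page(i / samples, s, d) for i in range(samples))
```

With `s = 3.0` the file occupies logical pages 8 through 15, and `page_histogram(3.0)` shows the lowest-numbered page receiving the most keys and the highest-numbered page the fewest. Raising `s` to `math.log2(9)` retires page 8 and spreads its keys over the new pages 16 and 17, which is the movement from dense to less dense space described above.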
These extendible hashing techniques do not require a complete file reorganization and rehashing to cope with file growth or shrinkage. In addition, they provide faster random access than is typically provided by tree index methods, such as B-trees. They also provide for a limited form of sequentiality, i.e., the ability to sequence through the records of the file in some order, though not in key order. However, none of these hashing methods alone provides for a combination of advantages which is desired in file addressing, i.e., a single disk access, straightforward storage management of the underlying disk space and avoiding the necessity for rehashing to cope with collisions.
A characteristic of all extendible hashing schemes, with the exception of the spiral storage described in the above-referenced paper by Martin, has been oscillatory performance. The hash function distributes the hashed keys uniformly over the pages of the file. Thus, these pages fill up uniformly and become completely filled almost simultaneously. Within a small period of further file growth, the large majority of file pages all overflow and their entries must be split over two pages. The result is that utilization swings between 50% and almost 100%, suddenly "crashing" to 50% during the short splitting period. In addition, the cost of doing an insertion is comparatively low at low utilizations but is considerably higher during the splitting period because so many insertions lead to page splitting. Finally, if overflow records are required by the technique, as they usually are, then the frequency of occurrence of overflow increases dramatically as utilization approaches 100%. This results in a sharp increase in the cost (in terms of disk accesses) of insertions and searches as accessing of the overflow records becomes increasingly common.
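The swing in utilization can be made concrete with an idealized arithmetic model. The sketch assumes, purely for illustration, that keys fill every page perfectly evenly and that all pages split at the same instant, so that the page count doubles whenever the file no longer fits; the function name `ideal_utilization` and its parameters are hypothetical. Real files split over a short period rather than at one instant.

```python
def ideal_utilization(records, page_capacity, initial_pages=1):
    """Fraction of allocated space in use under the idealized model in
    which the number of pages doubles each time the file overflows."""
    pages = initial_pages
    while records > pages * page_capacity:
        pages *= 2             # every page splits into two
    return records / (pages * page_capacity)
```

For pages holding 100 records each, 100 records give a utilization of 1.0, while one more record forces a doubling and drops it to 101/200, or about 0.505. The same drop recurs at 201 records (201/400), 401 records, and so on.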