Methods of organizing large files so that some form of random access is supported have been recognized as important in the art. A particularly successful organization is that of B-trees. This approach is described in an article entitled "Organization and Maintenance of Large Ordered Indexes" by R. Bayer and E. McCreight, in Acta Informatica 1.3, pp. 173-189 (1972). In general, each node in a B-tree of order k contains at most 2k keys and 2k+1 pointers. The number of keys may vary from node to node, but each node other than the root must have at least k keys and k+1 pointers. As a result, each such node is at least 50% full, and generally much fuller. In the usual implementation a node forms one record of the index file, has a fixed length capable of accommodating 2k keys and 2k+1 pointers, and contains additional information telling how many keys currently reside in the node.
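The occupancy rules above can be illustrated with a minimal sketch. The class name `BTreeNode` and its methods are hypothetical and chosen only for illustration; keys are assumed to be integers, and the root is assumed to be allowed as few as one key.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BTreeNode:
    """Illustrative node of a B-tree of order k (names are hypothetical)."""
    k: int
    keys: List[int] = field(default_factory=list)
    children: List["BTreeNode"] = field(default_factory=list)  # empty for a leaf

    def is_leaf(self) -> bool:
        return not self.children

    def fill_ratio(self) -> float:
        # Fraction of the 2k key slots of the fixed-length record that are in use.
        return len(self.keys) / (2 * self.k)

    def is_valid(self, is_root: bool = False) -> bool:
        n = len(self.keys)
        # Assumption: the root may hold as few as one key; other nodes need k.
        lower = 1 if is_root else self.k
        if not (lower <= n <= 2 * self.k):
            return False
        if self.keys != sorted(self.keys):
            return False
        # An internal node with n keys carries exactly n + 1 pointers.
        if not self.is_leaf() and len(self.children) != n + 1:
            return False
        return True
```

A node of order 2 holding the keys 10 and 20 is exactly half full (`fill_ratio()` is 0.5) and satisfies the rules; a non-root node holding a single key does not.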
Several variants of B-trees have been developed, as described e.g. in "Prefix B-Trees", by R. Bayer and K. Unterauer, ACM Transactions on Database Systems, 2.1, pp. 11-26 (March 1977). An important advantage of B-tree organizations over hashing methods is that they support not only random access but also sequential access.
A B-tree is, of course, a tree-structured organization, and as with all trees used in searching, it is desirable to minimize its height so as to keep the access path to its leaves as short as possible. This is particularly important when dealing with large files, since accessing a node in the tree can mean an additional access to external storage. Since such external storage is usually a disk with a substantial seek time and rotational delay, each such additional access is quite expensive.
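To make the relation between file size and height concrete, the following sketch bounds the height of a B-tree of order k holding n keys. It assumes that height is counted in node levels, that the root may hold a single key, and that every other node holds between k and 2k keys; the function names are hypothetical.

```python
def min_keys(k: int, h: int) -> int:
    # Sparsest tree with h levels: a one-key root, all other nodes hold k keys.
    return 2 * (k + 1) ** (h - 1) - 1


def max_keys(k: int, h: int) -> int:
    # Densest tree with h levels: every node holds 2k keys.
    return (2 * k + 1) ** h - 1


def height_bounds(n: int, k: int) -> tuple:
    """Return (best, worst) number of levels for n >= 1 keys in an order-k B-tree."""
    best = 1
    while max_keys(k, best) < n:
        best += 1
    worst = 1
    while min_keys(k, worst + 1) <= n:
        worst += 1
    return best, worst


# One million keys with k = 50 (up to 100 keys per node):
print(height_bounds(1_000_000, 50))  # (3, 4)
```

Under these assumptions a lookup among one million keys touches at most four nodes, so each level saved removes one potential disk access per lookup.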
In order to reduce the height of B-trees, a modified version, called B*-trees, was introduced (cf. E. M. McCreight, "Pagination of B*-Trees with Variable-Length Records", in Communications of the ACM, September 1977, Volume 20, No. 9). The two properties that distinguish B*-trees from B-trees are:
a) All records of the file are stored in leaf nodes; the other nodes of the tree contain only index entries.
b) The number of necessary splitting operations of nodes is reduced by the use of an overflow technique that increases the average storage utilization of each node.
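The overflow idea in b) can be sketched at the leaf level as follows. This is a simplified model, not the procedure of the cited article: leaves are plain sorted Python lists kept in key order, `capacity` is assumed to be at least 1, the index nodes above the leaves are omitted, and in a real tree the separator in the parent would also have to be updated after an overflow. The function name is hypothetical.

```python
import bisect


def insert_with_overflow(leaves, key, capacity, use_overflow=True):
    """Insert key into a key-ordered list of sorted leaves, in place.

    Returns "fits", "overflow-right", "overflow-left" or "split".
    """
    # Choose the first leaf whose largest key is >= key, else the last leaf.
    i = 0
    while i < len(leaves) - 1 and leaves[i][-1] < key:
        i += 1
    bisect.insort(leaves[i], key)
    if len(leaves[i]) <= capacity:
        return "fits"
    if use_overflow:
        # Before splitting, try to hand one entry to a neighbour with room.
        if i + 1 < len(leaves) and len(leaves[i + 1]) < capacity:
            leaves[i + 1].insert(0, leaves[i].pop())
            return "overflow-right"
        if i > 0 and len(leaves[i - 1]) < capacity:
            leaves[i - 1].append(leaves[i].pop(0))
            return "overflow-left"
    # No neighbour can absorb the extra entry: split the leaf in two.
    mid = len(leaves[i]) // 2
    leaves[i:i + 1] = [leaves[i][:mid], leaves[i][mid:]]
    return "split"
```

With leaves `[[1, 2, 4, 5], [10, 11]]` and a capacity of 4, inserting 3 with overflow enabled yields `[[1, 2, 3, 4], [5, 10, 11]]` and no new leaf, whereas with `use_overflow=False` the first leaf is split and three partly filled leaves result.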
The basic B-tree organization can be further improved by key compression techniques, as suggested by D. E. Knuth in "The Art of Computer Programming", Volume 3/Sorting and Searching, Addison-Wesley, Menlo Park, Calif. (1973). Compression increases the fan-out of each node, i.e. the number of entries per node, and hence reduces the average number of disk accesses required to read a record.
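As one possible illustration of key compression, the sketch below uses front compression, in which each key in a sorted run is stored as the length of the prefix it shares with its predecessor plus the remaining suffix. The choice of this particular technique and the function names are assumptions made for the example; the text above does not prescribe a specific scheme.

```python
def front_compress(keys):
    """Encode sorted string keys as (shared_prefix_length, suffix) pairs."""
    out = []
    prev = ""
    for key in keys:
        n = 0
        limit = min(len(prev), len(key))
        while n < limit and prev[n] == key[n]:
            n += 1
        out.append((n, key[n:]))
        prev = key
    return out


def front_decompress(entries):
    """Rebuild the original keys from (shared_prefix_length, suffix) pairs."""
    keys = []
    prev = ""
    for n, suffix in entries:
        key = prev[:n] + suffix
        keys.append(key)
        prev = key
    return keys


keys = ["compile", "computer", "computing"]
print(front_compress(keys))  # [(0, 'compile'), (4, 'uter'), (6, 'ing')]
```

Here 24 key characters shrink to 14 stored characters plus three small counters, so a fixed-length node can hold more entries.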
The performance of the B-tree concept is optimal for uniformly distributed data; in other words, the height of the B-tree is minimal in this case. Sorted data may also be stored in a B-tree, but in this case the performance of the B-tree concept is not optimal. A common disadvantage of the basic B-tree and its variants, especially when used for sorted data, is the large number of split operations that have to be performed when data are stored in the tree one after another, together with the resulting low storage utilization. Low storage utilization is a major drawback if high-speed searching is to be performed in the tree.
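The effect of sorted input on splits and storage utilization can be reproduced with a small leaf-level simulation. It is a simplified model under stated assumptions: only the lowest level of an order-k B-tree is tracked, keys are distinct integers, a full node of 2k+1 keys is split into two nodes of k keys with the middle key promoted to a separator list, and the function name is hypothetical.

```python
import bisect
import random


def simulate_inserts(keys, k):
    """Insert distinct keys one by one; return (splits, leaf utilization)."""
    leaves = [[]]   # leaf nodes in key order
    seps = []       # promoted keys; seps[i] separates leaves[i] and leaves[i + 1]
    splits = 0
    for key in keys:
        i = bisect.bisect_left(seps, key)      # leaf responsible for this key
        bisect.insort(leaves[i], key)
        if len(leaves[i]) > 2 * k:             # overflow: 2k + 1 keys
            node = leaves[i]
            leaves[i:i + 1] = [node[:k], node[k + 1:]]
            seps.insert(i, node[k])            # middle key moves up
            splits += 1
    stored = sum(len(leaf) for leaf in leaves)
    return splits, stored / (len(leaves) * 2 * k)


ascending = list(range(1, 33))
shuffled = ascending[:]
random.Random(0).shuffle(shuffled)
print("sorted:  ", simulate_inserts(ascending, 2))  # (10, 0.5)
print("shuffled:", simulate_inserts(shuffled, 2))
```

With ascending keys every split leaves behind a node that never receives another key, so in this model the leaves stay exactly half full; a shuffled order can never do worse, because every node produced by a split holds at least k keys.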