1. Field of the Invention
This invention relates to a method for sorting data in a computer data storage system, and more particularly to a method for sorting data that has particular advantages in implementing a key index tree structure.
2. Description of the Prior Art
In the computer arts, data is typically stored in some form of non-volatile storage system, such as magnetic disks, in the form of data records. Typical operations conducted using such data records are reading of records; deletion of records; modifying and re-writing existing data records; and adding new data records.
For very large data bases, it is extremely inefficient and time consuming to sequentially search all data records in the storage system in order to find a particular record to read, delete, or modify, or to locate the appropriate place to add a record.
A more efficient, but still cumbersome and time consuming, search method requires creating a search key for each data record that uniquely identifies the record. Each search key is associated with a record pointer that indicates the location in the computer storage system of the data record associated with the search key. A common type of pointer is a relative record number. Through the use of such record pointers, the data records themselves need not be kept in sequential order, but may be stored in random locations in the computer storage system. A search for a particular data record is speeded up by sequentially searching a compiled index of such key records (comprising search keys and record pointers), rather than the data records themselves. However, such sequential searching is still relatively slow.
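By way of illustration only, the key index described above may be sketched as follows. The names `KeyRecord`, `sequential_lookup`, `data_records`, and `key_index` are hypothetical and do not appear in the prior art described; in-memory lists stand in for the storage system, and the record pointer is assumed to be a relative record number.

```python
from typing import List, Optional, Tuple

# A key record pairs a search key with a record pointer
# (assumed here to be a relative record number).
KeyRecord = Tuple[str, int]


def sequential_lookup(index: List[KeyRecord], key: str) -> Optional[int]:
    """Scan the key index in order; return the record pointer, or None."""
    for search_key, record_number in index:
        if search_key == key:
            return record_number
    return None


# Data records may sit in arbitrary order; only the index is searched.
data_records = ["record for SMITH", "record for ADAMS", "record for JONES"]
key_index = [("ADAMS", 1), ("JONES", 2), ("SMITH", 0)]

pointer = sequential_lookup(key_index, "JONES")
if pointer is not None:
    print(data_records[pointer])
```

In the worst case the loop examines every key record in the index, which is the slowness noted above.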
A much more efficient search method for such a key index is to create a "tree" structure, rather than a sequential file, for the key records. One such tree structure is a "B-tree", an example of which is shown in FIG. 1. The use of B-trees to structure indexes for data files in computer data storage systems is well known in the prior art. (See, for example, Knuth, The Art of Computer Programming, Vol. 3, pp. 473-479). A B-tree consists of nodes which can be either leaf nodes or branch nodes. A branch node contains at least one search key and related pointers (such as relative node numbers) to other nodes. A leaf node contains at least one search key and pointers to data records. One node in the tree is the root node, which can be either a leaf node (only for a tree with a single node) or a branch node. The "height" of a tree is equivalent to the number of nodes traversed from the root node to a leaf node. Searching for a data record is accomplished by comparing a key to the contents of the root node, branching to branch nodes based on such comparisons, comparing the key to the contents of such branch nodes, and continuing "down" the height of the tree until a leaf node is reached. The key is compared to the contents of the leaf node, and one of the pointers in the leaf node is used to fetch the desired data record (if one exists).
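The search procedure just described may be sketched as follows. This is a minimal illustration and not the structure of FIG. 1: the class names `Leaf` and `Branch`, the function `search`, and the convention that a separator key is the smallest key reachable through the child to its right are all assumptions made for the example.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class Leaf:
    keys: List[str]             # sorted search keys
    record_pointers: List[int]  # one data record pointer per key


@dataclass
class Branch:
    keys: List[str]                        # sorted separator keys
    children: List[Union["Branch", Leaf]]  # one more child than keys


def search(root: Union[Branch, Leaf], key: str) -> Tuple[Optional[int], int]:
    """Return (record pointer or None, number of nodes traversed)."""
    node = root
    traversed = 1
    while isinstance(node, Branch):
        # Keys below keys[i] lie under children[i]; equal or greater keys lie right.
        node = node.children[bisect_right(node.keys, key)]
        traversed += 1
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.record_pointers[i], traversed
    return None, traversed


root = Branch(
    keys=["JONES", "SMITH"],
    children=[
        Leaf(["ADAMS", "BAKER"], [1, 4]),
        Leaf(["JONES", "KING"], [2, 5]),
        Leaf(["SMITH", "YOUNG"], [0, 3]),
    ],
)
print(search(root, "KING"))   # (5, 2)
print(search(root, "CLARK"))  # (None, 2)
```

The second value returned corresponds to the "height" as defined above: the number of nodes traversed from the root node to a leaf node.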
In the most simple B-tree, each node contains one search key and two associated pointers. Such a tree structure, sometimes referred to as a binary tree, theoretically provides a very efficient search method. If the number of nodes in this type of tree is equal to or less than 2^n, then only "n" searches are required to locate a data record pointer in any leaf node.
In practice, a simple binary tree is inefficient. Most data bases are stored on relatively slow storage systems, such as magnetic disks. The time required to access any item of data (such as a tree node) on such a storage device is dominated by the "seek" time required for the storage unit to physically locate the desired storage address. Following each seek, the contents of a node may be read into the high-speed memory of the computer system. In a simple binary tree, for each access of a node, only a two-way decision (to the left or right branch from that node) can be made since the node contains only one search key. If instead of containing only one search key per node, a node contains several search keys, then for each seek operation, several keys will be read into the high-speed memory of the computer system. With one search key per node, a comparison and determination can be made that the item sought is in one half of the remainder of the tree. With "n-1" search keys per node, the search can be narrowed to "1/n" of the remainder of the tree (for example, with 9 search keys per node, a search can be narrowed to 1/10 of the remainder of the tree). This type of structure is known in the prior art as a "multi-way" tree.
It is advantageous to have as many search keys as possible per node. Thus, for each seek of a node, several search keys can be examined and a more efficient determination can be made as to the location of the next node or, in the case of a leaf node, of a data record. The height of the tree, and hence the search time, is dramatically decreased if the number of search keys per node is increased.
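The effect of the number of search keys per node on tree height can be estimated with a short calculation. The sketch below assumes an idealized, completely full tree in which every node holds the same number of keys, so that a tree of height h with n-1 keys per node holds n^h - 1 keys. The function name `min_height` is hypothetical, and real trees are rarely completely full.

```python
def min_height(total_keys: int, keys_per_node: int) -> int:
    """Smallest height at which a completely full tree can hold total_keys."""
    fan_out = keys_per_node + 1  # "n-1" keys per node give an n-way branch
    height = 0
    capacity = 0
    while capacity < total_keys:
        height += 1
        capacity = fan_out ** height - 1
    return height


for keys_per_node in (1, 9, 99):
    print(keys_per_node, min_height(1_000_000, keys_per_node))
# 1 20
# 9 7
# 99 4
```

Under these assumptions, locating one of a million keys takes about 20 node accesses (and hence up to 20 seeks) with one key per node, but only 4 with 99 keys per node.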
A very efficient method of searching large storage disk-based key indexes based on this concept is described in U.S. Pat. No. 4,677,550, entitled "Method of Compacting and Searching a Data Index", which issued on Jun. 30, 1987 to the inventor. By the use of a tree structure called a "Bit-tree", the search keys in leaf nodes are compacted such that a much larger percentage of the key records of the tree structure are located in leaf nodes. Searching for a data record is accomplished in essentially the same manner as for B-trees, but the height of the search tree is substantially reduced, permitting faster fetching of the desired data record.
Although B-trees and Bit-trees represent efficient methods of searching for data records, both methods require initial generation of the tree structure itself. An operation necessary for maintenance of an existing tree structure of either type is the ADD RECORD operation (which actually adds a key record to the tree), the methods of which are well known. Therefore, a tree can be initially built simply by "adding" a first key record to an empty tree, then sequentially adding further key records until all key records have been added to the tree.
It is known that a tree can be built much more efficiently if the key records are initially physically sorted in the order of their key field than if the records are in random order. Therefore, it is common for many systems to physically sort sets of key records first, and then build a tree based on keys extracted at intervals from the sorted key records.
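One way such a build from sorted key records might proceed is sketched below. This is an assumption-laden illustration, not a method taken from any particular prior art system: the function name `build_index_levels` is hypothetical, each node is represented only by its list of keys, and the key extracted for each child is assumed to be the smallest key beneath it.

```python
from typing import List


def build_index_levels(sorted_keys: List[str], keys_per_node: int) -> List[List[List[str]]]:
    """Return the tree level by level, leaf level first; a node is a list of keys."""
    fan_out = keys_per_node + 1
    # Leaf level: cut the sorted key records into consecutive nodes.
    level = [sorted_keys[i:i + keys_per_node]
             for i in range(0, len(sorted_keys), keys_per_node)]
    lowest = [node[0] for node in level]  # smallest key under each node
    levels = [level]
    while len(level) > 1:
        branches, branch_lowest = [], []
        for i in range(0, len(level), fan_out):
            group = lowest[i:i + fan_out]
            branches.append(group[1:])      # keys extracted at intervals
            branch_lowest.append(group[0])
        level, lowest = branches, branch_lowest
        levels.append(level)
    return levels


levels = build_index_levels(list("ABCDEFGHIJ"), keys_per_node=3)
print(levels[0])   # [['A', 'B', 'C'], ['D', 'E', 'F'], ['G', 'H', 'I'], ['J']]
print(levels[-1])  # [['D', 'G', 'J']]
```

Because the input is already in key order, each node is written once and never split. The sketch makes no attempt to balance a short final node, which can leave the last branch of a level with few or no keys.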
Sorting of key records for large data bases (i.e., data bases that require storage outside of the main memory of a computer) is typically accomplished in a two-step process. First, the data records are read and key records formed and stored in memory. The key records are "pre-sorted" within the memory and then written out to a storage system as a sorted "string" of key records, typically into an unused portion of the storage system. This production of sorted strings continues until all of the original data records have been read and their key records sorted into one of such strings. Examples of such strings are shown in FIG. 2, labeled as "input strings".
After the generation of all necessary strings, at least two strings at a time are read back into memory and then merged into sorted order (this example is of 2-way merging; it is known in the art to extend this concept to N-way merging). An example of this process is diagrammatically shown in FIG. 2. The merged string is then written out to the storage system. Such merging continues for subsequent passes until only a single, sorted string remains that contains all of the key records.
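The two-step process described above (pre-sorting into strings, then repeated 2-way merging) may be sketched as follows. In-memory lists stand in for strings written to the storage system, and the names `make_sorted_strings`, `merge_two`, `merge_passes`, and `memory_capacity` are hypothetical. The sketch is not a reproduction of FIG. 2.

```python
from typing import Iterable, List, Tuple


def make_sorted_strings(keys: Iterable[str], memory_capacity: int) -> List[List[str]]:
    """Step 1: fill memory, pre-sort, and emit one sorted string per fill."""
    strings, buffer = [], []
    for key in keys:
        buffer.append(key)
        if len(buffer) == memory_capacity:
            strings.append(sorted(buffer))
            buffer = []
    if buffer:
        strings.append(sorted(buffer))
    return strings


def merge_two(a: List[str], b: List[str]) -> List[str]:
    """Merge two sorted strings into one sorted string."""
    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


def merge_passes(strings: List[List[str]]) -> Tuple[List[str], int]:
    """Step 2: 2-way merge, pass after pass, until one string remains."""
    passes = 0
    while len(strings) > 1:
        merged_strings = []
        for k in range(0, len(strings), 2):
            if k + 1 < len(strings):
                merged_strings.append(merge_two(strings[k], strings[k + 1]))
            else:
                merged_strings.append(strings[k])  # odd string carried forward
        strings = merged_strings
        passes += 1
    return (strings[0] if strings else []), passes


input_strings = make_sorted_strings("GCIAHBFDE", memory_capacity=3)
print(input_strings)  # [['C', 'G', 'I'], ['A', 'B', 'H'], ['D', 'E', 'F']]
result, passes = merge_passes(input_strings)
print(result, passes)
```

With three input strings, two merge passes are needed, and each pass rewrites its merged strings to the storage system.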
This process of building a tree by physically sorting key records and then adding the sorted key records to the tree structure is inefficient when considered in light of the desired result. The goal is to build a tree; the ideal method would be to directly build the tree more efficiently than with the two-step "sort and add" method, or with the simple sequential addition method. In addition, the prior art generally teaches that to sort a data file having "N" records, the storage system must have space available to store "2N" records during the sorting process.
The present invention accomplishes this goal by means of a new sorting method that sorts extracted key records into a linked list structure that can be directly transformed into an index tree. The inventive sorting method also may be used simply for sorting large sets of data records in place on a computer storage system.