1. Field of the Invention
This invention relates to a method for sorting data in a computer data storage system, and more particularly to a method for sorting and compressing data that has particular advantages in implementing a key index tree structure.
2. Description of the Prior Art
In the computer arts, data is typically stored in some form of non-volatile storage system, such as magnetic disks, in the form of data records. Typical operations conducted using such data records are reading of records; deletion of records; modifying and re-writing existing data records; and adding new data records.
For very large data bases, it is extremely inefficient and time consuming to sequentially search all data records in the storage system in order to find a particular record to read, delete, or modify, or to locate the appropriate place to add a record.
A more efficient, but still cumbersome and time consuming, search method requires creating a search key for each data record that uniquely identifies the record. Each search key is associated with a record pointer that indicates the location in the computer storage system of the data record associated with the search key. A common type of pointer is a relative record number. Through the use of such record pointers, the data records themselves need not be kept in sequential order, but may be stored in random locations in the computer storage system. A search for a particular data record is sped up by sequentially searching a compiled index of such key records (comprising search keys and record pointers), rather than the data records themselves. However, such sequential searching is still relatively slow.
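The key index described above can be illustrated with a minimal sketch. The record layout, the field names, and the function names `build_key_index` and `sequential_lookup` are assumptions made for illustration only; the position of a record in the list stands in for its relative record number.

```python
from typing import List, Optional, Tuple

# Hypothetical data records, stored in arbitrary (unsorted) order.
# A record's position in this list plays the role of its relative record number.
DATA_RECORDS = [
    {"name": "MAYERS", "city": "Tulsa"},
    {"name": "ADAMS", "city": "Boise"},
    {"name": "MAYER", "city": "Akron"},
]

KeyRecord = Tuple[str, int]  # (search key, relative record number)


def build_key_index(records: List[dict], key_field: str) -> List[KeyRecord]:
    """Form one key record per data record: the search key plus a record pointer."""
    return [(rec[key_field], rrn) for rrn, rec in enumerate(records)]


def sequential_lookup(index: List[KeyRecord], key: str) -> Optional[int]:
    """Scan the key records in order; return the record pointer, or None if absent."""
    for search_key, rrn in index:
        if search_key == key:
            return rrn
    return None


index = build_key_index(DATA_RECORDS, "name")
rrn = sequential_lookup(index, "MAYER")   # pointer into DATA_RECORDS
```

The scan touches only the small key records rather than the full data records, but it still examines the index entries one after another, which is why the text calls it relatively slow.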
A much more efficient search method for such a key index is to create a "tree" structure, rather than a sequential file, for the key records. One such tree structure is a "B-tree", an example of which is shown in FIG. 1. The use of B-trees to structure indexes for data files in computer data storage systems is well known in the prior art. (See, for example, Knuth, The Art of Computer Programming, Vol. 3, pp. 473-479). A B-tree consists of nodes which can be either leaf nodes or branch nodes. A branch node contains at least one search key and related pointers (such as relative node numbers) to other nodes. A leaf node contains at least one search key and pointers to data records. One node in the tree is the root node, which can be either a leaf node (only for a tree with a single node) or a branch node. The "height" of a tree is equivalent to the number of nodes traversed from the root node to a leaf node. Searching for a data record is accomplished by comparing a key to the contents of the root node, branching to branch nodes based on such comparisons, comparing the key to the contents of such branch nodes, and continuing "down" the height of the tree until a leaf node is reached. The key is compared to the contents of the leaf node, and one of the pointers in the leaf node is used to fetch the desired data record (if one exists).
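The search procedure just described can be sketched as follows. This is an illustrative sketch only, not the structure of FIG. 1: the `Node` class, the `search` function, and the convention that child `i` of a branch node holds the keys less than or equal to that node's key `i` are assumptions chosen for the example.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    keys: List[str]                                         # sorted search keys
    children: List["Node"] = field(default_factory=list)    # branch nodes only
    pointers: List[int] = field(default_factory=list)       # leaf nodes only

    @property
    def is_leaf(self) -> bool:
        return not self.children


def search(root: Node, key: str) -> Optional[int]:
    """Walk from the root down to a leaf; return the record pointer or None."""
    node = root
    while not node.is_leaf:
        # Assumed convention: children[i] holds keys <= keys[i];
        # the last child holds every key greater than the last key.
        node = node.children[bisect_left(node.keys, key)]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.pointers[i]
    return None


# A hypothetical two-level tree: one branch (root) node over three leaf nodes.
leaf1 = Node(keys=["ADAMS", "BAKER"], pointers=[7, 2])
leaf2 = Node(keys=["CLARK", "DAVIS"], pointers=[5, 0])
leaf3 = Node(keys=["EVANS", "FORD"], pointers=[3, 9])
root = Node(keys=["BAKER", "DAVIS"], children=[leaf1, leaf2, leaf3])

print(search(root, "CLARK"))   # 5
print(search(root, "COOK"))    # None (no such key record)
```

Each loop iteration corresponds to one node access, so the number of iterations plus the final leaf comparison equals the height of the tree as defined above.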
In the simplest B-tree, each node contains one search key and two associated pointers. Such a tree structure, sometimes referred to as a binary tree, theoretically provides a very efficient search method. If the number of nodes in this type of tree is equal to or less than 2^n, then only "n" searches are required to locate a data record pointer in any leaf node.
In practice, a simple binary tree is inefficient. Most data bases are stored on relatively slow storage systems, such as magnetic disks. The time required to access any item of data (such as a tree node) on such a storage device is dominated by the "seek" time required for the storage unit to physically locate the desired storage address. Following each seek, the contents of a node may be read into the high-speed memory of the computer system. In a simple binary tree, for each access of a node, only a two-way decision (to the left or right branch from that node) can be made since the node contains only one search key. If, instead of containing only one search key per node, a node contains several search keys, then for each seek operation, several keys will be read into the high-speed memory of the computer system. With one search key per node, a comparison and determination can be made that the item sought is in one half of the remainder of the tree. With "n-1" search keys per node, the search can be narrowed to "1/n" of the remainder of the tree (for example, with 9 search keys per node, a search can be narrowed to 1/10 of the remainder of the tree). This type of structure is known in the prior art as a "multiway" tree.
It is advantageous to have as many search keys as possible per node. Thus, for each seek of a node, several search keys can be examined and a more efficient determination can be made as to the location of the next node or, in the case of a leaf node, of a data record. The height of the tree, and hence the search time, is dramatically decreased if the number of search keys per node is increased.
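The effect of the number of search keys per node on tree height can be estimated with a short calculation. The sketch below assumes an idealized, completely full tree in which every node access narrows the search by a factor equal to the number of keys per node plus one; the function name `levels_needed` is hypothetical.

```python
def levels_needed(num_keys: int, keys_per_node: int) -> int:
    """Node accesses needed to narrow num_keys candidates down to one,
    assuming each access divides the remaining candidates by the fan-out."""
    fan_out = keys_per_node + 1      # n-1 keys per node give an n-way decision
    levels, reach = 0, 1
    while reach < num_keys:
        reach *= fan_out
        levels += 1
    return levels


for keys_per_node in (1, 9, 99):
    print(keys_per_node, levels_needed(1_000_000, keys_per_node))
# 1 20
# 9 6
# 99 3
```

Under these assumptions, one million keys require about 20 seeks with one key per node but only 3 seeks with 99 keys per node, which is the motivation for packing as many search keys as possible into each node.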
A very efficient method of searching large storage disk-based key indexes based on this concept is described in U.S. Pat. No. 4,677,550, entitled "METHOD OF COMPACTING AND SEARCHING A DATA INDEX", which issued on Jun. 30, 1987 to the inventor. By the use of a tree structure called a "Bit-tree", the search keys in leaf nodes are compacted such that a much larger percentage of the key records of the tree structure are located in leaf nodes. Searching for a data record is accomplished in essentially the same manner as for B-trees, but the height of the search tree is substantially reduced, permitting faster fetching of the desired data record.
Although B-trees and Bit-trees represent efficient methods of searching for data records, both methods require initial generation of the tree structure itself. An operation necessary for maintenance of an existing tree structure of either type is the ADD RECORD operation (which actually adds a key record to the tree), the methods of which are well known. Therefore, a tree can be initially built simply by "adding" a first key record to an empty tree, then sequentially adding further key records until all key records have been added to the tree.
It is known that a tree can be built much more efficiently if the key records are initially physically sorted in the order of their key field than if the records are in random order. Therefore, it is common for many systems to physically sort sets of key records first, and then build a tree based on keys extracted at intervals from the sorted key records.
Sorting of key records for large data bases (i.e., data bases that require storage outside of the main memory of a computer) is typically accomplished in a two-step process. First, the data records are read and key records formed and stored in memory. The key records are "pre-sorted" within the memory and then written out to a storage system as a sorted "string" of key records, typically into an unused portion of the storage system. This production of sorted strings continues until all of the original data records have been read and their key records sorted into one of such strings. Examples of such strings are shown in FIG. 2, labeled as "input strings".
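The first step, production of sorted strings, can be sketched as follows. This is a simplified illustration and not the arrangement of FIG. 2: the function name `make_sorted_strings` and the parameter `memory_capacity` (the number of key records assumed to fit in memory at once) are hypothetical, and in-memory lists stand in for strings written to the storage system.

```python
from typing import Iterable, Iterator, List, Tuple

KeyRecord = Tuple[str, int]  # (search key, record pointer)


def make_sorted_strings(
    key_records: Iterable[KeyRecord], memory_capacity: int
) -> Iterator[List[KeyRecord]]:
    """Fill memory with key records, pre-sort them, and emit one sorted string."""
    buffer: List[KeyRecord] = []
    for rec in key_records:
        buffer.append(rec)
        if len(buffer) == memory_capacity:
            buffer.sort()            # pre-sort within memory
            yield buffer             # stands in for writing the string to storage
            buffer = []
    if buffer:                       # final, possibly shorter, string
        buffer.sort()
        yield buffer


unsorted = [("D", 0), ("A", 1), ("C", 2), ("B", 3), ("E", 4)]
strings = list(make_sorted_strings(unsorted, memory_capacity=2))
# strings == [[("A", 1), ("D", 0)], [("B", 3), ("C", 2)], [("E", 4)]]
```

Each string is sorted internally, but the strings are not ordered relative to one another; that is the job of the subsequent merging step.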
After the generation of all necessary strings, at least two strings at a time are read back into memory and then merged into sorted order (this example is of 2-way merging; it is known in the art to extend this concept to N-way merging). An example of this process is diagrammatically shown in FIG. 2. The merged string is then written out to the storage system. Such merging continues for subsequent passes until only a single, sorted string remains that contains all of the key records.
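The merging step can likewise be sketched for the 2-way case. The function names `merge_two` and `merge_passes` are hypothetical, and in-memory lists again stand in for strings held on the storage system.

```python
from typing import List, Sequence, Tuple


def merge_two(a: Sequence, b: Sequence) -> list:
    """Merge two sorted strings into one sorted string."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])    # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out


def merge_passes(strings: Sequence[Sequence]) -> Tuple[list, int]:
    """Merge strings pairwise, pass after pass, until a single string remains.
    Returns the final string and the number of passes performed."""
    current: List[list] = [list(s) for s in strings]
    passes = 0
    while len(current) > 1:
        merged = []
        for k in range(0, len(current), 2):
            if k + 1 < len(current):
                merged.append(merge_two(current[k], current[k + 1]))
            else:
                merged.append(current[k])   # odd string carried to the next pass
        current = merged
        passes += 1
    return (current[0] if current else []), passes


result, passes = merge_passes([[1, 5], [2, 6], [3, 4]])
# result == [1, 2, 3, 4, 5, 6], passes == 2
```

With 2-way merging the number of strings roughly halves on each pass. For N-way merging, Python's standard library provides `heapq.merge`, which merges any number of sorted inputs in a single pass.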
This process of building a tree by physically sorting key records and then adding the sorted key records to the tree structure is inefficient when considered in light of the desired result. The goal is to build a tree; the ideal method would be to directly build the tree more efficiently than with the two-step "sort and add" method, or with the simple sequential addition method. In addition, the prior art generally teaches that to sort a data file having "N" records, the storage system must have space available to store "2N" records during the sorting process.
The present invention accomplishes this goal by means of a new sorting method that sorts extracted key records into a linked list structure that can be directly transformed into an index tree. The inventive sorting method also may be used simply for sorting large sets of data records in place on a computer storage system.
It is also known in the prior art to compress key records after sorting the keys and before (or during) tree building for the purpose of decreasing their size. One method of compression is described in U.S. Pat. No. 4,677,550 referenced above. Two other means commonly used for compressing key records are prefix compression and suffix compression.
All of these compression techniques are employed after sorting the key records. For prefix compression and suffix compression, similarities between the leading characters and the trailing characters, respectively, of ordered search keys are exploited to reduce the size of the search key, thereby making it possible to increase the number of search keys in a node.
Prefix compression reduces the number of characters in a search key by eliminating leading characters of a key that are common to a preceding search key. For example, if two adjacent search keys are "MAYER" and "MAYERS", the leading characters, "MAYER", can be eliminated from the second search key and replaced by an indication of the number of leading characters so eliminated. Hence, the second search key could be replaced with the compressed search key "5,S". Conventionally, the number "5" is placed in a separate prefix field ("P-field") in the compressed search key record.
The number "5" in the P-field indicates that the first 5 characters of that search key are identical to the first 5 characters of the preceding search key. Thus, the current search key can be completely reconstructed by reading the first 5 characters of the preceding search key and appending the characters retained in the compressed search key (here, "S").
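Prefix compression and the corresponding reconstruction can be sketched as follows. The function names and the representation of a compressed key as a (P-field, remaining characters) pair are assumptions made for illustration.

```python
from typing import List, Tuple

Compressed = Tuple[int, str]  # (P-field, remaining characters)


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def prefix_compress(sorted_keys: List[str]) -> List[Compressed]:
    """Replace the leading characters shared with the preceding key by a count."""
    out, prev = [], ""
    for key in sorted_keys:
        p = common_prefix_len(prev, key)
        out.append((p, key[p:]))
        prev = key
    return out


def prefix_decompress(compressed: List[Compressed]) -> List[str]:
    """Rebuild each key from the preceding key's first P characters plus the rest."""
    keys, prev = [], ""
    for p, rest in compressed:
        prev = prev[:p] + rest
        keys.append(prev)
    return keys


packed = prefix_compress(["MAYER", "MAYERS"])
# packed == [(0, "MAYER"), (5, "S")]
```

Because each key is rebuilt from its predecessor, the keys must be decompressed in order from the start of the compressed sequence.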
Suffix compression eliminates trailing characters which are unnecessary for determining the relative position of a search key among other search keys. For example, if a first search key is "TIMECLOCK" and the next ordered search key is "TIMESHARE", the second search key can be truncated just after the first character ("S") that distinguishes the second key from the first key. That is, the second search key can be compressed to "TIMES". Conventionally, the size of the remaining key is placed in a separate suffix field ("S-field") in the compressed search key record for ease of computing the length of the key record.
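Suffix compression as described in this example can be sketched as follows. The function names and the (S-field, retained characters) representation are hypothetical. Leaving the first key untruncated is an assumption of this sketch, since the first key has no preceding key to be compared against.

```python
from typing import List, Tuple


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def suffix_compress(sorted_keys: List[str]) -> List[Tuple[int, str]]:
    """Truncate each key just after the first character that distinguishes it
    from the preceding key; return (S-field, retained characters) pairs."""
    out = []
    prev = None
    for key in sorted_keys:
        if prev is None:
            kept = key                                   # assumption: first key kept whole
        else:
            kept = key[: common_prefix_len(prev, key) + 1]
        out.append((len(kept), kept))
        prev = key
    return out


packed = suffix_compress(["TIMECLOCK", "TIMESHARE"])
# packed == [(9, "TIMECLOCK"), (5, "TIMES")]
```

Unlike prefix compression, this truncation discards information: "TIMES" still sorts after "TIMECLOCK", but the full key "TIMESHARE" cannot be recovered from the compressed key alone.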
Using both prefix and suffix compression techniques, the search key "TIMESHARE" following the search key "TIMECLOCK" could be compressed to "4,1,S", yielding a savings of 6 characters.
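The combined use of both techniques can be sketched as follows. The function name `compress` and the (P-field, S-field, retained characters) tuple are assumptions made for illustration. The count of characters saved assumes, as the example above does, that each field occupies the space of one character.

```python
from typing import List, Tuple

Compressed = Tuple[int, int, str]  # (P-field, S-field, retained characters)


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def compress(sorted_keys: List[str]) -> List[Compressed]:
    """Drop the prefix shared with the preceding key, then keep only the first
    distinguishing character of what remains."""
    out = []
    prev = None
    for key in sorted_keys:
        if prev is None:
            out.append((0, len(key), key))     # assumption: first key kept whole
        else:
            p = common_prefix_len(prev, key)
            kept = key[p : p + 1]              # the first distinguishing character
            out.append((p, len(kept), kept))
        prev = key
    return out


packed = compress(["TIMECLOCK", "TIMESHARE"])
p, s, kept = packed[1]                          # (4, 1, "S")
saved = len("TIMESHARE") - (1 + 1 + len(kept))  # 9 - 3 = 6 characters
```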
As noted above, suffix compression and prefix compression are conventionally done after key sorting, during the tree-building stage. The prior art three-step process (sort, compress, and tree build) required when key records are to be compressed is an inefficient means to achieve the goal of creating a tree with compressed search keys in each node. It would be desirable to improve the efficiency of this process.
The inventive sorting and compression method integrates compression of key records into a new sorting method to yield greater efficiency, and thus overcome the limitations of the prior art.