1. Field of the Invention
This invention relates to systems for distributive sorting, and more particularly, to efficient Most Significant Byte (MSB) radix sorting through Dynamic Branch Table (DBT) control of sorting bucket ordering.
2. Discussion of the Related Art
Sorting is generally acknowledged to be one of the most time-consuming computer-implemented procedures. It has been estimated that over twenty-five percent (25%) of all computer running time is devoted to sorting. Many installations use over half of their available computer time for sorting. Numerous proposals and improvements have been disclosed in the art for the purposes of reducing sorting time and increasing sorting efficiency. For a discussion of efficient large-scale sorting methods, refer to, for instance, Harold Lorin, "Sorting and Sorting Systems", the Systems Programming Series, copyright 1975 by Addison-Wesley Publishing Company, pp. 143-66 (the chapter on distributive sorts) and especially pp. 148-158 on digit sorting.
The classical sorting procedure known in the art sorts a group of data records into a sequence specified by an identifying key assigned to each record. For small numbers of records, a first class of sorting procedures, having minimal overhead unrelated to the number of records, is most efficient. This first class includes the insertion, selection and bubble sorts and generally includes the simplest procedures, which require a sorting time proportional to the square of the number of records. A second class of sorting procedures is most efficient for intermediate numbers (up to 100,000) of records. This second class requires a sorting time proportional to N log₂ N, where N is the number of records, and includes Quicksort and Heapsort, which are known in the art. A third class of sorting procedures is generally useful for very large files and requires a sorting time linearly proportional to the number of records, plus a substantial overhead that is independent of the number of records. This third class includes all types of bin or bucket sorting, including radix sorting. The fixed computational overhead unrelated to the number of records makes this third class of sorts inefficient for smaller numbers of records.
The typical bucket sort employs one of two approaches. In these approaches, the distribution of records starts either according to the Least Significant Byte (LSB) or Most Significant Byte (MSB) in the key field (see the Lorin reference cited above). The only difference between these two is the key field scanning direction.
For LSB radix sorting, records are first distributed to buckets according to the LSB value in the key. After this first distribution, the LSB buckets are recombined so that the order of the LSBs is preserved. Then the records are again sequentially distributed to buckets according to the Next Least Significant Byte (NLSB) in the key. This process is repeated until the final distribution pass for the MSB, at which point the records are sorted. The primary drawback of LSB radix sorting is that it is insensitive to the data. The number of distribution passes is constant and equal to the number of bytes in the key, regardless of opportunities for short-cuts arising from the data distribution. The entire key must be scanned even if only a few of the more significant bytes are sufficient to order the records.
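The LSB procedure described above can be illustrated with a minimal sketch. It assumes that every key is a byte string of the same length; the function name `lsb_radix_sort` and its parameters are hypothetical and are not taken from the cited references.

```python
def lsb_radix_sort(records, key_len, radix=256):
    """Sort equal-length byte-string keys, least significant byte first.

    Illustrative sketch only: names and parameters are assumptions.
    """
    # One distribution pass per key byte, starting from the last (least
    # significant) byte, regardless of how the data are distributed.
    for pos in range(key_len - 1, -1, -1):
        buckets = [[] for _ in range(radix)]
        for rec in records:
            # Appending keeps the pass stable, so the order established
            # by earlier (less significant) passes is preserved.
            buckets[rec[pos]].append(rec)
        # Recombine the buckets in ascending byte-value order.
        records = [rec for bucket in buckets for rec in bucket]
    return records


keys = [b"dog", b"cat", b"cow", b"ant"]
print(lsb_radix_sort(keys, key_len=3))
```

In this sketch the loop always runs `key_len` times, which is the data insensitivity noted above: all three bytes are scanned even though the first byte alone already separates most of the example keys.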
With MSB radix sorting, records are first distributed according to the MSB value in the key. Records with the same MSB are grouped within the same bucket. Each of the buckets can then be sorted independently of other buckets without the recombination step needed in the LSB sort. The records in the first MSB bucket are distributed again according to the second MSB. Then the records in the first bucket in the second rank, which have the same first two MSBs, are distributed again according to the third MSB. As the distribution continues down the key ranks, the number of records having identical MSBs within a bucket becomes smaller and smaller. The records within a bucket are completely sorted when either the LSB has been examined or the bucket holds a single record.
This process continues recursively until all records in the first bucket in the first rank are sorted. Then the second bucket in the first rank is distributed recursively according to the second MSB, the third MSB, and so on; the third bucket in the first rank is then distributed in the same manner, and so forth.
Because many of the subsequent buckets will have one or no records, the recursive sort sequences will often terminate before LSB examination. Thus, the MSB radix sorting method exploits the data distribution and does not require a constant number of distribution passes. For some data, the MSB radix sort can be significantly more efficient than the LSB radix sort of the same records.
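The recursive MSB distribution described above can be sketched as follows. This is a minimal illustration under the assumption of equal-length byte-string keys; the name `msb_radix_sort` and its parameters are hypothetical.

```python
def msb_radix_sort(records, key_len, pos=0, radix=256):
    """Sort equal-length byte-string keys, most significant byte first.

    Illustrative sketch only: names and parameters are assumptions.
    """
    # A bucket is finished when it holds at most one record, or when the
    # last key byte (the LSB) has already been examined.
    if len(records) <= 1 or pos == key_len:
        return list(records)

    # Distribute the records of this bucket on the byte at rank `pos`.
    buckets = [[] for _ in range(radix)]
    for rec in records:
        buckets[rec[pos]].append(rec)

    # Each non-empty bucket is sorted independently on the next byte.
    # No recombination pass is needed: visiting the buckets in ascending
    # byte-value order already yields the sorted sequence.
    out = []
    for bucket in buckets:
        if bucket:
            out.extend(msb_radix_sort(bucket, key_len, pos + 1, radix))
    return out


keys = [b"dog", b"cat", b"cow", b"ant"]
print(msb_radix_sort(keys, key_len=3))
```

For the example keys, the buckets for `b"ant"` and `b"dog"` each hold a single record after the first distribution, so their recursion stops immediately and the remaining bytes of those keys are never examined.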
For MSB radix sorting, every bucket generates up to R new buckets of the next rank when its records are distributed according to the key byte of the present rank, where R is the radix of the key bytes, that is, the maximum number of possible byte values. This list of R buckets must then be sequentially scanned to find the non-empty bucket(s) for the distribution of the subsequent ranks. If the number of non-empty buckets actually encountered during the distribution is substantially less than R, then the sequential scan of the entire list of R buckets represents a significant waste of computer processing time.
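The cost of the sequential scan can be made concrete with a small instrumented sketch. It assumes equal-length byte-string keys and R = 256; the function `count_scans` is hypothetical and merely counts how many bucket slots an MSB radix sort inspects compared with how many of them are non-empty.

```python
def count_scans(records, key_len, pos=0, radix=256):
    """Return (slots_inspected, non_empty_found) for an MSB radix sort.

    Illustrative sketch only: names and parameters are assumptions.
    """
    # Buckets with at most one record, or with the key exhausted, are
    # not distributed further and therefore cost no scan.
    if len(records) <= 1 or pos == key_len:
        return 0, 0

    buckets = [[] for _ in range(radix)]
    for rec in records:
        buckets[rec[pos]].append(rec)

    inspected, found = radix, 0      # the whole list of R slots is scanned
    for bucket in buckets:
        if bucket:
            found += 1
            sub_inspected, sub_found = count_scans(bucket, key_len, pos + 1, radix)
            inspected += sub_inspected
            found += sub_found
    return inspected, found


keys = [b"dog", b"cat", b"cow", b"ant"]
print(count_scans(keys, key_len=3))  # (512, 5)
```

For these four keys, two distributions take place (one on the first byte and one inside the bucket shared by `b"cat"` and `b"cow"`), so 512 slots are inspected to locate only 5 non-empty buckets.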
In U.S. Pat. No. 4,809,158, McCauley proposes the use of an auxiliary table (his "bin used" table) to account for all byte values actually encountered during the current distribution. If the number of encountered values is less than a threshold number, McCauley then sorts the auxiliary table and uses each table entry value as an index (pointer) into the bucket list to identify the non-empty buckets, rather than sequentially scanning the entire bucket list. However, McCauley pays the price of the extra processing overhead required to accumulate and sort the auxiliary tables and to perform the additional indirect addressing into the list of all possible buckets. Moreover, McCauley's procedure may not be optimally efficient for some data distributions because of the inflexibility of the threshold value he uses to trigger the optional auxiliary table sort.
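The general idea of such an auxiliary table can be sketched as follows. This is a loose illustration of the approach as summarized above, not McCauley's actual procedure; the function name `non_empty_buckets`, the list `used`, and the default `threshold` of 16 are all assumptions made for the example.

```python
def non_empty_buckets(records, pos, radix=256, threshold=16):
    """Distribute on byte `pos` and return the non-empty buckets in order.

    Illustrative sketch only: names and the threshold value are assumptions.
    """
    buckets = [[] for _ in range(radix)]
    used = []                        # auxiliary list of byte values encountered
    for rec in records:
        value = rec[pos]
        if not buckets[value]:
            used.append(value)       # first record seen for this byte value
        buckets[value].append(rec)

    if len(used) < threshold:
        # Few distinct values: sort the short auxiliary list and use each
        # entry as an index into the bucket list (indirect addressing).
        return [buckets[value] for value in sorted(used)]

    # Many distinct values: fall back to scanning all `radix` slots.
    return [bucket for bucket in buckets if bucket]


keys = [b"dog", b"cat", b"cow", b"ant"]
print(non_empty_buckets(keys, pos=0))
```

The sketch makes the overhead noted above visible: the `used` list must be maintained during every distribution and sorted afterwards, and the fixed `threshold` decides between the two paths without regard to the actual data distribution.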
Refer also to Aho et al., "Design and Analysis of Computer Algorithms", copyright 1974 by Addison-Wesley Publishing Company, pp. 76-97 and especially pp. 79-84 regarding a radix sort of keys having unequal length using a preprocessing step to avoid the time needed to scan empty buckets. Aho et al. observed that a list of occupied buckets made during the distribution phase can be used to reduce the time necessary to link occupied buckets during the collection phase of a distributed radix sort. Although McCauley improves on Aho et al. by proposing to use such a list for scanning and linking the occupied buckets only when they are in a countably insignificant minority, McCauley's method does not avoid the additional processing overhead required to develop the auxiliary tables suggested by Aho et al.
There is a clearly felt need in the art to increase the efficiency of sorting techniques applied to very large numbers of records. Until now, no method was known in the art for optimally and dynamically minimizing the collection phase activity in a bucket sort in accordance with an occupied bucket table accumulated during the distribution phase without additional processing overhead. This unresolved deficiency is solved by the present invention in the manner described below.