1. Field of the Invention
This invention relates to systems for distributive sorting and, more specifically, to an external Most Significant Digit (MSD) radix sorting system that switches adaptively to comparison-based sorting of distribution bins according to a predetermined bin-count threshold test and immediately replaces sorted records taken from internal memory with new unsorted records to lengthen output strings.
2. Description of the Related Art
Sorting is the process of arranging items in an "order" and is generally acknowledged to be one of the most time-consuming computer-implemented procedures. The procedures known in the art for sorting computer data records according to their associated key fields may be loosely classified in terms of efficiency as a function of the number of records to be sorted.
For small numbers of records (e.g., 2-100), a first class of sorting procedures having minimal overhead steps unrelated to the number of records is most efficient. This first class includes the insertion, selection and bubble sorts and generally comprises the simplest procedures, which require a sorting time proportional to the square of the number of records to be sorted (N²). A second class of sorting procedures is most efficient for intermediate group sizes of up to 100,000 records. This second class requires sorting time proportional to N log₂ N and includes the Quicksort and Heapsort procedures known in the art. A third class of sorting procedures is generally useful for very large groups of records because such procedures can be efficiently implemented with modern computers and require a sorting time proportional to DN, where D is the number of sort key digits and N is the number of records in the group, but these procedures also require substantial overhead steps that are independent of the number of records. This class includes the distribution (bin) or "radix" sorting procedure. The numerous sort overhead steps make this class of procedures inefficient for smaller numbers of records.
Classical sorting procedures known in the art may also be characterized according to whether or not the sort is accomplished entirely within the internal memory local to the Central Processing Unit (CPU) executing the sort. An "external sort" denominates the class of sorting techniques applicable to data files that exceed the capacity of primary or internal memory. This class of sorting procedures relies on additional secondary storage, such as Direct Access Storage Devices (DASDs), tapes and drums. In the merge sort, which is one type of external sort procedure, subsets of a file are moved (read) into internal memory, ordered internally, and then rewritten in sorted order to an external device or secondary storage facility. One such technique, the "replacement-selection" sort, produces from the unordered "input file" an intermediate file containing one or more ordered lists or strings of records. Replacement-selection sorting produces ordered strings of varying length, the average length being twice the capacity of the internal memory. The record strings may then optimally be merged into one ordered string by forming a "minimal merge Huffman tree", as is known in the art. Most external sorting methods in the art for data stored on external disk drives are merge-based.
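By way of illustration only, the replacement-selection technique described above may be sketched as follows. This is a minimal sketch under stated assumptions and not a definitive implementation: the keys are assumed to be integers, internal memory is modeled as a heap holding at most `capacity` records, and the names `replacement_selection`, `capacity` and `_END` are hypothetical and do not appear in the art cited herein.

```python
import heapq
from itertools import islice
from typing import Iterable, Iterator, List

_END = object()  # hypothetical sentinel marking an exhausted input file


def replacement_selection(records: Iterable[int], capacity: int) -> Iterator[List[int]]:
    """Yield ordered strings (runs) while holding at most `capacity` records."""
    source = iter(records)
    # Each heap entry is (run_number, key); records tagged for a later run
    # sort after every record belonging to the current run.
    heap = [(0, key) for key in islice(source, capacity)]
    heapq.heapify(heap)
    current_run = 0
    run: List[int] = []
    while heap:
        run_no, key = heapq.heappop(heap)
        if run_no != current_run:
            # The current run is exhausted; emit it and start the next one.
            yield run
            run = []
            current_run = run_no
        run.append(key)
        replacement = next(source, _END)
        if replacement is not _END:
            # A replacement may extend the current run only if it is not
            # smaller than the key just written; otherwise it must wait.
            tag = run_no if replacement >= key else run_no + 1
            heapq.heappush(heap, (tag, replacement))
    if run:
        yield run
```

With ten records and a capacity of three, the sketch emits strings longer than the memory capacity whenever incoming records can still join the string being written, which is the property that lengthens the ordered strings.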
The well-known distributive or radix sorting procedure requires sorting time proportional to DN. Such procedures employ one of two available approaches for recursively distributing records according to their key field values. The keys are distributed to form one or more subgroups and the distribution is collected so as to preserve or maintain an order among the subgroups. The distribution is accomplished by comparing each key against an extrinsic attribute and then assigning the key to a subgroup or bin. The collection sequence preserves the overall order of the key field representations. Each key field is herein presumed to contain D "digits" or bytes, each digit having a radix of M. Each key field in a group of records is then distributed to one of M bins selected in accordance with the value of one of the key digits. After the group is completely distributed among M new subgroups or bins, the distribution process is repeated for each new subgroup in turn for another key digit. The process concludes when the smallest subgroups are transferred in order to an output area. Depending on the approach selected, the distribution of key field representations begins either according to the Least Significant Digit (LSD) or the Most Significant Digit (MSD) in the key field. The only difference between these two approaches is the key field scanning direction.
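The LSD approach described above may be illustrated by the following minimal sketch. It assumes non-negative integer keys having at most `num_digits` digits of radix `radix`; the function and parameter names are hypothetical and chosen for illustration only.

```python
from typing import List


def lsd_radix_sort(keys: List[int], num_digits: int, radix: int = 10) -> List[int]:
    """Sort non-negative integers by distributing on each digit, LSD first."""
    for d in range(num_digits):  # d = 0 is the least significant digit
        bins: List[List[int]] = [[] for _ in range(radix)]  # one bin per digit value
        for key in keys:
            bins[(key // radix**d) % radix].append(key)
        # Collect the bins in ascending digit order. Appending preserves the
        # order established by earlier passes, so each pass is stable.
        keys = [key for b in bins for key in b]
    return keys
```

Each of the D passes touches all N keys once, which is the source of the DN sorting time noted above.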
The MSD radix sort procedure requires that unsorted bins be maintained in storage while the procedure continues recursively through the key field digits until all the records in the first bin in the first rank are sorted and moved to the output area. Then the second bin in the first rank is similarly distributed recursively according to the second MSD, the third MSD, etc., and the third bin in the first rank is similarly distributed and so forth. Every group of record keys generates up to M new subgroups of next rank during distribution (M=the radix of each key digit). Each of the subgroups is then sorted on the next MSD to create a series of lower ranks. The entire process forms a sort tree where the root represents the original group of record keys, the interior nodes represent subgroups subject to further distribution passes and the leaf nodes represent the final single element or LSD sort bins that are moved to the output area. Because the MSD radix sort is a depth-first procedure, the distributed but unsorted bins at each rank must be maintained in memory awaiting completion of the deep distribution pass for each preceding bin sort. This makes the traditional radix bin sort an "internal" sort procedure accomplished entirely within the local internal memory.
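The depth-first character of the MSD procedure may be seen in the following minimal sketch, which assumes non-negative integer keys of at most `num_digits` digits and holds every bin in internal memory. The names used are hypothetical, and the sketch shows only the classical procedure described above.

```python
from typing import List, Optional


def msd_radix_sort(keys: List[int], num_digits: int, radix: int = 10,
                   digit: Optional[int] = None) -> List[int]:
    """Sort non-negative integers by recursive distribution, MSD first."""
    if digit is None:
        digit = num_digits - 1  # begin with the most significant digit
    if len(keys) <= 1 or digit < 0:
        return list(keys)  # leaf node: single element or no digits left
    bins: List[List[int]] = [[] for _ in range(radix)]
    for key in keys:
        bins[(key // radix**digit) % radix].append(key)
    output: List[int] = []
    for b in bins:
        # Depth-first: the first bin is sorted completely on the remaining
        # digits before the next bin is examined, so the later bins of this
        # rank remain held in memory in the meantime.
        output.extend(msd_radix_sort(b, num_digits, radix, digit - 1))
    return output
```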
Since distribution-based sorting procedures are "internal sorts", they produce a sorted list or "string" of data records with a length equal to the size of available internal memory. When sorting a data file larger than available internal memory space, several strings are produced by a radix bin sort and these must be merged together with a merge sort to produce the final sorted output file. This merge requirement reduces the sorting efficiency in proportion to the logarithm of the number of intermediate strings produced by the distribution-based sorting procedure.
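The logarithmic merge cost noted above may be illustrated by counting merge passes. The following sketch assumes, for illustration only, that each pass merges groups of up to `fan_in` strings into one and rereads every record; `merge_passes` and `fan_in` are hypothetical names, and `fan_in` is assumed to be at least 2.

```python
def merge_passes(num_strings: int, fan_in: int) -> int:
    """Count the passes needed to merge `num_strings` strings into one."""
    passes = 0
    while num_strings > 1:
        # Each pass replaces every group of up to `fan_in` strings with one.
        num_strings = -(-num_strings // fan_in)  # ceiling division
        passes += 1
    return passes
```

Under these assumptions, 64 strings require 6 passes with a two-way merge and 2 passes with an eight-way merge, so the number of passes over the data grows with the logarithm of the number of strings.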
Many practitioners in the art have suggested improvements to computer-implemented external sorting procedures to increase sorting speed. For instance, in U.S. Pat. No. 4,575,798, Eugene E. Lindstrom et al. disclose an external sorting method that employs random sampling of the key fields to develop a template for partitioning in a single pass the unsorted file into equal size partitions of records, each partition being small enough to fit within available internal memory. Their technique is optimized for associative external memory applications.
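The general idea of choosing partition boundaries from a random sample of keys may be sketched as follows. This sketch is not the method of the cited patent; it is a simplified illustration that assumes integer keys held in a list, and the names `choose_splitters`, `partition`, `sample_size` and `seed` are hypothetical.

```python
import bisect
import random
from typing import List


def choose_splitters(keys: List[int], num_partitions: int,
                     sample_size: int, seed: int = 0) -> List[int]:
    """Pick num_partitions - 1 boundary keys from a sorted random sample."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Evenly spaced sample keys approximate evenly sized partitions.
    return [sample[(i * len(sample)) // num_partitions]
            for i in range(1, num_partitions)]


def partition(keys: List[int], splitters: List[int]) -> List[List[int]]:
    """Assign every key to its partition in a single pass over the file."""
    parts: List[List[int]] = [[] for _ in range(len(splitters) + 1)]
    for key in keys:
        parts[bisect.bisect_right(splitters, key)].append(key)
    return parts
```

Because every key in one partition is no greater than every key in the next, partitions sorted separately may be concatenated without a merge.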
Reference is made to U.S. Pat. No. 4,210,961, wherein Duane L. Whitlow discloses a classical merge-based external sorting procedure.
Practitioners in the art have also suggested many improvements to the distribution-based internal radix sorting procedure. For instance, in U.S. Pat. No. 4,809,158, Peter B. McCauley proposes the use of an auxiliary table (his "bin used" table) to account for all digit values actually encountered during the current distribution of a group of key field representations. If the number of encountered values is less than a predetermined threshold number, McCauley then sorts the auxiliary table and uses the table entry value to index (point) into the list of subgroups and identify the non-empty subgroup rather than merely sequentially scanning the entire list of M subgroups. McCauley adds the additional overhead steps of accumulating and sorting his "auxiliary tables" and indirectly addressing the subgroup list. Also, in copending patent application Ser. No. 07/813,246, filed on Dec. 23, 1991, now U.S. Pat. No. 5,396,622, Kai Wan Lee et al. disclose an MSD sorting procedure using a dynamic branching table to eliminate the steps of collecting empty subgroups during the recursive distribution phase of their sort. The Lee et al. application solves several of the problems seen in the McCauley invention and is entirely incorporated herein by this reference.
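The general notion of recording which digit values were actually encountered, so that collection need not scan all M bins, may be loosely sketched as follows. This is an illustrative assumption and not the procedure of either cited reference; the names `distribute_with_bin_used`, `bin_used` and `threshold` are hypothetical, and integer keys are assumed.

```python
from typing import List


def distribute_with_bin_used(keys: List[int], digit: int, radix: int = 256,
                             threshold: int = 16) -> List[List[int]]:
    """Distribute keys on one digit and return the non-empty bins in order."""
    bins: List[List[int]] = [[] for _ in range(radix)]
    bin_used: List[int] = []  # digit values encountered, in arrival order
    for key in keys:
        value = (key // radix**digit) % radix
        if not bins[value]:
            bin_used.append(value)  # first record seen for this digit value
        bins[value].append(key)
    if len(bin_used) < threshold:
        # Few bins in use: sort the short table and index the bins directly.
        order = sorted(bin_used)
    else:
        # Many bins in use: scan the whole list of M bins sequentially.
        order = range(radix)
    return [bins[v] for v in order if bins[v]]
```

The sketch makes the trade-off visible: sorting the auxiliary table is worthwhile only when it is much shorter than the list of M bins.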
Thus, the present practice in the art for sorting large data record files efficiently is to divide the file into partitions small enough to fit within the internal memory, load each partition into memory and perform an efficient internal (e.g., distribution-based) sort on the loaded partition. When the internal sort is completed, the sorted partition is moved to an output area as a single record string. The next partition is then loaded. The process is repeated until all partitions have been sorted into strings. These strings are then processed with an inefficient merge-based procedure to produce the final sorted record file.
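The present practice summarized above may be sketched as follows, under simplifying assumptions: the strings are kept as in-memory lists rather than written to secondary storage, the built-in `list.sort` stands in for any efficient internal sort, and the final merge uses the standard library function `heapq.merge`. The names `external_sort` and `memory_capacity` are hypothetical.

```python
import heapq
from itertools import islice
from typing import Iterable, List, Tuple


def external_sort(records: Iterable[int], memory_capacity: int) -> Tuple[List[int], int]:
    """Return the sorted records and the number of intermediate strings."""
    source = iter(records)
    strings: List[List[int]] = []
    while True:
        # Load the next partition that fits within internal memory.
        partition = list(islice(source, memory_capacity))
        if not partition:
            break
        partition.sort()  # stand-in for the internal sort
        strings.append(partition)  # one string per loaded partition
    # Merge all strings into the final sorted record file.
    return list(heapq.merge(*strings)), len(strings)
```

Because no string can exceed the memory capacity in this arrangement, ten records sorted with a capacity of three yield four strings to be merged.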
There is accordingly a clearly felt need in the art for an improved external sorting procedure that reduces or eliminates the time required for the less efficient merger of the strings produced by a sequence of efficient internal sorts without replacement-selection (as opposed to the efficient merger of strings made by a sequence of less efficient internal sorts). The efficiency advantages of the many recent improvements to the internal distribution-based sorting procedures are alone insufficient to offset the inefficiencies associated with repeated input and output of record groups and the relatively slow merger of the sorted record strings. These unresolved problems and deficiencies are clearly felt in the art and are solved by this invention in the manner described below.