1. Field of the Invention
The present invention is in the field of data processing, and relates more specifically to systems and methods for efficiently sorting large data files.
2. Description of Related Art
The sorting of data files that are too large to fit all at once in the random access memory of a computer system is known as “external sorting”, because the data file to be sorted resides at least in part on storage media external to the computer's main memory. The speed-limiting factors in external sorting frequently concern aspects of reading and writing data from and to the external media, which are usually the slowest storage components in the system. External sorting was originally developed for data residing on tape drives and other sequentially accessible storage devices, and this technology was extended to disk and drum external storage. See Section 5.4.9 of D. Knuth, The Art of Computer Programming (Vol. 3, Rev. 1998).
There have been further developments to improve the efficiency of disk-based external sorting by taking advantage of the random access capabilities of these media. Commonly assigned U.S. Pat. No. 4,210,961 (Whitlow, et al.) discloses several such techniques.
Under current approaches, as illustrated in FIG. 1, external sorting is typically performed using a “sort-merge” procedure involving steps such as the following:
(1) “Pre-string generation phase”: setup, initialization and planning for the job.
(2) “String generation phase”: (a) reading a core load of data from the input file 100 (called “sortin”) to an input buffer 102; sorting the core load of data using an internal sorting algorithm (quicksort, shellsort, etc.), into an output buffer 104; and writing the sorted core load out as a sorted “string” to a temporary disk file called “sortwork” 105; and (b) repeating step 2(a) until the entire input file has been processed into one or more sorted strings (106, 108, etc.) in sortwork. If the pre-string generation phase (step (1)) showed that the entire input file could be sorted with one core load, then the sorted output is written directly to the output file, called “sortout” (110), and the process is completed. Otherwise, there must be a “merge phase.”
(3) “Merge phase”: (a) (if necessary) successively merging as many strings in sortwork 105 as can be merged at once, in order to form a smaller number of longer sorted strings; (b) (if necessary) repeating step 3(a) until only one further merge pass is needed in order to obtain a single sorted string; and (c) the “final merge” (which, depending on the size of the input file, may be the only merge pass, obviating the foregoing steps 3(a) and 3(b)), in which the last set of strings is merged and the result written, as a fully sorted single string, to sortout 110.
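By way of illustration only, the conventional sort-merge procedure outlined above may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from any cited reference: records are assumed to be newline-free text strings, temporary files stand in for sortwork, and the names external_sort, generate_strings, merge_strings, core_load and merge_order are hypothetical.

```python
import heapq
import os
import tempfile
from itertools import islice


def generate_strings(records, core_load):
    """String generation phase: sort one core load at a time and
    write each sorted core load to its own temporary 'sortwork' file."""
    paths = []
    it = iter(records)
    while True:
        chunk = list(islice(it, core_load))  # read one core load
        if not chunk:
            break
        chunk.sort()  # internal sort (stands in for quicksort, shellsort, etc.)
        fd, path = tempfile.mkstemp(suffix=".sortwork")
        with os.fdopen(fd, "w") as f:
            f.writelines(rec + "\n" for rec in chunk)
        paths.append(path)
    return paths


def read_string(path):
    """Yield the records of one sorted string, in order."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")


def merge_strings(paths, merge_order):
    """Merge phase: merge up to merge_order strings at a time until a
    single final merge suffices, then perform the final merge."""
    if merge_order < 2:
        raise ValueError("merge_order must be at least 2")
    while len(paths) > merge_order:
        group, paths = paths[:merge_order], paths[merge_order:]
        fd, merged = tempfile.mkstemp(suffix=".sortwork")
        with os.fdopen(fd, "w") as f:
            merged_records = heapq.merge(*(read_string(p) for p in group))
            f.writelines(rec + "\n" for rec in merged_records)
        for p in group:
            os.remove(p)
        paths.append(merged)
    # Final merge: the result here plays the role of 'sortout'.
    result = list(heapq.merge(*(read_string(p) for p in paths)))
    for p in paths:
        os.remove(p)
    return result


def external_sort(records, core_load=4, merge_order=3):
    """Sort a list of records that is assumed not to fit in one core load."""
    if len(records) <= core_load:
        # Entire input fits in one core load: no sortwork, no merge phase.
        return sorted(records)
    return merge_strings(generate_strings(records, core_load), merge_order)
```

Note that in this sketch every record of a large input is written to, and read back from, temporary storage at least once, regardless of how well ordered the input already is.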
The prior art sort-merge technique is capable of efficiently ordering large data files that are initially presented in completely random order. The prior art approach, however, in certain aspects, assumes the “worst case” of random input. As a result, for a large input file, it typically requires that the input file 100 (sortin) be read in its entirety and the sortwork file 105 be written in its entirety, and then that the sortwork file 105 be read in its entirety and the output file 110 (sortout) be written in its entirety. A considerable amount of costly disk activity is thus entailed.
In practice, however, many input files are in a partially “presorted” condition even before they are subjected to the sort operation. Some files are updated in production by appending new records to the end of an already sorted base file, leaving the sorted base data completely intact. Other files are minimally modified between sort jobs, leaving long runs of sorted data in the file. Indeed, on occasion, some input files are provided in 100% sorted order. “Sorting” such files could require very little work, or practically none, if the sort procedure took advantage of the presorted condition already existing in the file. It would be most preferable in such a case if the successive reading from and writing to disk noted above could be eliminated or reduced.
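The degree to which a file is presorted may be illustrated by locating its maximal runs of already ordered records. The following is a minimal sketch for illustration only; it is not presented as the technique of the invention, and the function name find_sorted_runs and the in-memory list standing in for the input file are assumptions.

```python
def find_sorted_runs(records):
    """Return (start, end) index pairs, end exclusive, of the maximal
    non-decreasing runs found in a single pass over the records."""
    runs = []
    start = 0
    for i in range(1, len(records)):
        if records[i] < records[i - 1]:  # order breaks: the current run ends here
            runs.append((start, i))
            start = i
    if records:
        runs.append((start, len(records)))
    return runs
```

Under these assumptions, a fully sorted file yields exactly one run, and a sorted base file with unsorted records appended yields one long run followed by a number of short ones.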