Sorting is the process of ordering items based on specified criteria. In data processing, sorting refers to the sequencing of records using a key value determined for each record. If a group of records is too large to be sorted within available random access memory, then a two-phase process, referred to as external sorting, may be used. In the first phase of the external sorting process, a portion of the records is typically sorted and the partial result, referred to as a sorted run, is stored to temporary external storage. Sorted runs are generated until the entire group of records is exhausted. Then, in the second phase of the external sorting process, the sorted runs are merged, typically to a final output record group. If all of the sorted runs cannot be merged in one pass, then the second phase may be executed multiple times in a process commonly referred to as a multi-pass or multi-phase merge. In a multi-phase merge, existing runs are merged to create a new replacement set containing fewer, longer runs.
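The two phases described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: records are taken to be integer keys stored one per line in temporary text files, and the names `write_run`, `read_run`, `generate_runs`, `merge_runs`, `external_sort`, `memory_capacity`, and `fan_in` are hypothetical. The standard-library `heapq.merge` stands in for the merge algorithm.

```python
import heapq
import os
import tempfile


def write_run(sorted_records):
    """Store one sorted run in a temporary file, one integer key per line."""
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{r}\n" for r in sorted_records)
    return path


def read_run(path):
    """Stream the records of a run back from external storage."""
    with open(path) as f:
        for line in f:
            yield int(line)


def generate_runs(records, memory_capacity):
    """Phase 1: sort memory-sized portions and store each as a sorted run."""
    paths, chunk = [], []
    for r in records:
        chunk.append(r)
        if len(chunk) == memory_capacity:
            paths.append(write_run(sorted(chunk)))
            chunk = []
    if chunk:
        paths.append(write_run(sorted(chunk)))
    return paths


def merge_runs(paths, fan_in):
    """Phase 2: merge at most fan_in runs per step until one run remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    while len(paths) > 1:
        next_paths = []
        for i in range(0, len(paths), fan_in):
            group = paths[i:i + fan_in]
            merged = heapq.merge(*(read_run(p) for p in group))
            next_paths.append(write_run(merged))
            for p in group:
                os.remove(p)  # the new run replaces the runs it was merged from
        paths = next_paths  # each pass leaves fewer, longer runs
    return paths[0]


def external_sort(records, memory_capacity, fan_in):
    """Sort records using bounded memory; returns the sorted keys as a list."""
    paths = generate_runs(records, memory_capacity)
    if not paths:
        return []
    final_path = merge_runs(paths, fan_in)
    result = list(read_run(final_path))
    os.remove(final_path)
    return result
```

For example, `external_sort([5, 3, 9, 1, 7, 2, 8], memory_capacity=3, fan_in=2)` produces three sorted runs in the first phase and, because only two runs may be merged at a time, needs two merge passes in the second.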
The records within a sorted run are typically written to external storage in sequential blocks of data, such that each block includes an integral number of records. The performance of typical merging and forecasting algorithms can be greatly affected by the size of the record block. For example, when sorting randomly ordered records, poor merge performance may result from the selection of a small block size because disk latency, which may be orders of magnitude larger than any other delay (e.g., memory access latency) encountered during merging, can dominate processing time. One method of increasing merge performance is to establish a large block size so that access costs (i.e., time spent locating the blocks) are insignificant compared to transfer costs (i.e., time spent reading the blocks). However, a large block size may also decrease performance: because fewer blocks fit in available memory at once, fewer runs can be merged per pass, which may result in a multi-pass merge and, consequently, increased processing time and increased temporary storage requirements.
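The trade-off described above can be made concrete with a rough cost model. The sketch below is a simplification under stated assumptions: only read costs are counted, every block incurs one fixed access delay, one buffer is reserved for output, and the names `merge_passes`, `estimate_merge_time`, `access_s`, and `bytes_per_s` are hypothetical.

```python
def merge_passes(num_runs, fan_in):
    """Count the passes needed when at most fan_in runs are merged at a time."""
    passes = 0
    while num_runs > 1:
        num_runs = -(-num_runs // fan_in)  # ceiling division
        passes += 1
    return passes


def estimate_merge_time(total_bytes, num_runs, memory_bytes, block_bytes,
                        access_s, bytes_per_s):
    """Return (passes, estimated read time in seconds) for a given block size."""
    # Assumption: one buffer per input run plus one buffer for output.
    fan_in = memory_bytes // block_bytes - 1
    if fan_in < 2:
        raise ValueError("block size leaves room for fewer than two input runs")
    passes = merge_passes(num_runs, fan_in)
    blocks = -(-total_bytes // block_bytes)        # blocks read per pass
    access_cost = blocks * access_s                # time spent locating blocks
    transfer_cost = total_bytes / bytes_per_s      # time spent reading blocks
    return passes, passes * (access_cost + transfer_cost)
```

With these illustrative figures, merging 4 runs totalling 1000 bytes in 500 bytes of memory at 1 second per access and 100 bytes per second, a 100-byte block allows a single pass estimated at 20 seconds. A 150-byte block lowers the cost of each pass to 17 seconds but forces two passes, for an estimated 34 seconds.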
Another method for increasing performance during the merge phase is to eliminate time spent stalled on input (i.e., waiting for a record block to be retrieved from external storage) by reading blocks from storage in advance of their need while the merge is in progress. One algorithm used to achieve such parallelism is referred to as forecasting with floating buffers. This forecasting algorithm, designed to execute concurrently with the merge algorithm, reads blocks in the same sequence that the merge algorithm requires them. A typical forecasting algorithm determines which run to read next by comparing the largest key values of the last blocks read from each of the runs being merged. The run associated with the smallest such key is the run from which the next block is read, because the merge will exhaust that block first. The buffers into which blocks are read may hold data from any run, and are thus said to float among the runs.
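The run-selection rule described above can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: runs are modelled as in-memory lists of non-empty sorted blocks, ties between equal keys are broken by run index, and the function name `forecast_read_order` is hypothetical. Buffer management and concurrency with the merge are omitted.

```python
import heapq


def forecast_read_order(runs):
    """Return the (run_index, block_index) pairs in the order blocks are read.

    runs: list of runs; each run is a list of blocks; each block is a
    non-empty list of keys in ascending order.
    """
    order = []
    heap = []
    # The first block of every run is needed before the merge can start.
    for r, blocks in enumerate(runs):
        if blocks:
            order.append((r, 0))
            heap.append((blocks[0][-1], r, 0))  # largest key of that block
    heapq.heapify(heap)
    while heap:
        # The run whose last-read block has the smallest largest key
        # is the run whose block the merge will exhaust first.
        _, r, b = heapq.heappop(heap)
        nxt = b + 1
        if nxt < len(runs[r]):
            order.append((r, nxt))
            heapq.heappush(heap, (runs[r][nxt][-1], r, nxt))
    return order
```

For two runs with blocks `[[1, 4], [9, 12]]` and `[[2, 3], [5, 6], [20, 21]]`, the second run's first block ends at key 3, which is smaller than 4, so the next block is read from the second run before the first.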