The present invention relates to the use of multiple processors to merge sorted lists, and in particular to the efficient partitioning of sorted lists to make use of all the processors during the final sort and merge of the lists.
Sorting lists into ascending or descending order while making maximum use of multiple processors throughout the entire sorting process has been the subject of much study. It is relatively simple to divide a list of N elements into P nearly equal-sized lists, one for each of P processors to sort individually. Once these lists are sorted, they must be merged into one list. It is desirable that merge performance scale linearly with the number of processors, P (i.e., 2 processors yield twice the performance of 1 processor, and 4 processors yield quadruple the performance of 1 processor). In other words, with N total elements to be merged using P processors, the time to merge should be proportional to N/P. This is where the prior art has run into difficulties. The time to merge using the most efficient prior art method is proportional to (N/P) log₂ P. To illustrate, consider what happens to merge performance as a computer system is upgraded from 2 to 4 processors. If merge performance is linear, the merge time goes from N/2 to N/4, a twofold performance increase. If the merge time is proportional to (N/P) log₂ P, it goes from (N/2) log₂ 2 = (N/2)(1) = N/2 to (N/4) log₂ 4 = (N/4)(2) = N/2, for no performance increase even though two additional processors were used.
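The comparison above can be checked numerically. The following is a minimal sketch in Python, assuming a proportionality constant of 1 for both cost models; the function names and the value of N are hypothetical and chosen only for illustration.

```python
import math

def linear_merge_time(n, p):
    """Idealized merge time proportional to N/P (constant of 1 assumed)."""
    return n / p

def prior_art_merge_time(n, p):
    """Merge time proportional to (N/P) * log2(P), the prior art bound cited in the text."""
    return (n / p) * math.log2(p)

n = 1_000_000  # hypothetical number of elements to merge

# Compare the two models for a few processor counts.
for p in (2, 4, 8):
    print(p, linear_merge_time(n, p), prior_art_merge_time(n, p))
```

Under these assumptions the second model gives the same merge time for 2 and for 4 processors, which is the lack of improvement described above, while the linear model halves the time.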
One reference, Akl, S. G., and Santoro, N., "Optimal Parallel Merging and Sorting Without Memory Conflicts," IEEE Trans. on Computers, Vol. C-36, No. 11, Nov. 1987, pp. 1367-69, works on two lists at a time, finding the element value at which the number of elements in the two lists below the value equals the number of elements in the two lists above the value. It is then a simple matter to partition the two lists at that value, so that one processor merges the combined lists below the partition while the other processor merges the combined lists above the partition. The two merged lists are then concatenated, or simply strung together, to form one new list sorted in the correct order. If there are more than two lists, the process is done simultaneously for each pair of lists, leaving half the number of lists. These lists are partitioned again to generate two pairs of lists as above, but, since the number of lists to be merged is now only half the original quantity, each of the two pairs of lists must be partitioned yet another time so that enough pairs of lists are available to keep all the processors busy. A pair of lists is first partitioned in half, and then each half is partitioned to yield a one-quarter and a three-quarter partition, so that each processor has a pair of lists to merge. This clearly is inefficient, since multiple merge phases are necessary to generate the final sorted list.
If it were possible to partition all the lists at once into P partitions such that each processor performed only a single merge phase, then a significant performance increase would result, and the desired sorted list would be generated by simple concatenation of the merged partitions. This has been attempted by partitioning based on sample sort keys. Another reference, Quinn, M. J., "Parallel Sorting Algorithms for Tightly Coupled Multi-processors," Parallel Computing, 6, 1988, pp. 349-367, chooses partitioning key values from the first list to be merged such that these keys partition the first list into P partitions. These keys are then used to partition the remaining lists as close as possible to the sample key values. This leads to a load imbalance among processors due to the approximated partitions and a corresponding degradation from linear merge performance. Furthermore, if the data in the lists is skewed from a random distribution (a common occurrence in relational data base operations), then the resulting approximate partitions can differ greatly in size, thereby exacerbating the load imbalance problem. Even with randomly distributed data, the literature indicates that no performance is gained beyond 10 processors when sorting 10,000 records. This is attributed to load imbalances from approximate partitioning.
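The load imbalance that results from sample-key partitioning can be illustrated with a short sketch in Python. It is not the implementation of the cited reference; it assumes that the P-1 keys are taken at evenly spaced positions in the first list and that every list is cut by binary search at those keys. The names sample_keys and partition_sizes and the example data are hypothetical.

```python
from bisect import bisect_right

def sample_keys(first_list, p):
    """Pick P-1 keys that split the first sorted list into P near-equal parts.

    Assumes len(first_list) >= p.
    """
    n = len(first_list)
    return [first_list[(i * n) // p - 1] for i in range(1, p)]

def partition_sizes(sorted_lists, p):
    """Total number of elements each of the P processors would have to merge."""
    keys = sample_keys(sorted_lists[0], p)
    sizes = [0] * p
    for lst in sorted_lists:
        # Cut every list at the sample keys; elements <= key fall to the left.
        cuts = [0] + [bisect_right(lst, k) for k in keys] + [len(lst)]
        for i in range(p):
            sizes[i] += cuts[i + 1] - cuts[i]
    return sizes

# First list: uniformly spread values 0..99.
first = list(range(100))
# Second list: skewed, all 100 values fall in 0..24 (already sorted).
second = [x // 4 for x in range(100)]

print(partition_sizes([first, second], 4))  # [125, 25, 25, 25]
```

With 200 elements and 4 processors, an exact partition would give each processor 50 elements. Here the keys taken from the first list leave one processor with 125 elements and the other three with 25 each, so the merge time is governed by the most heavily loaded processor.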