1. Technical Field
This invention generally relates to the use of multiple processors to merge sorted lists, and more specifically relates to the efficient partitioning of sorted lists containing duplicate entries making use of all the processors during the final sort and merge of the lists.
2. Background Art
The development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era. Since that time, computer systems have evolved into extremely sophisticated devices, and computer systems may be found in many different settings. Computer systems typically include a combination of hardware (e.g., semiconductors, circuit boards, etc.) and software (e.g., computer programs). As advances in semiconductor processing and computer architecture push the performance of the computer hardware higher, more sophisticated computer software has evolved to take advantage of the higher performance of the hardware, resulting in computer systems today that are much more powerful than just a few years ago.
One way in which computers have become more powerful is in their ability to work synchronously in performing tasks. For instance, sorting large volumes of data into an order of ascending or descending values can be done faster when using multiple processors, or “multiprocessors,” and has been the subject of much study. It is a relatively simple process to initially divide the list of N elements into P nearly equal sized lists for each of P processors to individually sort. Once these lists are sorted, they must be merged together into one list. It is desirable that the performance of this merge scale linearly with the number of processors (i.e., 2 processors yield twice the performance of 1 processor, 4 processors yield quadruple the performance of 1 processor). In other words, with N total elements to be merged using P processors, the time to merge should be proportional to N/P. This is where most of the prior art has run into difficulties. The time to merge using the most efficient prior art method is proportional to (N/P)*log2(P). To illustrate, consider what happens to merge performance as a computer system is upgraded from 2 to 4 processors. If the merge performance is linear, then the merge time goes from N/2 to N/4 for a twofold performance increase. If the merge time is proportional to (N/P)*log2(P), then the merge time goes from (N/2)*log2(2) = (N/2)*1 = N/2 to (N/4)*log2(4) = (N/4)*2 = N/2, for no performance increase even though two additional processors were used.
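The arithmetic above can be checked with a short sketch. The function names and the value of N are illustrative assumptions; the two formulas are the cost models stated in the text, in arbitrary time units.

```python
import math

def linear_merge_time(n, p):
    # Ideal model from the text: time proportional to N/P.
    return n / p

def log_merge_time(n, p):
    # Prior-art model from the text: time proportional to (N/P) * log2(P).
    return (n / p) * math.log2(p)

n = 1_000_000  # illustrative element count (assumption)
for p in (2, 4, 8):
    print(p, linear_merge_time(n, p), log_merge_time(n, p))
```

Under the second model the values for P = 2 and P = 4 are identical, which is the "no performance increase" case described above.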
One reference, Akl, S. G., and Santoro, N., “Optimal Parallel Merging and Sorting Without Memory Conflicts,” IEEE Trans. on Computers, Vol. C36, No. 11, November 1987, pp. 1367-69, describes a system that can work on two lists at a time, finding the value of the element where the sum of the number of elements in the two lists below the value is equal to the sum of the number of elements in the two lists above the value. It is then a simple matter to partition the two lists at that value and have one processor sort the combined lists below the partition while the other processor sorts the combined lists above the partition. The two resulting lists are then concatenated, or simply strung together, to form one new list sorted in the correct order. If there are more than two lists, the process is done simultaneously for each pair of lists, leaving half the number of lists. These lists are partitioned again to generate two pairs of lists as above, but, since the number of lists to be merged is now only half the original quantity, each of the two pairs of lists must be partitioned yet another time so that enough pairs of lists are available to keep all the processors busy. A pair of lists is first partitioned in half, and then each half is partitioned to yield a one-quarter and three-quarter partition so that each processor has a pair of lists to merge. This clearly is inefficient, since multiple merge phases are necessary to generate the final sorted list. If it were possible to partition all the lists at once into P partitions such that each processor could perform only a single merge phase, then a significant performance increase would result and the desired sorted list would be generated by simple concatenation of the merged partitions. This has been attempted by partitioning based on sample sort keys.
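The two-list partitioning idea can be illustrated with a minimal sketch. It uses a standard binary search to find the split point and is not taken from the cited paper; the name `split_two` and the sample lists are hypothetical.

```python
import heapq

def split_two(a, b, k):
    """Return (i, j) with i + j == k such that a[:i] and b[:j]
    together hold the k smallest elements of sorted lists a and b."""
    lo, hi = max(0, k - len(b)), min(k, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        j = k - i
        if a[i] < b[j - 1]:
            lo = i + 1   # a[i] belongs among the k smallest: take more from a
        else:
            hi = i       # taking i elements from a is already enough
    return lo, k - lo

a = [1, 3, 5, 7]
b = [2, 4, 6, 8]
i, j = split_two(a, b, (len(a) + len(b)) // 2)

# Each half could be merged by a different processor; here it is sequential.
low = list(heapq.merge(a[:i], b[:j]))
high = list(heapq.merge(a[i:], b[j:]))
merged = low + high  # concatenation yields the fully sorted list
```

With these inputs the split is (2, 2), so each half holds four elements and the two merges are equally sized.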
Another reference, Quinn, M. J., “Parallel Sorting Algorithms for Tightly Coupled Multi-processors,” Parallel Computing, 6, 1988, pp. 349-367, chooses partitioning key values from the first list to be merged such that these keys will partition the first list into P partitions. These keys are then used to partition the remaining lists as close as possible to the sample key values. This leads to a load imbalance among processors due to the approximated partitions and a corresponding degradation from linear merge performance. Furthermore, if the data in the lists is skewed from random distribution (a common occurrence in relational database operations), then the resulting approximate partitions can differ greatly in size, thereby exacerbating the load imbalance problem. Even with randomly distributed data, the literature indicates that no performance is gained beyond 10 processors when sorting 10,000 records. This is attributed to load imbalances from approximate partitioning.
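The load imbalance from sample-key partitioning can be seen in a small sketch. This is an illustration of the general idea only, not the cited algorithm. The function name, the choice of keys, and the sample data are assumptions, and the first list is assumed to hold at least P elements.

```python
from bisect import bisect_right

def sample_key_partition_sizes(sorted_lists, p):
    """Pick p-1 keys that split the first list evenly, cut every list
    at those keys, and return the total size of each of the p partitions."""
    first = sorted_lists[0]
    keys = [first[(len(first) * k) // p - 1] for k in range(1, p)]
    sizes = [0] * p
    for lst in sorted_lists:
        # Cut positions in this list closest to the sample key values.
        cuts = [0] + [bisect_right(lst, key) for key in keys] + [len(lst)]
        for i in range(p):
            sizes[i] += cuts[i + 1] - cuts[i]
    return sizes

# Skewed data: the first list is not representative of the others.
lists = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(sample_key_partition_sizes(lists, 2))  # prints [2, 10]
```

One processor would merge 2 elements while the other merges 10, which is the kind of imbalance the paragraph above describes for skewed data.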
The prior art has succeeded in efficiently partitioning any number of sorted lists when the lists do not contain duplicate entries. U.S. Pat. No. 5,179,699, Iyer et al., Jan. 12, 1993, “Partitioning Of Sorted Lists For Multiprocessors Sort and Merge,” teaches such a method. When given a large list to sort, the list is initially divided amongst available processors. Each processor sorts one of these lists. The sorted lists are then partitioned to create new lists such that each element in a new list has a value no smaller than any of the elements in the partitions before it, nor larger than any of the elements in the partitions following it. The lists in each partition are then simply strung together to provide a sorted list of all the elements. Maximum use of processors is obtained, and hence a near linear improvement in sort time is obtained when adding further processors.
The Iyer method works fine when all of the elements in the sorted lists are unique. After the list to be sorted is divided between “x” processors and individually sorted by each one, “x−1” processors then run a partitioning process to create “x−1” partition boundaries, or “x” partitions. Each processor performs the same partition process on the sorted lists. What differs amongst the processors is the percentage of elements that the processor assigns to the upper and lower partitions.
For instance, given 4 sorted lists, 3 partition boundaries are drawn to create 4 partitions. Three processors then create partition boundaries: the first assigns ¼ of the elements to the upper boundary and the rest to the lower; the second assigns ½ of the elements to the upper boundary and the rest to the lower; and the third processor assigns ¾ of the elements to the upper boundary and the rest to the lower. In this way, the Iyer method creates 4 partitions which then can be concatenated together to produce one final sorted list.
The Iyer method works inconsistently if duplicate entries are contained in the sorted lists. Sometimes the method works, but there are cases where duplicate elements are inserted into more than one partitioned list. Essentially what happens is that the partition boundaries cross, and the elements where the partition boundaries cross are counted more than once. Crossing of boundaries occurs when the partitioning process running separately on each processor includes elements in more than one partition. For instance, the partition boundaries might be drawn such that both partitions 1 and 2 contain the 6th element in the first list.
Whether the Iyer method works to sort lists given duplicate entries depends upon the location of the duplicate entries and the number of duplicates. The reason boundaries sometimes cross is that the Iyer method handles duplicate elements as equals in the partition process. Without a means to differentiate between the duplicate entries, each processor performing the partitioning process may not draw boundaries across the duplicate entries in the same manner. Yet, duplicate entries in large sorted lists are commonplace for many applications today. Thus, there exists a need to provide an improved method to partition sorted lists containing duplicate entries, and a method that remains more efficient when adding further processors.
According to a preferred embodiment, a method of sorting a list of elements with duplicate entries using multiple processors is disclosed. Using “P” processors, a list of elements is split into P lists and each processor pre-sorts a list. All pre-sorted lists are lined up to form a partitioning table, with each pre-sorted list making up a column in the table, and the first element from each pre-sorted list making up the first row in the table, and the second element from each pre-sorted list making up the second row, etc.
P−1 partition boundary lines are drawn through the partition table to create P equally sized partitions. Each partition boundary line is drawn such that every element below the line has a value larger than any element above the line, and every element above the line has a value smaller than any element below the line. Duplicate elements are uniquely “weighted” during the partitioning process. Thus, with respect to duplicate elements, each partition boundary line is drawn such that every duplicate element below the line weighs more than any duplicate element above the line, and every duplicate element above the line weighs less than any duplicate element below the line.
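One way to picture unique weighting is a composite comparison key. The text above does not say how the weights are assigned, so the tie-break used here (value, then column, then row) is purely an assumption, as are the names `weight` and `is_valid_boundary`. The sketch only checks whether a proposed boundary satisfies the stated condition.

```python
def weight(value, col, row):
    # Hypothetical unique weight: compare by value, then column, then row.
    return (value, col, row)

def is_valid_boundary(sorted_lists, cuts):
    """cuts[c] is the number of elements of column c above the line.
    The line is valid if everything above weighs less than everything below."""
    above = [weight(v, c, r)
             for c, lst in enumerate(sorted_lists)
             for r, v in enumerate(lst[:cuts[c]])]
    below = [weight(v, c, r + cuts[c])
             for c, lst in enumerate(sorted_lists)
             for r, v in enumerate(lst[cuts[c]:])]
    return not above or not below or max(above) < min(below)

# Three pre-sorted columns with many duplicate 5s.
table = [[1, 5, 5, 5], [5, 5, 5, 9], [2, 5, 5, 7]]
print(is_valid_boundary(table, [3, 0, 1]))  # True
print(is_valid_boundary(table, [2, 1, 1]))  # False
```

Both candidate lines place four elements above the boundary and look acceptable when the 5s are treated as equals. Under the assumed weighting only the first is accepted, so every processor would agree on the same line.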
Each processor finds a different partition boundary. The first processor finds a boundary with 1/P of the elements above its partition line, the second finds a boundary with 2/P of the elements above its line, and so on. In this manner, tabularized pre-sorted lists are grouped into P partitions, which are merged and re-sorted into P sorted lists. Finally, the P sorted lists are simply strung together to provide a sorted list of all elements. Maximum use of P processors is obtained, and a near linear improvement in sort time is obtained when adding further processors. The foregoing and other features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings.
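The overall flow (split, pre-sort, partition, merge, concatenate) can be sketched end to end. This is a sequential illustration under stated assumptions, not the disclosed implementation. The hypothetical `find_cuts` locates each boundary by brute force, sorting all elements with an assumed (value, column, row) tie-break, where a real partitioning process would search the table instead. Each loop iteration stands in for work one processor would do.

```python
import heapq

def find_cuts(columns, k):
    """Brute-force boundary: per-column counts of the k lightest elements,
    using an assumed (value, column, row) weight to break ties."""
    keyed = sorted((v, c, r)
                   for c, col in enumerate(columns)
                   for r, v in enumerate(col))
    cuts = [0] * len(columns)
    for _, c, _ in keyed[:k]:
        cuts[c] += 1
    return cuts

def partitioned_sort(data, p):
    n = len(data)
    # Split into p lists and pre-sort each (one list per processor).
    columns = [sorted(data[i * n // p:(i + 1) * n // p]) for i in range(p)]
    # p-1 boundaries with j/p of the elements above line j, plus both ends.
    bounds = ([[0] * p]
              + [find_cuts(columns, j * n // p) for j in range(1, p)]
              + [[len(col) for col in columns]])
    result = []
    for j in range(p):
        lo, hi = bounds[j], bounds[j + 1]
        runs = [columns[c][lo[c]:hi[c]] for c in range(p)]
        result.extend(heapq.merge(*runs))  # merge one partition, then append
    return result

data = [5, 3, 5, 1, 5, 9, 5, 2, 5, 7, 5, 5]
print(partitioned_sort(data, 3))
```

Because the assumed weights are unique, the boundaries for successive values of j never cross, so each element lands in exactly one partition even though most of the input consists of duplicates.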