The present invention relates generally to sorting of a series of data items. More particularly, the present invention relates to an improved system for sorting of a set of data in a parallel processor, or multiprocessor, environment.
Sorting of data has presented one of the most significant and important problems in many fields including the field of computer science. Often arising as a component of a more general task such as organizing or searching, sorting problems have been ubiquitous. For example, it is frequently necessary in the business world to sort bills, to arrange mail by postal code or to maintain ordered databases. As other examples, efficient computer searching requires a sorted list, and computer operating systems require process lists sorted by run-times in order to provide efficient scheduling of tasks.
As a result, many computer algorithms have been proposed to efficiently sort sets of data elements. Some of these algorithms, such as the well known quick sort and merge sort, are designed to work in a sequential processor environment, in which sorting is performed by a single central processing unit (CPU). With the use of only a single processor at a time, those skilled in the art will recognize that the theoretical limit of complexity, or minimum number of required operations, that can be achieved in sorting $n$ items using comparison-based sorting is $O(n \log_2 n)$. By the same token, the speed with which these algorithms are executed is also limited by the speed of the single CPU performing the work. Consequently, for sorting large sets of data, the sequential processing environment offers little utility.
A parallel processor (or multiprocessor) environment offers the ability to increase computing throughput so that large problems can be solved in a reasonable amount of time. Parallel processing thus offers the ability to more efficiently sort large sets of data. Generally speaking, two structures exist for enabling communication between processors in a parallel processor environment, message passing and shared memory. In a message passing structure, each CPU is associated with an independent memory, such as random access memory (RAM), and information is passed from processor to processor by hard wire connections between the processors. A message passing structure, which is a true network, thus generally requires specialized hardware, or is at least optimized for use in a specialized hardware environment. In contrast, in a shared memory structure, multiple CPUs are hard-wired through a bus to a single shared memory bank, and the processors read from and write to particular areas of that memory. Due to the substantial absence of specialized hard-wire connections between processors for optimal performance, the shared memory structure enables simpler, more efficient swapping of data and scaling of operations.
The present invention relates to an improved system for sorting a sequence of substantially $2^k$ randomly ordered "keys" or data elements in a parallel processing structure. In 1968, while working for Goodyear Aerospace Corporation, Kenneth E. Batcher first proposed a sorting network that employed a parallel processor structure. Batcher's network (which, for reference, may be referred to as a "Batcher network") was described in K. E. Batcher, "Sorting networks and their applications," Spring Joint Computer Conference, AFIPS proceedings vol 32, 1968 Washington, D.C.: Thompson, pp. 307-314. The Batcher sorting network was designed to monotonically sort a sequence of data of length $2^k$.
Batcher's original network was based on an arrangement of compare-exchange (CE) modules that each received two inputs, A and B, and produced two outputs, L and H. The L output represented the minimum of inputs A and B, and the H output represented the maximum of inputs A and B. FIG. 1 illustrates the structure of a Batcher sorting network for a data set of 8 elements, as depicted in Harold S. Stone, "Parallel Processing with the Perfect Shuffle," IEEE Transactions on Computers, vol. C-20 number 2, February 1971 IEEE pp. 153-161. Batcher's network essentially operates by rearranging the data into "bitonic" form and then recursively reconstructing the data into a fully sorted monotonic set. As those of ordinary skill in the art will understand, a bitonic sequence is the juxtaposition of two monotonic sequences, one ascending and the other descending; a sequence remains bitonic even if it is split anywhere and the two parts are interchanged.
Batcher's network was based on a theory that if a bitonic sequence of $2n$ numbers, $a_1, a_2, \ldots, a_{2n}$, is split into two sequences,

$$\min(a_1, a_{n+1}),\ \min(a_2, a_{n+2}),\ \ldots,\ \min(a_n, a_{2n}) \tag{1}$$

and

$$\max(a_1, a_{n+1}),\ \max(a_2, a_{n+2}),\ \ldots,\ \max(a_n, a_{2n}) \tag{2}$$

then each of these sequences is also bitonic and no number of (1) is greater than any number of (2). Based on this theory, Batcher determined that a bitonic sorting network for $2n$ numbers can be constructed from $n$ CE elements and two bitonic sorters for $n$ numbers. In turn, by using recursive construction, Batcher discovered that a number of such CE modules properly configured in a message passing structure could enable a given sized data set to be monotonically sorted.
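The split described by sequences (1) and (2) can be illustrated with a minimal Python sketch. The function name `bitonic_split` and the sample input are assumptions made for illustration only; they do not appear in Batcher's paper.

```python
def bitonic_split(seq):
    """Split a bitonic sequence of length 2n into the element-wise minima
    (sequence 1) and maxima (sequence 2) of its two halves."""
    n = len(seq) // 2
    lows = [min(seq[i], seq[i + n]) for i in range(n)]
    highs = [max(seq[i], seq[i + n]) for i in range(n)]
    return lows, highs


# Hypothetical bitonic input: ascending, then descending.
example = [1, 4, 6, 8, 7, 5, 3, 2]
lows, highs = bitonic_split(example)
print(lows)   # [1, 4, 3, 2]
print(highs)  # [7, 5, 6, 8]
```

In this example both halves are again bitonic, and the largest element of the first (4) does not exceed the smallest element of the second (5), which is the property the recursive construction relies on.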
As shown in FIG. 1, the Batcher sorting network employs a series of CE stages or ranks, where, at each stage, the data is placed in a specified permutation and then pairs of data elements are compared and exchanged in a predetermined fashion. By design, most of the CE modules in the Batcher network sort data low-to-high, so that the top output of the module is the low element L of the pair and the bottom output is the high element H of the pair. In order to achieve proper sorting, however, Batcher required certain of the CE modules to produce reversed outputs, where the top output was the high H of the pair, and the bottom output was the low L of the pair. For instance, in FIG. 1, the shaded CE modules represent the reversed CE outputs required by Batcher, whereas the unshaded modules represent ordinary low-to-high outputs.
To sort an array $A[0 \ldots N-1]$ of $N$ keys in non-decreasing order, where $N = 2^k$, Batcher's network could be implemented on any ideal parallel random access machine (PRAM) that uses a shared memory architecture. The PRAM would thus have $N/2$ processors each loaded with the algorithm and each identified by a processor ID (or "PID") ranging from 0 to $N/2 - 1$. An algorithm of this type may then appear, for instance, as follows:
TABLE 1
BATCHER'S BITONIC SORT
______________________________________
Bitonic Sort (A: Array; N: Integer)
Var rev, del, Q, R: Integer
    K1, K2: Key
    Rflag, Sflag: Boolean
Begin
    rev := 2
    While rev ≤ N do
        If 2*PID/rev is odd then Rflag := True
        Else Rflag := False
        Endif
        del := rev/2
        While del ≥ 1 do
            Q := PID/del * del
            R := PID - Q
            K1 := A[2*Q + R]
            K2 := A[2*Q + R + del]
            Sflag := Rflag
            If K1 > K2 then Sflag := not Sflag
            End if
            If Sflag then swap K1 and K2
            End if
            A[2*Q + R] := K1
            A[2*Q + R + del] := K2
            del := del/2
        End {while del ≥ 1}
        rev := 2*rev
    End {while rev ≤ N}
End {Bitonic Sort}
______________________________________
In this algorithm, 2*Q+R and 2*Q+R+del represent the data movement that would occur if input to the physical Batcher sorting network is treated as an array. Thus, these terms describe the physical hard-wired connections of the Batcher sorting network. As those of ordinary skill in the art will appreciate, the foregoing algorithm works in part by cycling through permutations of the data in a specified sequence, and comparing and exchanging data elements where necessary in order to achieve a monotonically sorted output set.
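The index arithmetic of Table 1 can be exercised with a sequential simulation in which a loop over `pid` stands in for the N/2 processors. This is a sketch under stated assumptions: integer (floor) division is assumed wherever Table 1 divides, the name `bitonic_sort` is illustrative, and the processors of one rank are simulated one after another, which is safe only because the pairs touched within a rank are disjoint.

```python
def bitonic_sort(a):
    """Sequentially simulate Table 1 on a list whose length is a power of two.

    Each value of pid plays the role of one of the N/2 processors.
    The list is sorted in place into non-decreasing order and returned.
    """
    n = len(a)
    assert n >= 2 and n & (n - 1) == 0, "length must be a power of two"
    rev = 2
    while rev <= n:
        delta = rev // 2  # 'del' in Table 1 (del is a Python keyword)
        while delta >= 1:
            for pid in range(n // 2):
                # Rflag: reversed output when 2*PID/rev is odd.
                reverse = (2 * pid // rev) % 2 == 1
                q = (pid // delta) * delta
                r = pid - q
                lo, hi = 2 * q + r, 2 * q + r + delta
                # Sflag is Rflag, toggled when K1 > K2; swap when Sflag is set.
                if (a[lo] > a[hi]) != reverse:
                    a[lo], a[hi] = a[hi], a[lo]
            delta //= 2
        rev *= 2
    return a


print(bitonic_sort([5, 2, 7, 1, 8, 3, 6, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```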
Significantly, Batcher's bitonic sorting network required the entire sequence of numbers to pass through a total of approximately $\frac{1}{2}(\log_2 N)^2$ stages or ranks of CE modules, giving Batcher's network an order of complexity, or worst-case efficiency, of $O((\log_2 N)^2)$. Further, Batcher's network would require a total of $(p^2 - p + 4)\,2^{p-2}$ CE modules, where $N = 2^p$ or $p = \log_2 N$. Additionally, Batcher's network theoretically required potentially a huge number of CPUs to be hard wired together in a predefined configuration for a given sized data set, or, alternatively, custom software to be developed for handling each sized data set. As a result, Batcher's network was impractical for sorting large data sets, because such a network would, at worst, require an exceptionally large number of processors hard wired in a configuration designed to sort and, at best, lead to cumbersome and inefficient simulation by software.
In 1971, Harold Stone described a new bitonic sorting network, the "shuffle-exchange network." Harold S. Stone, "Parallel Processing with the Perfect Shuffle," IEEE Transactions on Computers, vol. C-20 number 2, February 1971 IEEE pp. 153-161. The shuffle-exchange network was said to enable operation of the Batcher sorting algorithm while eliminating all but one of the CE module ranks described by Batcher. In particular, instead of requiring a set of data elements to pass through multiple hard-wired stages of compare-exchange modules, Stone developed a single network, called the "perfect shuffle," through which a set of data can be passed any number of times and by which the set of data could be shuffled into the necessary "Batcher" permutations before subjecting the data set to the comparison-exchanges required by Batcher.
Stone's system was designed to shuffle a set of data, similar to shuffling a deck of cards, in order to change the permutations of adjacent data elements. Further, Stone devised a control mask made of bits that would be input to the CE modules and would dictate whether the outputs of given CE modules would be ordered high-to-low, or low-to-high, so as to enable implementation of the bitonic sequencing required by the Batcher network. In this way, it became theoretically possible to fully execute the Batcher sorting algorithm without the necessity of passing through multiple hard-wired or custom designed stages, but rather with the use of a single generic shuffle-exchange network.
In an effort to maximize sorting efficiency, Stone's shuffle-exchange network required a number of processors equal to at least the number of data items to be sorted. The shuffle-exchange network then employed a "perfect shuffle" pattern, by which the outputs of each processor are connected to the inputs of specified processors. FIG. 2 depicts the shuffle-exchange network as illustrated by Stone. As shown in FIG. 2, those of ordinary skill in the art will appreciate that, for a data set $a_0, a_1, a_2, \ldots, a_{N-1}$ bearing index numbers $i = 0, 1, 2, \ldots, N-1$, the perfect shuffle pattern defines the input $p(i)$ of each processor as follows:

$$p(i) = \begin{cases} 2i, & 0 \le i \le N/2 - 1 \\ 2i + 1 - N, & N/2 \le i \le N - 1 \end{cases}$$

By passing data items through the perfect shuffle, the items are thus shuffled like a deck of cards in order to enable subsequent comparisons of data elements to be made in different permutations. According to the shuffle-exchange network, between successive passes through Batcher's CE units, the data is shuffled one or more times by the perfect shuffle to obtain the required starting Batcher permutation on which one or more compare-exchanges are to be performed. Before a data set has been subjected to the shuffle-exchange network, each element or key in the set may be represented by an index number, such that in a set of 8 keys, for example, the keys are initially represented by the indexes 000, 001, 010, 011, 100, 101, 110, and 111. Stone recognized that each time a set of data items is passed through the perfect shuffle, the binary representations of the index numbers for each item are cyclically shifted once to the right. Thus, with a set of 8 numbers for instance, the index numbers shift as follows for successive passes through the perfect shuffle:
TABLE 2
INDEX NUMBER SHIFTS
______________________________________
Decimal   Binary   After    After    After
Index     Index    Pass 1   Pass 2   Pass 3
______________________________________
0         000      000      000      000
1         001      100      010      001
2         010      001      100      010
3         011      101      110      011
4         100      010      001      100
5         101      110      011      101
6         110      011      101      110
7         111      111      111      111
______________________________________
Noticeably, after the sequence of numbers passes through the perfect shuffle $\log_2 N$ times, the index numbers return to their initial positions. As a result, it can be shown that only $\log_2 N$ passes through the perfect shuffle are required in order to arrange the numbers in all necessary permutations, rather than $(\log_2 N)^2$ passes through each CE rank as required by Batcher. Thus, given a set of 8 numbers, a total of only 3 passes through the perfect shuffle are required in order to arrange the numbers in all permutations required by Batcher's network.
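The index shifts of Table 2 can be reproduced with a short Python sketch. The function name `perfect_shuffle` is an assumption for illustration; the sketch models the shuffle as an interleaving of the two halves of a list, in the manner of riffling a deck of cards.

```python
def perfect_shuffle(items):
    """Interleave the first and second halves of an even-length list."""
    half = len(items) // 2
    out = []
    for first, second in zip(items[:half], items[half:]):
        out.extend((first, second))
    return out


# Track which original index sits at each position after each pass (N = 8).
indexes = list(range(8))
for pass_number in range(1, 4):
    indexes = perfect_shuffle(indexes)
    print(pass_number, [format(i, "03b") for i in indexes])
# 1 ['000', '100', '001', '101', '010', '110', '011', '111']
# 2 ['000', '010', '100', '110', '001', '011', '101', '111']
# 3 ['000', '001', '010', '011', '100', '101', '110', '111']
```

Each printed row corresponds to one "After Pass" column of Table 2, and the third pass restores the original order.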
After successive passes through the perfect shuffle, each pair of index numbers differs by only one bit, representative of a decimal difference of $2^{n-m}$, where $m$ represents the number of passes through the perfect shuffle. Thus, after successive passes through the perfect shuffle, the difference between index numbers of the elements in each pair changes according to the sequence $2^{n-1}, 2^{n-2}, \ldots, 2^0$, where $n = \log_2 N$. Take, for instance, 8 data items of which the first two index number pairs are 0-1 and 2-3, or 000-001 and 010-011, as shown in Table 2. Before the data is passed through the perfect shuffle, each pair of index numbers differs by a decimal value of 1, which may be referred to as "1-apart." After one pass through the perfect shuffle, the first two pairs become 0-4 and 1-5, or 000-100 and 001-101, so that the indexes differ by a decimal value of 4, which may be referred to as "4-apart." In turn, after a second pass, the indexes differ by a decimal value of 2, which may be referred to as "2-apart." Finally, after another pass through the perfect shuffle, the indexes in each pair again differ by the decimal value of 1 and are therefore again "1-apart."
From another perspective, after each pass through the perfect shuffle, the index numbers of the keys in each pair can be seen to differ in only one bit position. This bit position may be referred to as the "pivot bit" or as the "node number" of the processor network. Thus, before the first pass shown above, each respective index number pair differs in only the 1 pivot bit position (for instance, 010-011); after the first pass, the index number pairs differ in the 4 pivot bit position (for instance, 001-101); after the second pass, the index number pairs differ in the 2 pivot bit position (for instance, 100-110); and after the third pass, the index number pairs again differ in the 1 pivot bit position. Accordingly, at these stages, the pivot bits are respectively 1, 4, 2 and 1. A similar sequence of pivot bits can be derived for a data set of any length. For instance, for a data set of 16 numbers, the sequence of pivot bits would be 1, 8, 4, 2, 1. More generally, for a sequence of $2^k$ data elements, the corresponding sequence of pivot bits would be $2^0, 2^{k-1}, 2^{k-2}, 2^{k-3}, \ldots, 2^0$.
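The pivot-bit sequence can be checked numerically. The sketch below is illustrative only: the names `perfect_shuffle` and `pivot_bit_sequence` are assumptions, and the pivot bit is computed as the exclusive-OR of the two index numbers in each adjacent pair.

```python
def perfect_shuffle(items):
    """Interleave the first and second halves of an even-length list."""
    half = len(items) // 2
    return [x for pair in zip(items[:half], items[half:]) for x in pair]


def pivot_bit_sequence(k):
    """Return the pivot bit before the first pass and after each of k passes
    through the perfect shuffle, for a data set of 2**k index numbers."""
    indexes = list(range(2 ** k))
    pivots = []
    for _ in range(k + 1):
        # XOR of each adjacent pair isolates the bit in which the pair differs.
        diffs = {indexes[j] ^ indexes[j + 1] for j in range(0, len(indexes), 2)}
        assert len(diffs) == 1, "every pair should differ in the same bit"
        pivots.append(diffs.pop())
        indexes = perfect_shuffle(indexes)
    return pivots


print(pivot_bit_sequence(3))  # [1, 4, 2, 1]
print(pivot_bit_sequence(4))  # [1, 8, 4, 2, 1]
```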
Stone further recognized that, as discussed above, Batcher's sorting network called for a sequence of comparison-exchanges in which the pivot bits of the data elements for each successive comparison follow the sequence $i_0, i_1, i_0, i_2, i_1, i_0, \ldots, i_{m-1}, i_{m-2}, \ldots, i_1, i_0$. Phrased differently, Batcher's network requires several subsequences of useful comparisons to be performed. The first subsequence calls for a $2^0$-apart permutation. The second subsequence calls for $2^1$-apart and then $2^0$-apart permutations, and the $m$th subsequence calls for permutations of $2^{m-1}$-apart, $2^{m-2}$-apart, $\ldots$, $2^1$-apart, and $2^0$-apart. These subsequences thus must begin with starting permutations having index number differences, or pivot bits, of $2^0, 2^1, \ldots, 2^{m-2}$, and $2^{m-1}$.
In contrast, however, as established above, Stone's perfect shuffle gives rise to sequential index number differences of $2^{n-1}, 2^{n-2}, 2^{n-3}, \ldots, 2^0$, which is the reverse of the order required as starting permutations for Batcher's algorithm. Consequently, in order to map the shuffle-exchange network onto Batcher's network, Stone recognized that it would be necessary at times to first perform a sequence of redundant shuffles in order to place the data in the appropriate permutation for performing each stage of the Batcher compare-exchange. These shuffles are referred to as "redundant," because the only purpose served by the shuffle is to rearrange the permutation in preparation for subjecting the data set to the compare-exchanges and shuffles required by each stage of the Batcher network. Only after performing any necessary redundant shuffles would the Stone network then perform the series of compare-exchanges required by Batcher. As one example, again assume an 8 element data set. In order to reach a pivot bit of 2 for the start of the second rank of Batcher's network, the data set would have to pass through Stone's perfect shuffle two extra times. Beginning with a pivot bit of 1 (which, as noted above, is the initial pivot bit of an unsorted set), the first pass through the perfect shuffle would change the pivot bit to 4, and the second pass would change the pivot bit to 2. According to Batcher, necessary comparison-exchanges may then be performed on each adjacent pair of data elements for the given rank.
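The bookkeeping behind these extra shuffles can be sketched as follows. The sketch assumes only what is stated above, namely that each pass through the perfect shuffle moves the pivot bit cyclically through the sequence 1, $2^{m-1}$, $2^{m-2}$, and so on back to 1. The function names `shuffles_to_reach` and `shuffle_schedule` are hypothetical.

```python
def shuffles_to_reach(pivot, target, m):
    """Count the perfect shuffles needed to move from one pivot bit to another
    for a data set of 2**m items."""
    valid = {1 << b for b in range(m)}
    if pivot not in valid or target not in valid:
        raise ValueError("pivot bits must be powers of two below 2**m")
    count = 0
    while pivot != target:
        # One shuffle: 1 wraps around to 2**(m-1); otherwise the pivot halves.
        pivot = pivot >> 1 if pivot > 1 else 1 << (m - 1)
        count += 1
    return count


def shuffle_schedule(m):
    """Pair each pivot bit in Batcher's sequence (1; 2, 1; 4, 2, 1; ...) with
    the number of shuffles needed to reach it from the previous pivot bit."""
    required = [1 << b for stage in range(m) for b in range(stage, -1, -1)]
    schedule, current = [], 1
    for target in required:
        schedule.append((target, shuffles_to_reach(current, target, m)))
        current = target
    return schedule


print(shuffles_to_reach(1, 2, 3))  # 2
print(shuffle_schedule(3))
# [(1, 0), (2, 2), (1, 1), (4, 1), (2, 1), (1, 1)]
```

For the 8 element example, the second entry shows the two passes needed to move from a pivot bit of 1 to a pivot bit of 2, by way of 4.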
Stone also recognized the above-discussed requirement in Batcher's sorting network to reverse the outputs from certain CE modules. Consequently, in addition to mapping the permutations required for the Batcher algorithm, as noted above, Stone also described a set of mask bits, or signal bits, each of which was to indicate whether a given CE module receiving the respective mask bit would produce outputs that were ordered high-to-low, or low-to-high. Specifically, supplying a mask bit of 0 to a CE module would result in a top output of L and a bottom output of H, whereas supplying a mask bit of 1 to a CE module would result in a top output of H and a bottom output of L. Theoretically, applying a set of appropriate mask bits (also referred to as a control mask) at each stage of the perfect shuffle would then provide the necessary reversal of compare-exchange outputs as required by Batcher's network.
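The effect of a mask bit on a single CE module can be shown in a few lines of Python. The function name `compare_exchange` and its signature are assumptions for illustration; the behavior follows the description above, with a mask bit of 0 giving (L, H) and a mask bit of 1 giving (H, L) as the (top, bottom) outputs.

```python
def compare_exchange(a, b, mask_bit=0):
    """Model one CE module: return its (top, bottom) outputs.

    mask_bit 0 -> top is the low element L, bottom is the high element H.
    mask_bit 1 -> outputs reversed: top is H, bottom is L.
    """
    low, high = min(a, b), max(a, b)
    return (high, low) if mask_bit else (low, high)


print(compare_exchange(7, 3, 0))  # (3, 7)
print(compare_exchange(7, 3, 1))  # (7, 3)
```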
In an effort to develop the appropriate mask bits for each stage of the Batcher sorting network, Stone drew upon the relationship between the pivot bits and the Batcher sorting stage. More particularly, as described above, the pivot bit is substantially unique to each stage in the shuffle-exchange network, and the data in each stage of the shuffle-exchange network may be mapped to the required permutation for Batcher's algorithm by performing a known number of redundant shuffles for that stage. Therefore, Stone theorized that the mask bits required for the sequential Batcher stages could be determined based on the pivot bit of the given stage. To this end, Stone developed the following sorting algorithm for the shuffle-exchange network:
TABLE 3
SHUFFLE-EXCHANGE ALGORITHM
______________________________________
COMMENT generate initial control mask in 1-apart position;
R := vector (0, 1, 0, 1, . . . , 0, 1);
mask := R;
COMMENT m = log_2 N;
For i := 1 step 1 until m do
Begin
    mask := mask ⊕ R;
    Shuffle(mask);
End
COMMENT the array DATA contains the items to be sorted;
COMMENT perform compare-exchange on data in 1-apart position;
Compare-Exchange(data)
COMMENT start remaining m-1 stages of sorting network;
COMMENT this may be referred to as the "control loop";
For i := 1 step 1 until m-1 do
Begin
    COMMENT update mask -- generate mask bits for next stage;
    Shuffle(R);
    mask := mask ⊕ R;
    COMMENT perform redundant shuffles to align data to next permutation;
    COMMENT this may be referred to as the "redundant-shuffle loop";
    For j := 1 step 1 until m-1-i do
        Shuffle(data);
    COMMENT perform next sequence of compare-exchange operations;
    COMMENT this may be referred to as the "compare-exchange loop"
    For j := m-i step 1 until m do
    Begin
        Shuffle(data);
        Compare-Exchange(data);
    End;
End i loop
______________________________________
As those of ordinary skill in the art will understand from the foregoing, Stone's algorithm would theoretically operate by first generating a mask scaled to the size of the input data set, and, next for a specified number of repetitions, (i) updating the mask based on a control vector, (ii) performing any necessary redundant data shuffles to achieve the required permutation, and (iii) subjecting the data set to a specified number of shuffles and compare-exchange operations.
More particularly, Stone's algorithm would begin by developing a mask scaled to the size of the input data array, say $2^m$. The algorithm would generate this starting mask by first setting the mask equal to the string 0, 1, 0, 1, . . . , 0, 1 and then, for $m$ repetitions, XORing the mask with control vector R=(0, 1, 0, 1, . . . , 0, 1) and shuffling the mask through the perfect shuffle. Using the resulting mask, Stone next performed a compare-exchange operation on the input data set in the 1-apart permutation, that is, prior to shuffling the data set. In turn, for $m-1$ repetitions, Stone would (i) update the mask by XORing it with a control vector, (ii) perform redundant shuffles on the data set, and (iii) subject the data to the shuffles and compare-exchanges required by Batcher. FIG. 3 sets forth a flow chart depicting these stages in Stone's algorithm.
With reference to the initial mask-generation loop of Stone's algorithm, by generating the initial mask dependent on the length of the data sequence being sorted, Stone's algorithm would further provide scalability. More particularly, by generating an initial mask based on the size of the data set, a rank of processors could in theory be reprogrammed in real time with the compare-exchange settings required to perform a Batcher sort, rather than hard-wiring the settings of the compare-exchange modules at construction time or simulating such a hard-wired environment through complex and inefficient software code. In this way, Stone believed the shuffle-exchange network could be used on any perfect shuffle network without modification. That is, Stone postulated that the shuffle-exchange network would enable any number of items to be sorted without requiring a custom-hardwired or custom-coded configuration.
In theory, Stone's perfect-shuffle network thus provides the ability to obtain all possible permutations necessary to perform a Batcher sort. Further, as one of ordinary skill in the art would appreciate, the shuffle-exchange network would require substantially fewer processors to operate than Batcher's network. In addition, in Batcher's network, each stage requires $N$ connections to communicate with the next stage, whereas the perfect shuffle requires only 3 connections per processor to be fully connected. Assuming a construction cost $C$ per connection, the cost for a Batcher network would then be $\frac{1}{2} C N (\log_2 N)^2$, whereas the cost for a shuffle-exchange network operating on the same number of data elements would be only $3CN$. In view of these factors, as the size of the data set grows, the cost associated with the Batcher network will grow much faster than the cost associated with the shuffle-exchange network.
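The two cost expressions can be compared numerically with a short sketch. The function names and the unit cost C = 1 are assumptions for illustration; the formulas are the two given above.

```python
import math


def batcher_cost(n, c=1.0):
    """Connection cost of a Batcher network: (1/2) * C * N * (log2 N)**2."""
    return 0.5 * c * n * math.log2(n) ** 2


def shuffle_exchange_cost(n, c=1.0):
    """Connection cost of a shuffle-exchange network: 3 * C * N."""
    return 3 * c * n


for n in (8, 1024):
    print(n, batcher_cost(n), shuffle_exchange_cost(n))
# 8 36.0 24.0
# 1024 51200.0 3072.0
```

Under these formulas the ratio between the two costs is $(\log_2 N)^2 / 6$, so the gap widens steadily as $N$ grows.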
In 1992, Batcher described an additional network called the "static-perfect-shuffle." K. E. Batcher, "Low-Cost Flexible Simulation with the Static Perfect Shuffle Network," 4th Symposium on the Frontiers of Massively Parallel Computation, McLean, Va., 1992, pp. 434-436. The static-perfect-shuffle theoretically enables a perfect shuffle to be performed either forward or backward. That is, while the "perfect-shuffle" enabled shuffling of data giving rise to the sequence 1-apart, 2-apart, 4-apart and so on, Batcher envisioned an "unshuffle" operation that would reverse the shuffle sequence. The unshuffle operation would theoretically follow a path directly opposite of that illustrated in FIG. 2. Batcher did not, however, suggest any practical applications for his static-perfect-shuffle network. Further, it is believed that the "unshuffle" operation has not been extended to a data sorting network.
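The relationship between a shuffle and an unshuffle can be illustrated with a minimal sketch that models both as list permutations. The function names are assumptions for illustration and are not taken from Batcher's paper; the sketch shows only that the unshuffle undoes one pass through the perfect shuffle.

```python
def perfect_shuffle(items):
    """Interleave the first and second halves of an even-length list."""
    half = len(items) // 2
    return [x for pair in zip(items[:half], items[half:]) for x in pair]


def unshuffle(items):
    """Reverse one perfect shuffle: gather the even positions, then the odd."""
    return items[0::2] + items[1::2]


data = list(range(8))
shuffled = perfect_shuffle(data)
print(shuffled)             # [0, 4, 1, 5, 2, 6, 3, 7]
print(unshuffle(shuffled))  # [0, 1, 2, 3, 4, 5, 6, 7]
```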