Parallel processor architectures are commonly used to perform a wide array of different computational algorithms. An example of an algorithm that is commonly performed using such architectures is a scan operation (e.g. “all-prefix-sums” operation, etc.). One such scan operation is defined in Table 1.
TABLE 1
[I, a0, (a0 ⊕ a1), . . . , (a0 ⊕ a1 ⊕ . . . ⊕ an−1)]
Specifically, given an array [a0, a1, . . . , an−1] and “I” being an identity element for the operator “⊕”, the array of Table 1 is returned. For example, if the operator “⊕” is an addition operator (for which the identity element is 0), performing the scan operation on the array [3 1 7 0 4 1 6 3] would return [0 3 4 11 11 15 16 22], and so forth. While an addition operator is set forth in the above example, such operator may be any binary associative operator.
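By way of illustration only, the scan operation of Table 1 may be sketched sequentially as follows. The sketch assumes an exclusive form of the scan, in which the identity element occupies the first output position and the output has the same length as the input. The function name `exclusive_scan` and its parameters are hypothetical and chosen solely for this example.

```python
from operator import add

def exclusive_scan(values, op, identity):
    """Sequential reference scan (illustrative assumption, not a parallel method).

    out[0] is the identity; out[i] combines values[0] through values[i-1]
    using the binary associative operator `op`.
    """
    out = []
    running = identity
    for v in values:
        out.append(running)        # emit the combination of all earlier elements
        running = op(running, v)   # fold the current element into the running value
    return out

result = exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3], add, 0)
print(result)  # [0, 3, 4, 11, 11, 15, 16, 22]
```

Any other binary associative operator may be substituted for `add`, provided that the matching identity element is supplied (e.g. `max` with negative infinity).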
To efficiently perform such scan operation on arrays with a large number of elements, the elements may be traversed in a “tree”-like manner. For example, the elements may be viewed as “leaves” which are processed at a first level to generate and temporarily store a second level of elements which include sums of the first elements, etc. Thereafter, such second level of elements may be processed in a similar manner, and so on until a root has been reached.
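The level-by-level traversal described above may be sketched as follows. This is a minimal illustration that shows only the upward pass from the leaves to the root, not a complete scan, and it assumes that the number of leaves is a power of two so that each level pairs up evenly. The function name `reduction_levels` is hypothetical.

```python
from operator import add

def reduction_levels(leaves, op):
    """Return every level of the tree, from the leaves up to the root.

    Assumes len(leaves) is a power of two (an illustrative simplification).
    """
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Each element of the next level combines one adjacent pair of the previous level.
        levels.append([op(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

levels = reduction_levels([3, 1, 7, 0, 4, 1, 6, 3], add)
for depth, level in enumerate(levels):
    print(depth, level)
# 0 [3, 1, 7, 0, 4, 1, 6, 3]
# 1 [4, 7, 5, 9]
# 2 [11, 14]
# 3 [25]
```

Each level depends only on the level beneath it, which is why the pairs within a single level may be combined concurrently while successive levels may not.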
To accommodate such processing using a parallel processor architecture, each array element is assigned to a particular thread of a processor. There are typically a limited number of processors, each with a limited number of threads (that often amount to far fewer than the number of array elements). Further, since the threads share data from one level to the next, each of the foregoing levels of processing must be completely finished before moving on to the next level, etc.
This, in turn, requires a synchronization at each level of processing. In other words, the scan operation must wait for the threads to be assigned and complete the processing of each of the array elements at a particular level before moving on to the next level. For instance, given 1024 elements that are being operated upon by 32 threads capable of operating on 1 element/clock cycle, the above algorithm must wait 32 clock cycles before moving on to the next level of processing. In use, the foregoing synchronization potentially results in idle threads and additional latency.
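The arithmetic behind the foregoing example may be sketched as follows. The sketch assumes that the elements of a level are divided evenly among the threads and that every thread handles one element per clock cycle. The function names `cycles_for_level` and `idle_threads_in_last_pass` are hypothetical and are not drawn from any particular architecture.

```python
import math

def cycles_for_level(num_elements, num_threads, elements_per_thread_per_cycle=1):
    """Clock cycles needed before a level is complete and synchronization can occur."""
    return math.ceil(num_elements / (num_threads * elements_per_thread_per_cycle))

def idle_threads_in_last_pass(num_elements, num_threads):
    """Threads with no element to process during the final pass over a level."""
    remainder = num_elements % num_threads
    return 0 if remainder == 0 else num_threads - remainder

print(cycles_for_level(1024, 32))          # 32
print(idle_threads_in_last_pass(1024, 32)) # 0
print(cycles_for_level(16, 32))            # 1
print(idle_threads_in_last_pass(16, 32))   # 16
```

Under these assumptions, a level holding fewer elements than there are threads (such as a 16-element level processed by 32 threads) still costs a full cycle plus a synchronization while half of the threads remain idle.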