Field of the Invention
Embodiments of the present invention relate generally to computer science and, more specifically, to a single-pass parallel prefix scan with dynamic look-back.
Description of the Related Art
Prefix scan is a well-known computation primitive used in a wide variety of areas. Some notable applications of prefix scan include adders, recurrence solvers, cooperative allocation, compaction, run-length encoding, and duplicate removal. In operation, given a list of input elements and a binary reduction operator, a prefix scan produces a corresponding output list where each output is the reduction of the elements occurring earlier in the input list. In an inclusive scan, the nth output reduction incorporates the nth input element. In an exclusive scan, the nth output reduction does not incorporate the nth input element, and the first output is the identity element of the reduction operator. For instance, a “prefix sum” is a prefix scan in which the binary reduction operator is an addition operation. Consequently, in a prefix sum, each output number is the sum of the numbers occurring previously in the input list. Thus, for an input list of [8, 6, 7, 5, 3, 0, 9], the inclusive prefix sum is [8, 14, 21, 26, 29, 29, 38] and the exclusive prefix sum is [0, 8, 14, 21, 26, 29, 29].
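By way of illustration only, the inclusive and exclusive prefix sums described above may be sketched in Python as follows. The function names inclusive_scan and exclusive_scan are hypothetical and chosen for this example, and the sketch assumes Python 3.8 or later for the initial argument of itertools.accumulate.

```python
from itertools import accumulate
import operator


def inclusive_scan(xs, op=operator.add):
    """The nth output incorporates the nth input element."""
    return list(accumulate(xs, op))


def exclusive_scan(xs, op=operator.add, identity=0):
    """The nth output incorporates only inputs before the nth element."""
    if not xs:
        return []
    # Seed with the identity and drop the last input so the output
    # has the same length as the input.
    return list(accumulate(xs[:-1], op, initial=identity))


data = [8, 6, 7, 5, 3, 0, 9]
print(inclusive_scan(data))  # [8, 14, 21, 26, 29, 29, 38]
print(exclusive_scan(data))  # [0, 8, 14, 21, 26, 29, 29]
```

Any associative binary operator may be substituted for addition, provided a matching identity element is supplied for the exclusive form.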
In many computer systems, the overall time required to execute a prefix scan is bounded by the time required to execute the memory access operations. Consequently, decreasing the number of memory access operations has the desirable effect of increasing performance and decreasing power consumption.
To scan “n” input elements, the theoretical minimum number of memory accesses is 2*n. Each of the n inputs must be read from memory and each of the n outputs must be written to memory. This lower bound is achieved by the typical sequential implementation of prefix scan, which requires only a single pass through the data. A processor iterates over the input list while accumulating a running aggregate. Before each input is accumulated, the processor assigns the current value of the running aggregate to the corresponding exclusive scan output. The processor performs n−1 reduction operations, n input data read operations, and n output data write operations.
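As a non-limiting sketch, the sequential single-pass exclusive scan described above may be written as shown below. The function name and the read and write counters are assumptions added for illustration; the counters model one memory read per input and one memory write per output. For simplicity the sketch seeds the running aggregate with an identity element, so it invokes the operator n times rather than n−1 times.

```python
import operator


def sequential_exclusive_scan(inputs, op=operator.add, identity=0):
    """Single pass over the data: n reads and n writes."""
    outputs = [identity] * len(inputs)
    running = identity
    reads = 0
    writes = 0
    for i in range(len(inputs)):
        # Assign the running aggregate before accumulating input i.
        outputs[i] = running
        writes += 1
        running = op(running, inputs[i])
        reads += 1
    return outputs, reads, writes


out, reads, writes = sequential_exclusive_scan([8, 6, 7, 5, 3, 0, 9])
print(out)             # [0, 8, 14, 21, 26, 29, 29]
print(reads + writes)  # 14, i.e. 2*n for n = 7
```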
For multi-processor systems, a parallel implementation allows the system to utilize more than one processing element when computing a prefix scan. One common parallel implementation of prefix scan is the “reduce-then-scan” approach. Although this method requires two global passes through the data, it achieves high processor and memory bandwidth utilization. A multi-processor system partitions the input data into segments and assigns each segment to a processor included in the multi-processor system. In the reduction pass, the processors operate in parallel, with each processor computing a reduction of its associated segment. Subsequently, the multi-processor system computes an exclusive prefix scan of the (much-smaller) list of per-segment aggregates. The result is a corresponding list of per-segment prefixes. In the scan pass, the processors again operate in parallel, with each processor computing a prefix scan across its associated segment, seeded with the appropriate exclusive prefix. The reduce-then-scan implementation performs approximately 2*n reduction operations (n−1 per pass), approximately 2*n data read operations (n per pass), and approximately n output data write operations. Thus, the number of memory access operations is approximately 3*n, or 1.5× the theoretical minimum.
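For purposes of illustration only, the reduce-then-scan approach described above may be sketched as follows. The sketch runs the per-segment work in ordinary sequential loops rather than on separate processors, so it shows the data flow and not the parallel performance. The function name reduce_then_scan and the num_segments parameter are hypothetical.

```python
from functools import reduce
import operator


def reduce_then_scan(inputs, num_segments, op=operator.add, identity=0):
    n = len(inputs)
    seg_len = max(1, -(-n // num_segments))  # ceiling division
    bounds = [(lo, min(lo + seg_len, n)) for lo in range(0, n, seg_len)]

    # Reduction pass: first read of every input element.
    aggregates = [reduce(op, inputs[lo:hi], identity) for lo, hi in bounds]

    # Exclusive scan of the much smaller list of per-segment aggregates.
    prefixes = []
    running = identity
    for aggregate in aggregates:
        prefixes.append(running)
        running = op(running, aggregate)

    # Scan pass: second read of every input element, one write per output.
    outputs = [identity] * n
    for (lo, hi), prefix in zip(bounds, prefixes):
        running = prefix  # seed with the segment's exclusive prefix
        for i in range(lo, hi):
            outputs[i] = running
            running = op(running, inputs[i])
    return outputs


print(reduce_then_scan([8, 6, 7, 5, 3, 0, 9], num_segments=3))
# [0, 8, 14, 21, 26, 29, 29]
```

In this sketch each input element is read once in the reduction pass and once in the scan pass, and each output is written once, which corresponds to the approximately 3*n memory access operations noted above.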
Prior efforts at constructing single-pass parallel implementations of prefix scan having approximately the same 2*n memory workload as the theoretical minimum have suffered from processor and memory bandwidth underutilization. These “chained-scan” methods operate by partitioning the input data into segments that are assigned to parallel processors, where each processor reads its associated segment from global memory into its own local memory. Processors proceed in parallel during a local reduction pass in which each processor computes a per-segment aggregate. Each processor then waits for the processor assigned to the preceding segment to communicate a running prefix aggregate. Once the running prefix is made available, the waiting processor combines the prefix with its per-segment aggregate and communicates the updated running prefix to the next processor. Processors that have received their per-segment exclusive prefix are then able to proceed in parallel during a local scan pass in which each processor performs a scan across its local segment, seeded with the exclusive segment prefix. The results are then written out to global memory. The serial dependences between processors cause chained waiting, which prevents high overall system utilization. The performance of the overall computation is thus limited by the latency of inter-processor signaling instead of aggregate memory system bandwidth.
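As a rough sketch under stated assumptions, the chained-scan data flow described above may be modeled with Python threads standing in for processors and threading.Event objects standing in for inter-processor signaling. The names chained_scan, running_prefix, and ready are hypothetical, and the sketch is intended to show the serial dependence between segments rather than to measure performance.

```python
import threading
import operator


def chained_scan(inputs, num_segments, op=operator.add, identity=0):
    n = len(inputs)
    seg_len = max(1, -(-n // num_segments))  # ceiling division
    bounds = [(lo, min(lo + seg_len, n)) for lo in range(0, n, seg_len)]
    k = len(bounds)

    # running_prefix[s] holds the exclusive prefix for segment s.
    running_prefix = [identity] * (k + 1)
    ready = [threading.Event() for _ in range(k + 1)]
    ready[0].set()  # the first segment has no predecessor
    outputs = [identity] * n

    def worker(s):
        lo, hi = bounds[s]
        local = inputs[lo:hi]  # single read into "local memory"

        # Local reduction pass: all workers proceed in parallel.
        aggregate = identity
        for x in local:
            aggregate = op(aggregate, x)

        # Chained wait on the preceding segment's running prefix.
        ready[s].wait()
        prefix = running_prefix[s]
        running_prefix[s + 1] = op(prefix, aggregate)
        ready[s + 1].set()  # signal the next segment

        # Local scan pass, seeded with the exclusive segment prefix.
        running = prefix
        for i, x in enumerate(local):
            outputs[lo + i] = running
            running = op(running, x)

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(k)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs


print(chained_scan([8, 6, 7, 5, 3, 0, 9], num_segments=3))
# [0, 8, 14, 21, 26, 29, 29]
```

Each worker reads its segment once and writes its outputs once, consistent with the approximately 2*n memory workload, but no worker can begin its local scan pass until every preceding worker has signaled, which is the chained waiting noted above.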
Accordingly, what is needed in the art is a more effective approach to performing a parallel prefix scan.