1. Field of the Invention
Embodiments of the present invention relate generally to parallel processing and more specifically to an instruction-efficient algorithm for parallel scan using initialized memory regions to replace conditional statements.
2. Description of the Related Art
A parallel scan, also commonly known as a parallel prefix sum when addition is the associative operator, is a useful building block for many other parallel algorithms, such as those used for sorting data and generating data structures. The parallel scan is particularly useful on modern single-instruction multiple-data (SIMD) processors, such as graphics processing units (GPUs), which are being deployed to solve an increasingly general set of computational tasks.
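By way of illustration only, the result that a scan produces can be shown with a short sequential reference. This is a minimal sketch, not a parallel implementation. It assumes an inclusive scan, in which each output element combines all input elements up to and including its own position, and the helper name `inclusive_scan` is hypothetical.

```python
from itertools import accumulate
import operator


def inclusive_scan(values, op=operator.add):
    """Sequential reference: output[i] = values[0] op values[1] op ... op values[i].

    `op` is assumed to be associative; with addition this is a prefix sum.
    """
    return list(accumulate(values, op))


# Prefix sum of eight input elements.
print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# → [3, 4, 11, 11, 15, 16, 22, 25]
```

A parallel scan is expected to produce the same output array as this reference, while dividing the work among many threads.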
A conventional parallel scan algorithm distributes a set of N input elements to a set of processing threads, which generate an array of N output elements through one or more processing passes. In a first processing pass, each processing thread typically accesses two values in the array of N input elements to generate one output element that is stored in the array of N output elements. In each subsequent processing pass, a given thread conventionally accesses two values in the array of N output elements to generate one new output element. In a typical implementation, each thread retrieves one data element from an array index associated with the thread identification number and one data element from an index displaced from that array index by an offset that increases with each processing pass. Because each thread computes an offset index for each processing pass, the thread must avoid accessing memory outside the boundaries established for the array. To that end, each thread performs bounds checking using one or more conditional operators. When an index is out of bounds, an identity value (e.g., zero for addition) is used in place of the out-of-bounds element in any related computations.
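The conventional approach described above may be sketched as follows. This is a sequential simulation under stated assumptions, not a definitive implementation. The offset is assumed to double on each pass, each pass is assumed to read from one buffer and write to another, and the inner loop stands in for N threads running in lockstep. The function name `conventional_scan` and its parameters are hypothetical.

```python
def conventional_scan(values, identity=0, op=lambda a, b: a + b):
    """Simulate a multi-pass scan in which every thread bounds-checks its offset index."""
    n = len(values)
    src = list(values)          # the first pass reads the input elements
    offset = 1                  # assumption: the offset doubles on each pass
    while offset < n:
        dst = [identity] * n    # assumption: separate read and write buffers per pass
        for tid in range(n):    # each iteration models one thread
            partner = tid - offset
            # Bounds check: the conditional that every thread evaluates in every pass.
            # An out-of-bounds index yields the identity value instead of a load.
            left = src[partner] if partner >= 0 else identity
            dst[tid] = op(left, src[tid])
        src = dst               # later passes read the previous pass's output
        offset *= 2
    return src


print(conventional_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# → [3, 4, 11, 11, 15, 16, 22, 25]
```

In this sketch the conditional expression on `partner` is evaluated N times per pass, which is the per-thread, per-pass bounds-checking cost that the paragraph above describes.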
In a SIMD processing model, any thread for which a conditional test fails must execute one or more placeholder (null) instructions to maintain instruction-level synchronization with the remaining threads in the associated thread group. When a conventional parallel scan operation is implemented in a SIMD processing model, the instructions that each thread executes to perform bounds checking in each pass reduce the overall efficiency of every thread in the thread group.
As the foregoing illustrates, what is needed in the art is a technique for efficiently performing a parallel scan operation in a SIMD multi-processor, such as a modern GPU.