1. Field of the Invention
Embodiments of the present invention relate generally to parallel processing and more specifically to a work-efficient parallel prefix sum algorithm for graphics processing units.
2. Description of the Related Art
A typical computer system includes, without limitation, a central processing unit (CPU), a graphics processing unit (GPU), a display device, and one or more input devices. The user interacts with a software application executing within the computer system by operating at least one input device and observing the results on the display device. The CPU typically executes the overall structure of the software application and configures the GPU to perform specific tasks. In current technology, the CPU tends to offer more general functionality using a relatively small number of large execution threads, while the GPU is capable of very high performance using a relatively large number of small, parallel execution threads on dedicated hardware processing units.
A typical software application may include certain functionality designed to execute on the CPU, while other functions execute on the GPU. For example, the CPU may be configured to run the graphical user interface (GUI) for the application and perform certain application-specific logic, whereas the GPU may be configured to perform computationally intensive tasks, such as rendering graphics images. Software applications typically execute as much computation on the GPU as possible to improve overall system performance. However, certain types of common operations are not easily or efficiently mapped to the parallel architecture of the GPU. When the application performs a computation that does not have an efficient mapping to the parallel architecture of the GPU, a “work-inefficient” processing step is commonly needed, wherein the GPU processes related data with relatively low overall processor utilization for the duration of the processing step. Alternatively, the CPU may perform the processing step instead of the GPU. Whenever the GPU processor utilization is low or the CPU needs to perform certain processing steps on behalf of the GPU, overall performance and efficiency are reduced.
As is well known, one common processing step used in a wide range of applications is a “prefix sum” operation. A prefix sum operation generates a list in which each element is the running accumulated sum over a list of input elements. For example, the prefix sum of the list {1, 2, 3, 4} is the list {1, 1+2, 1+2+3, 1+2+3+4}, or simply {1, 3, 6, 10}. In conventional systems, running prefix sum operations on GPUs is inherently work-inefficient. Therefore, each time a prefix sum operation is performed, the work-efficiency of the system is diminished, reducing overall performance. For larger lists, the reduction in performance may be greater, limiting the usefulness of this common operation in GPU-based applications.
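By way of illustration only, the running sum described above may be sketched as a sequential loop, as it might execute on a single CPU thread. The function name prefix_sum is hypothetical and does not appear in the original description. The sketch shows only what the operation computes, not how it may be mapped to a GPU.

```python
def prefix_sum(values):
    """Return the running accumulated sum over a list of elements."""
    result = []
    total = 0
    for v in values:
        total += v            # sum of all elements seen so far, including v
        result.append(total)  # output element i depends on inputs 0..i
    return result


if __name__ == "__main__":
    print(prefix_sum([1, 2, 3, 4]))  # [1, 3, 6, 10]
```

In this sequential form, each output element is computed from the previous running total, so the loop iterations cannot simply be distributed across parallel threads without restructuring the computation.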
As the foregoing illustrates, what is needed in the art is a technique for performing efficient prefix sum operations on a GPU.