1. Field of the Invention
Embodiments of the present invention relate generally to parallel processing and more specifically to a radix sort algorithm for graphics processing units.
2. Description of the Related Art
A typical computer system includes, without limitation, a central processing unit (CPU), a parallel processing subsystem, such as a graphics processing unit (GPU), a display device, and one or more input devices. The user may interact with a software application executing within the computer system by operating at least one input device and observing the results on the display device. The CPU typically executes the overall structure of the software application and configures the GPU to perform specific tasks. In current technology, the CPU tends to offer more general functionality using a relatively small number of large execution threads, while the GPU is capable of very high performance using a relatively large number of small, parallel execution threads on dedicated hardware processing units.
A typical software application may include certain operations designed to execute on the CPU, while other operations execute on the GPU. For example, the CPU may be configured to run the graphical user interface (GUI) for the application and perform certain application-specific logic, whereas the GPU may be configured to perform computationally intensive tasks, such as rendering graphics images. Software applications typically execute as much computational work on the GPU as possible to improve overall system performance. However, certain common operations are not easily or efficiently mapped to the parallel architecture of the GPU. When the application performs an operation that does not have an efficient mapping to the parallel architecture of the GPU, a “work-inefficient” processing step is commonly needed, wherein the GPU processes related data with relatively low overall processor utilization for the duration of the processing step. Alternatively, the CPU may perform the processing step instead of the GPU. In both cases, the overall application performance may suffer.
As is well known, sorting lists of data is one common processing operation used in a wide range of applications. However, many conventional sorting algorithms tend to be predominantly serial in execution, making these algorithms less efficient at exploiting the parallel GPU architecture. In general, therefore, sorting is performed by the CPU rather than the GPU, even when the sort input data is generated by the GPU. For GPU-based applications, waiting for the CPU to perform a sort operation can introduce inefficiency and significantly reduce overall application performance.
As the foregoing illustrates, what is needed in the art is a technique for performing efficient sort operations on a GPU.