Field of the Invention
Embodiments of the invention relate generally to parallel thread program execution, and more specifically to performing scan and reduction operations across multiple threads.
Description of the Related Art
Conventional parallel processing architectures support execution of multiple threads. More recent parallel processing architectures allow parallel threads to execute independently and support specific instructions for synchronizing the independently executing threads. To perform a scan or reduction operation across multiple threads in current systems, each thread contributes values to the operation by writing the values to a memory shared by the threads, synchronizes with the other threads, reads the values written by the other threads from the shared memory, and then either computes the aggregated scan or reduction result or receives the aggregated result. The independently executing threads contribute their values serially before the threads are synchronized. Consequently, the scan or reduction operation typically requires several clock cycles to complete, since each thread must access the shared memory to contribute a value, synchronize with (wait for) the other threads, and read several values from the shared memory to compute a final result.
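The conventional sequence described above (write to shared memory, synchronize, read back, compute) can be illustrated with a minimal CPU-side sketch in Python. The sketch is not drawn from any particular architecture. The function name shared_memory_scan_reduce, the list standing in for the shared memory, the use of threading.Barrier in place of a synchronization instruction, and the choice of addition as the operator are all assumptions made for illustration.

```python
import threading


def shared_memory_scan_reduce(values):
    """Hypothetical sketch: one thread per value; addition is the assumed operator."""
    n = len(values)
    shared = [0] * n                 # stands in for the memory shared by the threads
    barrier = threading.Barrier(n)   # stands in for the synchronization instruction
    scan = [None] * n                # per-thread inclusive scan result
    total = [None] * n               # per-thread reduction result

    def worker(tid):
        # Step 1: contribute a value by writing it to the shared memory.
        shared[tid] = values[tid]
        # Step 2: synchronize (wait) until every thread has written.
        barrier.wait()
        # Step 3: read the values written by the other threads.
        seen = list(shared)
        # Step 4: compute the aggregated results locally.
        scan[tid] = sum(seen[: tid + 1])
        total[tid] = sum(seen)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return scan, total


if __name__ == "__main__":
    scan, total = shared_memory_scan_reduce([3, 1, 4, 1, 5])
    print(scan)   # [3, 4, 8, 9, 14]
    print(total)  # [14, 14, 14, 14, 14]
```

In this sketch every thread performs a shared-memory write, a wait at the barrier, and a read of several values before any result is available, which mirrors the multi-step cost attributed to the conventional approach.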
Accordingly, what is needed in the art is an improved technique for performing a scan or reduction operation across multiple threads executing independently.