This specification relates to techniques for efficient parallel computation of multivalue reductions using parallel processing hardware.
A reduction is an operation that combines multiple values into a single value. For example, a reduction over 8 values can be performed by computing a single sum over the 8 values. Reduction operations are commonly performed by parallel processing devices, e.g., graphics processing units (GPUs), in order to combine data computed by multiple threads executed by multiple independent processing units of the parallel processing device. The examples described in this specification will commonly refer to the independent processing units as streaming multiprocessors (SMs) having multiple processing cores and to the parallel processing device as a GPU. However, the same techniques can also be implemented on other hardware devices that implement true thread parallelization with multiple independent processing units. Such devices include single instruction, multiple data (SIMD) processors generally, tensor processing units (TPUs), or other application-specific integrated circuits. In addition, where the examples mention the use of a GPU, this does not necessarily imply that graphics data is being processed or produced.
On such parallel processing devices, control over thread parallelization can be provided by program abstractions that define how threads are assigned to be executed by the multiple independent processing units. For clarity of presentation, this specification uses the terminology of common GPU program abstractions, but equivalent program abstractions that control how threads are scheduled on independent processing units can be used for other systems that are not GPUs.
A thread block, or for brevity, a block, is a group of threads that are executed by a single SM. Threads in a block can coordinate by making use of shared memory of the SM. Communication between threads in a block is therefore typically orders of magnitude faster than communication with threads in other blocks.
A warp is a group of threads within a block and in some cases represents the smallest assignable unit of computation for a GPU. Threads within a warp can typically read from registers assigned to other threads in the same warp. Threads in a warp also typically execute instructions in lockstep. Thus, threads within a warp can, for example, fetch data from register locations concurrently. Common warp sizes are 16, 32, or 64 threads, to name just a few examples.
The parallel processing capabilities of a parallel processing device allow single-value reductions to be performed as a series of aggregate operations by reading data in exponentially increasing or decreasing steps or skips. For example, if a warp has 8 threads, each thread can sum from its neighbor one step over, then two steps over, and then four steps over. At the end of this process, one of the threads will have a sum over all values in the original data.
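The skip pattern described above can be illustrated with a minimal host-side simulation. This is a sketch only, not device code: the function name `warp_reduce_sum` is hypothetical, each list element stands in for one thread's register, and the warp size is assumed to be a power of two. Threads whose neighbor would fall outside the warp are assumed to add zero.

```python
def warp_reduce_sum(values):
    """Simulate a lockstep warp sum using skips of 1, 2, 4, ..."""
    n = len(values)
    # Assumption for this sketch: the warp size is a power of two.
    assert n > 0 and n & (n - 1) == 0, "warp size assumed to be a power of two"
    regs = list(values)  # one simulated register per thread
    step = 1
    while step < n:
        # Every thread first reads its neighbor `step` positions over,
        # then all threads add at once, mimicking lockstep execution.
        incoming = [regs[i + step] if i + step < n else 0 for i in range(n)]
        regs = [own + other for own, other in zip(regs, incoming)]
        step *= 2
    # Thread 0 now holds the sum over all original values.
    return regs[0]


total = warp_reduce_sum([1, 2, 3, 4, 5, 6, 7, 8])
```

With 8 simulated threads, the loop runs three times (skips of 1, 2, and 4), after which the first thread's register holds the sum of all 8 inputs.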
However, performing multivalue reductions conventionally requires the serial performance of multiple single-value reductions. This limitation is a processing bottleneck in many real-world applications that have extreme throughput requirements. For example, audio generation neural networks that model raw audio waveforms present significant computational challenges because of the inherently high-throughput nature of raw audio generation. Realistic raw audio generation typically requires many thousands of audio samples to be generated per second, e.g., 24,000 samples per second. In such high-throughput applications, any parallel processing speedups are vital.
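To make the cost of the conventional approach concrete, the following host-side sketch counts the skip steps needed when several single-value reductions are run one after another. The function names `single_value_reduce` and `serial_multivalue_reduce` are hypothetical and are used only for illustration; the code simulates the behavior on lists and is not device code.

```python
def single_value_reduce(values):
    """Return (sum, steps) for one simulated lockstep reduction."""
    regs = list(values)  # one simulated register per thread
    n = len(regs)
    step, steps = 1, 0
    while step < n:
        # Each thread adds the value held `step` positions over (0 if out of range).
        regs = [regs[i] + (regs[i + step] if i + step < n else 0) for i in range(n)]
        step *= 2
        steps += 1
    return regs[0], steps


def serial_multivalue_reduce(columns):
    """Reduce each input column with its own single-value reduction, in series."""
    results, total_steps = [], 0
    for column in columns:
        total, steps = single_value_reduce(column)
        results.append(total)
        total_steps += steps  # step counts add up because reductions run serially
    return results, total_steps


sums, steps_used = serial_multivalue_reduce([[1] * 8, list(range(8))])
```

In this sketch, reducing two sets of 8 values takes 3 + 3 = 6 skip steps, so the step count grows linearly with the number of values being reduced.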