There is increasing interest in General Purpose Graphics Processing Units (GPGPUs) that include a plurality of streaming multiprocessors. GPGPUs are GPUs that may also be used for other types of processing, such as image processing and scientific processing. Background information on GPGPUs and streaming multiprocessors is described in the book, GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, editors Matt Pharr and Randima Fernando, Pearson Education (2005), the contents of which are hereby incorporated by reference.
Advances in semiconductor technology permit a GPGPU to have a large number of computation units on a single die. As described in chapter 29 of GPU Gems 2, in a stream programming model, all data is represented as a stream, where a stream is an ordered set of data of the same data type. Kernels operate on entire streams of elements, and applications are constructed by chaining multiple kernels together. Since kernels operate on entire streams, stream elements can be processed in parallel using parallel multiprocessors. One model for a high performance GPU includes task-level parallelism, in that all kernels can be run simultaneously, and data-level parallelism, in that data is processed in parallel computation units.
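By way of illustration only, the stream programming model described above may be sketched in Python as follows. The names run_kernel, run_pipeline, scale, and offset are hypothetical and do not appear in GPU Gems 2, and a thread pool is assumed here merely as a stand-in for parallel computation units. The sketch shows a stream as an ordered list of elements of one data type, a kernel applied to every element of the stream, and an application built by chaining kernels.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

# Hypothetical type alias: a kernel maps one stream element to one output element.
Kernel = Callable[[float], float]


def run_kernel(kernel: Kernel, stream: Sequence[float], workers: int = 4) -> List[float]:
    """Apply one kernel to an entire stream.

    Elements are independent of each other, so they may be processed in
    parallel. The thread pool is only an assumed stand-in for parallel
    computation units; map() returns results in the original stream order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, stream))


def run_pipeline(kernels: Sequence[Kernel], stream: Sequence[float]) -> List[float]:
    """Chain kernels: the output stream of one kernel is the input of the next."""
    current = list(stream)
    for kernel in kernels:
        current = run_kernel(kernel, current)
    return current


def scale(x: float) -> float:
    """Example kernel: multiply an element by two."""
    return 2.0 * x


def offset(x: float) -> float:
    """Example kernel: add one to an element."""
    return x + 1.0


# A stream is an ordered set of elements of the same data type.
result = run_pipeline([scale, offset], [1.0, 2.0, 3.0])
print(result)  # [3.0, 5.0, 7.0]
```

In this sketch each kernel runs to completion before the next begins, so only the data-level parallelism is shown. Running all kernels simultaneously would additionally require each kernel to begin consuming elements while the preceding kernel is still producing them.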
One problem associated with a highly parallel streaming multiprocessor GPGPU is handling data dependencies. Since the streaming multiprocessors are designed to perform parallel computations, they typically operate independently of each other, with no significant direct communication between them to synchronize data flow. Dependent calculations, however, require that the flow of data between streaming multiprocessors be controlled, and conventional techniques for doing so would require comparatively complex hardware. For example, while snooping techniques or directories might be used to monitor and control the flow of data between individual streaming multiprocessors, this would increase the cost and complexity of the GPGPU architecture.
Therefore, in light of the above-described problems, the apparatus, system, and method of the present invention were developed.