1. Field of the Invention
The present invention relates generally to data processing by concurrent processes on multiple processing units.
2. Background Art
In many applications, such as graphics processing, protein folding, encryption/decryption, video encoding/decoding, and the like, a sequence of threads processes one or more data items in order to output a final result. In many modern parallel processors, for example, several single instruction multiple data (SIMD) processors concurrently execute sequences of groups of threads. Typically, the concurrently executing threads are identical (i.e., have an identical code base), and the thread sequences executed on the respective SIMD processors are also the same. A plurality of identical concurrent threads that are executed on separate processors is known as a thread wavefront.
When processing using a sequence of thread wavefronts, a first thread wavefront typically retrieves data from memory, performs some arithmetic processing upon the retrieved data, and then writes the processed data back into the memory. A second thread wavefront, typically executing on the same processor(s), can then retrieve the data written to memory by the first thread wavefront and perform further processing. However, if the data written by the first thread wavefront is not in the memory at the time that the data is required by the second thread wavefront, the second thread wavefront may not be able to proceed as intended and thus has to wait until the required data is in memory. This results in a wavefront stall.
In conventional parallel processing systems, for example, the threads of the second wavefront can each poll memory and wait for the data from the first wavefront to be available in the memory. Delays in writing data to the memory by a first wavefront can result in frequent and repetitive polling by a second wavefront requesting that data. Such frequent and repetitive polling can consume substantial portions of memory bandwidth and can also increase the memory footprint of synchronization buffers. The resulting reduction in available memory bandwidth and the increased memory footprint of the synchronization buffers can lead to further performance inefficiencies.
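By way of illustration only, the conventional polling behavior described above may be sketched on a host CPU as follows. The sketch is not taken from any particular parallel processor. Ordinary C++ threads stand in for the threads of a wavefront, and the names SharedBuffer, PollResult, and run_two_wavefronts, together with the doubling operation performed by the second wavefront, are hypothetical and chosen solely for this example. A single thread models the delayed write of the first wavefront, while each thread of the second wavefront repeatedly reads a ready flag until the data is available, with every such read counted as one poll.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for memory visible to both wavefronts.
struct SharedBuffer {
    std::vector<int> data;
    std::atomic<bool> ready{false};  // flag polled by the second wavefront
    explicit SharedBuffer(std::size_t n) : data(n, 0) {}
};

struct PollResult {
    long long sum;         // combined output of the second wavefront
    long long poll_count;  // flag reads issued while the data was unavailable
};

// Models a first wavefront whose write is delayed and a second wavefront
// whose threads each poll memory until that write becomes visible.
PollResult run_two_wavefronts(std::size_t lanes,
                              std::chrono::milliseconds write_delay) {
    SharedBuffer buf(lanes);
    std::atomic<long long> polls{0};
    std::vector<long long> partial(lanes, 0);

    // Second wavefront: one thread per lane, each polling independently.
    std::vector<std::thread> second;
    for (std::size_t i = 0; i < lanes; ++i) {
        second.emplace_back([&, i] {
            while (!buf.ready.load(std::memory_order_acquire)) {
                polls.fetch_add(1, std::memory_order_relaxed);
                std::this_thread::yield();  // stalled: no useful work is done
            }
            partial[i] = 2LL * buf.data[i];  // further processing (assumed)
        });
    }

    // First wavefront: writes its results late, then publishes the flag.
    std::thread first([&] {
        std::this_thread::sleep_for(write_delay);
        for (std::size_t i = 0; i < lanes; ++i) {
            buf.data[i] = static_cast<int>(i) + 1;
        }
        buf.ready.store(true, std::memory_order_release);
    });

    first.join();
    for (auto& t : second) {
        t.join();
    }

    long long sum = 0;
    for (long long p : partial) {
        sum += p;
    }
    return {sum, polls.load()};
}
```

In such a sketch, the poll count generally grows with both the write delay and the number of polling threads, which mirrors the memory bandwidth consumed by repetitive polling in the conventional systems described above.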
What are needed, therefore, are methods and systems to improve synchronization of thread wavefronts so that wavefront stalls can be reduced or eliminated.