The present invention relates in general to parallel data processing, and in particular to parallel data processing methods using arrays of threads that are capable of sharing data, including intermediate results, with other threads in a thread-specific manner.
Parallel processing techniques enhance throughput of a processor or multiprocessor system when multiple independent computations need to be performed. A computation can be divided into tasks, with each task being performed as a separate thread. (As used herein, a “thread” refers generally to an instance of execution of a particular program using particular input data.) Parallel threads are executed simultaneously using different processing engines, allowing more processing work to be completed in a given amount of time.
Numerous existing processor architectures support parallel processing. The earliest such architectures used multiple discrete processors networked together. More recently, multiple processing cores have been fabricated on a single chip. These cores are controlled in various ways. In some instances, known as multiple-instruction, multiple data (MIMD) machines, each core independently fetches and issues its own instructions to its own processing engine (or engines). In other instances, known as single-instruction, multiple-data (SIMD) machines, a core has a single instruction unit that issues the same instruction in parallel to multiple processing engines, which execute the instruction on different input operands. SIMD machines generally have advantages in chip area (since only one instruction unit is needed) and therefore cost; the downside is that parallelism is only available to the extent that multiple instances of the same instruction can be executed concurrently.
Graphics processors have used very wide SIMD architectures to achieve high throughput in image-rendering applications. Such applications generally entail executing the same programs (vertex shaders or pixel shaders) on large numbers of objects (vertices or primitives). Since each object is processed independently of all others using the same sequence of operations, a SIMD architecture provides considerable performance enhancement at reasonable cost. Typically, a GPU includes one SIMD core (e.g., 200 threads wide) that executes vertex shader programs, and another SIMD core of comparable size that executes pixel shader programs. In high-end GPUs, multiple sets of SIMD cores are sometimes provided to support an even higher degree of parallelism.
Parallel processing architectures often require that parallel threads be independent of each other, i.e., that no thread uses data generated by another thread executing in parallel or concurrently with it. In other cases, limited data-sharing capacity is available. For instance, some SIMD and MIMD machines provide a shared memory or global register file that is accessible to all of the processing engines. One engine can write data to a register that is subsequently read by another processing engine. Some parallel machines pass messages (including data) between processors using an interconnection network or shared memory. In other architectures (e.g., a systolic array), subsets of processing engines have shared registers, and two threads executing on engines with a shared register can share data by writing it to that register. In such instances, the programmer is required to specifically program each thread for data sharing, so that different threads are no longer executing the same program.
It would therefore be desirable to provide systems and methods for parallel processing that facilitate sharing of data among concurrently-executing threads.