The present invention relates in general to parallel data processing, and in particular to a multithreaded parallel processor with loading of groups of threads for single-instruction, multiple-data (SIMD) execution.
Parallel processing techniques enhance throughput of a processor or multiprocessor system when multiple independent computations need to be performed. A computation can be divided into tasks that are defined by programs, with each task being performed as a separate thread. (As used herein, a “thread” refers generally to an instance of execution of a particular program using particular input data, and a “program” refers to a sequence of executable instructions that produces result data from input data.) Parallel threads are executed simultaneously using different processing engines inside the processor.
Numerous existing processor architectures support parallel processing. The earliest such architectures used multiple discrete processors networked together. More recently, multiple processing cores have been fabricated on a single chip. These cores are controlled in various ways. In some instances, known as multiple-instruction, multiple-data (MIMD) machines, each core independently fetches and issues its own instructions to its own processing engine (or engines). In other instances, known as single-instruction, multiple-data (SIMD) machines, a core has a single instruction unit that issues the same instruction in parallel to multiple processing engines, which execute the instruction on different input operands. SIMD machines generally have advantages in chip area (since only one instruction unit is needed) and therefore in cost; the downside is that parallelism is available only to the extent that multiple instances of the same instruction can be executed concurrently.
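By way of illustration only, the difference between the two issue models can be sketched in a few lines of Python. The function names simd_issue and mimd_issue, and the sample instructions double and increment, are hypothetical and are introduced solely for this sketch. The loops run sequentially and model only which instruction reaches which processing engine, not actual concurrent hardware execution.

```python
from typing import Callable, List, Sequence

Instruction = Callable[[int], int]


def simd_issue(instruction: Instruction, operands: Sequence[int]) -> List[int]:
    """Model a SIMD core: one instruction is issued to every engine,
    and each engine applies it to its own operand."""
    return [instruction(x) for x in operands]


def mimd_issue(instructions: Sequence[Instruction],
               operands: Sequence[int]) -> List[int]:
    """Model MIMD cores: each core issues its own instruction
    to its own engine, which applies it to its own operand."""
    if len(instructions) != len(operands):
        raise ValueError("one instruction is required per operand")
    return [instr(x) for instr, x in zip(instructions, operands)]


def double(x: int) -> int:
    return 2 * x


def increment(x: int) -> int:
    return x + 1


# SIMD: a single instruction (double) covers all four operands.
simd_result = simd_issue(double, [1, 2, 3, 4])

# MIMD: four independently chosen instructions, one per operand.
mimd_result = mimd_issue([double, increment, double, increment], [1, 2, 3, 4])
```

In this sketch the SIMD model needs only one instruction per group of operands, which corresponds to the single instruction unit noted above, whereas the MIMD model needs one instruction per engine but permits each engine to do different work.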
Conventional graphics processors use very wide SIMD architectures to achieve high throughput in image-rendering applications. Such applications generally entail executing the same programs (vertex shaders or pixel shaders) on large numbers of objects (vertices or pixels). Since each object is processed independently of all others but using the same sequence of operations, a SIMD architecture provides considerable performance enhancement at reasonable cost. Typically, a graphics processing unit (GPU) includes one SIMD core that executes vertex shader programs, and another SIMD core of comparable size that executes pixel shader programs. In high-end GPUs, multiple sets of SIMD cores are sometimes provided to support an even higher degree of parallelism.
These designs have several shortcomings. First, the separate processing cores for vertex shader programs and pixel shader programs are separately designed and tested, often leading to at least some duplication of effort. Second, the division of the graphics processing load between vertex operations and pixel operations varies greatly from one application to another. As is known in the art, detail can be added to an image by using many small primitives, which increases the load on the vertex shader core, and/or by using complex texture-mapping and pixel shading operations, which increases the load on the pixel shader core. In most cases, the loads are not perfectly balanced, and one core or the other is underused. For instance, in a pixel-intensive application, the pixel shader core may run at maximum throughput while the vertex shader core is idle, waiting for already-processed vertices to move into the pixel shader stage of the pipeline. Conversely, in a vertex-intensive application, the vertex shader core may run at maximum throughput while the pixel shader core is idle, waiting for new vertices to be supplied. In either case, some fraction of the available processing cycles is effectively wasted.
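The effect of such an imbalance can be estimated with a simple calculation, sketched below in Python under idealized assumptions that do not come from any particular GPU: work is measured in abstract units, each engine completes one unit per cycle, the two stages are fully pipelined, and the slower stage sets the total time. The function name fixed_split_utilization and all numeric values are hypothetical.

```python
def fixed_split_utilization(vertex_work: float, pixel_work: float,
                            vertex_engines: int, pixel_engines: int) -> float:
    """Return the fraction of engine cycles doing useful work when the
    vertex and pixel engines are fixed and cannot share load.

    Assumes (for illustration only) one work unit per engine per cycle
    and that the slower of the two stages determines the total time.
    """
    if vertex_engines <= 0 or pixel_engines <= 0:
        raise ValueError("each core needs at least one engine")
    if vertex_work < 0 or pixel_work < 0:
        raise ValueError("work must be non-negative")

    # Cycles needed by the bottleneck stage.
    cycles = max(vertex_work / vertex_engines, pixel_work / pixel_engines)
    if cycles == 0:
        return 0.0  # no work at all, so nothing is utilized

    useful_cycles = vertex_work + pixel_work
    available_cycles = (vertex_engines + pixel_engines) * cycles
    return useful_cycles / available_cycles


# Pixel-intensive load on two cores of comparable size (8 engines each).
pixel_heavy = fixed_split_utilization(100, 900, 8, 8)

# Perfectly balanced load on the same hardware.
balanced = fixed_split_utilization(500, 500, 8, 8)
```

Under these assumptions the pixel-intensive case uses 1000 of 1800 available engine cycles, or roughly 56 percent, while the balanced case uses all of them. The remaining cycles in the first case belong almost entirely to the idle vertex shader core.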
It would therefore be desirable to provide a graphics processor that can adapt to varying loads on different shaders while maintaining a high degree of parallelism.