The present invention relates in general to parallel data processing, and in particular to SIMD instruction execution with multiple-state, multiple-data behavior.
Parallel processing techniques enhance throughput of a processor or multiprocessor system when multiple independent computations need to be performed. A computation can be divided into tasks that are defined by programs, with each task being performed as a separate thread. (As used herein, a “thread” refers generally to an instance of execution of a particular program using particular input data, and a “program” refers to a sequence of executable instructions that produces result data from input data.) Parallel threads are executed simultaneously using different processing engines inside the processor.
Numerous existing processor architectures support parallel processing. The earliest such architectures used multiple discrete processors networked together. More recently, multiple processing cores have been fabricated on a single chip. These cores (or discrete processors) are controlled in various ways. In some instances, known as multiple-instruction, multiple-data (MIMD) machines, each core independently fetches and issues its own instructions to its own processing engine (or engines). In other instances, known as single-instruction, multiple-data (SIMD) machines, a core has a single instruction unit that issues the same instruction in parallel to multiple processing engines, which execute the instruction on different input operands. SIMD machines generally have advantages in chip area (since only one instruction unit is needed) and therefore cost; the downside is that parallelism is only available to the extent that multiple instances of the same instruction can be executed concurrently.
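The distinction between the two control schemes can be illustrated with a minimal software sketch. The sketch is illustrative only and is not drawn from any particular architecture: the names `simd_issue` and `mimd_issue` are hypothetical, each list element stands in for one processing engine, and the engines are modeled sequentially rather than in true parallel hardware.

```python
from typing import Callable, List, Sequence

# Hypothetical model: an "instruction" is a function applied to one operand.
Op = Callable[[int], int]


def simd_issue(op: Op, operands: Sequence[int]) -> List[int]:
    """Model a SIMD core: one instruction unit issues the SAME
    instruction to every engine, each with a different operand."""
    return [op(x) for x in operands]


def mimd_issue(ops: Sequence[Op], operands: Sequence[int]) -> List[int]:
    """Model a MIMD machine: each engine receives its OWN instruction,
    so the number of instructions must match the number of engines."""
    if len(ops) != len(operands):
        raise ValueError("one instruction per engine is required")
    return [op(x) for op, x in zip(ops, operands)]


# SIMD: a single instruction (doubling) applied across four engines.
simd_result = simd_issue(lambda x: x * 2, [1, 2, 3, 4])

# MIMD: two engines, each executing a different instruction.
mimd_result = mimd_issue([lambda x: x + 1, lambda x: x * x], [3, 4])
```

In this model the SIMD case needs only one instruction per step regardless of the number of engines, which mirrors the chip-area advantage noted above, while the MIMD case needs one instruction per engine.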
Conventional graphics processors use very wide SIMD architectures to achieve high throughput in image-rendering applications. Such applications generally entail executing the same programs (e.g., vertex shaders or pixel shaders) on large numbers of objects (e.g., vertices or pixels). Since each object is processed independently of all others but using the same sequence of operations, a SIMD architecture provides considerable performance enhancement at reasonable cost. In high-end GPUs, multiple SIMD cores are sometimes provided to support an even higher degree of parallelism.
One difficulty with SIMD instruction execution is management of changes in the state information associated with the program to be executed. For instance, the identifier of a primitive to be applied in a pixel shader program, which is typically shared across multiple pixels, is usually supplied as state information. In existing SIMD architectures, an issued instruction is executed for all program instances using the same state parameters. Thus, it is often necessary to break up program instances into multiple separately executed SIMD groups at points in the input data stream where state parameters change, e.g., each time there is a transition from one primitive to the next in the case of pixel shaders. As a result, the SIMD core may execute two SIMD groups that are less than fully populated instead of one fully populated group, resulting in inefficient use of the core's resources. In general, the more frequently state parameters change, the greater the resulting inefficiency. Further, as the maximum width of a SIMD group (i.e., the number of parallel program instances the group can hold) increases, so does the likelihood that a state change falls within a group, and therefore the likelihood that the core executes groups that are less than fully populated.
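The effect of state changes on group occupancy can be illustrated with a minimal sketch. The sketch is an assumption-laden model, not a description of any actual hardware scheduler: the function names `split_into_simd_groups` and `utilization` are hypothetical, each program instance is represented only by its state parameter (e.g., a primitive identifier), and a new group is started whenever that parameter changes or the group is full.

```python
from typing import Hashable, List, Sequence


def split_into_simd_groups(states: Sequence[Hashable], width: int) -> List[List[int]]:
    """Pack instance indices into SIMD groups of at most `width` entries,
    closing the current group whenever the state parameter changes
    (the conventional behavior described above)."""
    if width <= 0:
        raise ValueError("width must be positive")
    groups: List[List[int]] = []
    current: List[int] = []
    for i, state in enumerate(states):
        # All instances in a group must share the state of its first member.
        state_changed = bool(current) and states[current[0]] != state
        if state_changed or len(current) == width:
            groups.append(current)
            current = []
        current.append(i)
    if current:
        groups.append(current)
    return groups


def utilization(groups: Sequence[Sequence[int]], width: int) -> float:
    """Fraction of available SIMD lanes that are actually occupied."""
    if not groups:
        return 0.0
    return sum(len(g) for g in groups) / (len(groups) * width)


# Eight pixels: five covered by primitive "A", then three by primitive "B".
pixel_states = ["A"] * 5 + ["B"] * 3
groups = split_into_simd_groups(pixel_states, width=8)
occupancy = utilization(groups, width=8)
```

For this input, the eight instances would fit in a single group of width 8, but the transition from primitive "A" to primitive "B" forces two groups of five and three instances, so only half of the available lanes are occupied.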
It would therefore be desirable to provide SIMD instruction execution in a manner that allows multiple values of state parameters to coexist within the same SIMD group.