The present invention relates in general to graphics processors, and in particular to a parallel array architecture for a graphics processor.
Parallel processing techniques enhance throughput of a processor or multiprocessor system when multiple independent computations need to be performed. A computation can be divided into tasks that are defined by programs, with each task being performed as a separate thread. (As used herein, a “thread” refers generally to an instance of execution of a particular program using particular input data, and a “program” refers to a sequence of executable instructions that produces result data from input data.) Parallel threads are executed simultaneously using different processing engines inside the processor.
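By way of illustration only, the definitions above can be sketched in software. In the sketch below, `program` is a hypothetical sequence of instructions and each entry of `inputs` is the particular input data for one thread; both names are assumptions introduced for this example. Host threads stand in for processing engines, so the sketch illustrates the thread and program terminology rather than hardware-level parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def program(input_data):
    """A 'program': a fixed sequence of instructions that maps input data to result data."""
    total = 0
    for x in input_data:
        total += x * x  # same instructions regardless of which thread runs them
    return total


# Each element is the particular input data for one thread (hypothetical values).
inputs = [[1, 2], [3, 4], [5, 6]]

# One thread per input: each thread is an instance of `program` bound to its own data.
with ThreadPoolExecutor(max_workers=len(inputs)) as pool:
    results = list(pool.map(program, inputs))  # results are returned in input order

print(results)
```

Each thread executes the same program; only the input data differs from one thread to the next.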
Numerous existing processor architectures support parallel processing. The earliest such architectures used multiple discrete processors networked together. More recently, multiple processing cores have been fabricated on a single chip. These cores are controlled in various ways. In some instances, known as multiple-instruction, multiple-data (MIMD) machines, each core independently fetches and issues its own instructions to its own processing engine (or engines). In other instances, known as single-instruction, multiple-data (SIMD) machines, a core has a single instruction unit that issues the same instruction in parallel to multiple processing engines, which execute the instruction on different input operands. SIMD machines generally have advantages in chip area (since only one instruction unit is needed) and therefore cost; the downside is that parallelism is only available to the extent that multiple instances of the same instruction can be executed concurrently.
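The difference in instruction-issue cost between the two control schemes can be sketched with a simplified software model. The functions `run_simd` and `run_mimd` and the two-instruction program below are hypothetical and are not taken from any particular machine; the model counts instruction issues only and ignores fetch, decode, and timing.

```python
import operator


def run_simd(program, lanes):
    """One instruction unit: each instruction is issued once and applied to every lane."""
    issues = 0
    values = list(lanes)
    for op, operand in program:
        issues += 1  # a single issue serves all processing engines
        values = [op(v, operand) for v in values]
    return values, issues


def run_mimd(programs, inputs):
    """Each core fetches and issues its own instruction stream."""
    issues = 0
    results = []
    for program, value in zip(programs, inputs):
        for op, operand in program:
            issues += 1  # every core issues independently
            value = op(value, operand)
        results.append(value)
    return results, issues


# Hypothetical program: multiply by 2, then add 1.
scale_then_offset = [(operator.mul, 2), (operator.add, 1)]
inputs = [1, 2, 3, 4]

simd_values, simd_issues = run_simd(scale_then_offset, inputs)
mimd_values, mimd_issues = run_mimd([scale_then_offset] * len(inputs), inputs)

print(simd_values, simd_issues)
print(mimd_values, mimd_issues)
```

Under these assumptions both models produce the same results, but the SIMD model issues each instruction once for all four operands, whereas the MIMD model issues it once per core. The MIMD model could instead run a different program on each core, which the SIMD model cannot do.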
Conventional graphics processors use wide SIMD architectures to achieve high throughput in image-rendering applications. Such applications generally entail executing the same programs (vertex shaders or pixel shaders) on large numbers of objects (vertices or pixels). Since each object is processed independently of all others but using the same sequence of operations, a SIMD architecture provides considerable performance enhancement at reasonable cost. Some graphics processing unit (GPU) architectures include one SIMD core that executes vertex shader programs and another SIMD core that executes pixel shader programs. Other GPUs have a unified shader architecture, in which a single SIMD core executes both vertex and pixel shader programs. In high-end GPUs, multiple sets of SIMD cores may be provided to support an even higher degree of parallelism.
Moreover, graphics processing includes the manipulation, processing and displaying of images on video display screens. In order to process images efficiently, different components of the GPU specialize in performing different tasks. For example, GPUs can include vertex processing units, texture units, color fragment processing units, Z-fragment processing units, etc. In order to process images using these different components, large amounts of data must be quickly and reliably transferred between the memory and the different components. The transfer of data between the different components and the memory can be done with crossbars that appropriately route data. These crossbars are implemented using tens of thousands of wires. As technology progresses and the degree of parallelism increases, these crossbars become larger in proportion to the square of the degree of parallelism. This growth significantly limits the degree to which GPUs can take advantage of improvements in process technology.
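The quadratic growth noted above can be illustrated with a short calculation. The sketch assumes a full crossbar, in which every source port can reach every destination port, and further assumes that the numbers of source and destination ports both grow linearly with the degree of parallelism. The function name and the port counts are hypothetical and do not describe any particular GPU.

```python
def crossbar_crosspoints(sources, destinations):
    """Crosspoints in a full crossbar: one per (source, destination) pair."""
    return sources * destinations


# Hypothetical degrees of parallelism, with one source port and one
# destination port per parallel unit.
for parallelism in (4, 8, 16, 32):
    points = crossbar_crosspoints(parallelism, parallelism)
    print(parallelism, points)

# Doubling the degree of parallelism quadruples the crosspoint count.
growth = crossbar_crosspoints(16, 16) / crossbar_crosspoints(8, 8)
print(growth)
```

Under these assumptions, each doubling of the degree of parallelism quadruples the number of crosspoints, which is the scaling behavior that limits the benefit obtained from a denser process.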
It would therefore be desirable to provide a graphics processor that reduces the number of crossbars without affecting performance.