A graphics processing unit (GPU) is a standard element of the modern desktop personal computer (PC). Initially a feature of high-end graphics workstations, the GPU has found its way onto the PC bus as an accelerator of graphics functions for which a conventional central processing unit (CPU) was ill suited or simply too slow. Presently, the GPU is a prominent component of the PC with its own dedicated path to main CPU memory as well as its own dedicated graphics memory.
Interactive computer graphics began as line drawings on calligraphic displays, which were basically modified oscilloscopes. The computation for these early displays required vector operations including general geometric transformations, clipping to boundaries of the display devices, and perspective transformations for three-dimensional (3D) displays. The advent of inexpensive commodity semiconductor memory prompted the replacement of these line drawing systems by raster graphics processors, which refreshed television-like displays through a frame buffer memory.
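The vector operations named above can be made concrete with a small sketch. The following Python fragment is purely illustrative (the function names are invented here, and it clips individual points rather than line segments, which a real line-drawing system would split with an algorithm such as Cohen-Sutherland):

```python
def perspective_project(point, focal_length=1.0):
    """Project a 3D point onto the 2D display plane.
    Perspective division by depth (z) makes distant
    geometry appear smaller."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)

def clip_to_display(p, xmin=-1.0, xmax=1.0, ymin=-1.0, ymax=1.0):
    """Point-level clip test: discard anything outside the
    display boundary (a stand-in for true segment clipping)."""
    x, y = p
    return p if xmin <= x <= xmax and ymin <= y <= ymax else None

# A point twice as far away projects to half the screen offset.
print(perspective_project((2.0, 4.0, 2.0)))   # (1.0, 2.0)
print(perspective_project((2.0, 4.0, 4.0)))   # (0.5, 1.0)
print(clip_to_display((0.5, 1.0)))            # inside: kept
print(clip_to_display((1.0, 2.0)))            # outside: None
```

Early calligraphic displays performed exactly this kind of per-vertex arithmetic for every refresh, which is what made hardware acceleration attractive.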
Because raster graphics systems could display shaded solid surfaces, which are generally preferable to line drawings for a wide range of applications, the raster graphics processor quickly displaced the line drawing system. Instead of straight line segments, the geometric primitives for these raster graphics systems were polyhedral surfaces constructed from arrays of triangles. The display primitive was a rectangular array of pixels stored in the frame buffer memory, with rows of the array corresponding to the discrete scan lines on the raster scan cathode ray tube (CRT).
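The correspondence between frame buffer rows and CRT scan lines can be sketched in a few lines of Python; the dimensions and single grayscale channel are illustrative simplifications (real frame buffers are far larger and store red, green, and blue per pixel):

```python
WIDTH, HEIGHT = 8, 4  # a toy resolution for illustration only

# The frame buffer: one row per scan line, one entry per pixel.
framebuffer = [[0] * WIDTH for _ in range(HEIGHT)]

def set_pixel(x, y, value):
    """Row y of the array corresponds to the y-th scan line
    refreshed on the raster-scan display."""
    framebuffer[y][x] = value

set_pixel(3, 1, 255)
print(framebuffer[1])  # scan line 1 now carries one lit pixel
```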
The process of turning triangles into pixels is called “rasterization” and represents an additional step in the display pipeline. The geometric processing steps prior to rasterization are the same as those in the line drawing pipeline and are retained in the raster graphics display pipeline. A major addition to the raster graphics display pipeline is texture mapping, which is the ability to query rectangular images in arbitrary order to fetch color detail that can be applied to a solid surface as a decal. Over the years this process has been generalized to include mapping of almost any property, not just color, to a rasterized surface at a per-pixel rate. Although the features of the texture-mapped rendering pipeline have become richer, it still lacks the flexibility to match the visual realism produced by the per-pixel shading calculations seen in software rendering systems.
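The "arbitrary order" image query at the heart of texture mapping can be sketched as a per-pixel lookup. The following Python fragment uses nearest-neighbor sampling with coordinate wrapping; the function name and texture format are assumptions made for illustration, not any particular API:

```python
def sample_texture(texture, u, v):
    """Fetch a texel at normalized coordinates (u, v),
    nearest-neighbor style.  The texture is a list of rows,
    each a list of color values."""
    height = len(texture)
    width = len(texture[0])
    # Wrap coordinates so the texture tiles across the surface.
    x = int(u % 1.0 * width)
    y = int(v % 1.0 * height)
    return texture[y][x]

# A 2x2 checkerboard texture: 0 = black, 255 = white.
checker = [[0, 255],
           [255, 0]]

# During rasterization each pixel interpolates its own (u, v)
# from the triangle's vertices and fetches color independently,
# which is why the lookups arrive in arbitrary order.
print(sample_texture(checker, 0.1, 0.1))  # 0 (black)
print(sample_texture(checker, 0.6, 0.1))  # 255 (white)
```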
One solution to this problem is to build hardware flexible enough to “inline” code for the rendering of a surface. The hardware still takes triangles as input, but the computation of each pixel is specified by a program that can be loaded before the triangles are rendered. These programmable elements are called “pixel shaders” or, alternatively, “fragment shaders”; the two terms are used interchangeably in this document. The instructions of a shader program are close to assembly language, since each has a direct hardware implementation. There are competing languages (and hardware) for shaders, such as the High Level Shader Language (HLSL), Cg (C for graphics) from NVIDIA, and DirectX assembly code.
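The idea of a loadable per-pixel program can be sketched schematically in Python; this is an analogue, not any particular shading language, and the Lambertian diffuse shader and fragment format used here are illustrative assumptions:

```python
def lambert_shader(normal, light_dir, base_color):
    """A hypothetical fragment shader: Lambertian diffuse shading.
    It runs once per pixel on interpolated per-fragment inputs."""
    # Dot product of unit vectors: cosine of the light angle.
    intensity = max(0.0, sum(n * l for n, l in zip(normal, light_dir)))
    return tuple(c * intensity for c in base_color)

def shade_fragments(fragments, shader, light_dir):
    """Apply whatever shader program is currently loaded to
    every fragment produced by rasterization."""
    return [shader(normal, light_dir, color)
            for normal, color in fragments]

# Two fragments from a rasterized triangle (illustrative values):
# each carries an interpolated normal and a base color.
frags = [((0.0, 0.0, 1.0), (1.0, 0.5, 0.0)),   # facing the light
         ((1.0, 0.0, 0.0), (1.0, 0.5, 0.0))]   # edge-on to it
print(shade_fragments(frags, lambert_shader, (0.0, 0.0, 1.0)))
```

Swapping `lambert_shader` for a different function changes the surface's appearance without altering the triangle input, which is precisely the flexibility the fixed-function pipeline lacked.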
Co-processors are not a new feature of personal computers. The integral floating-point co-processor shares the host CPU's instruction stream and has access to CPU memory. Additional integrated co-processors include CPU extensions such as multimedia extensions (MMX) and streaming SIMD extensions (SSE), which have parallel data paths, asynchronous execution, and access to CPU memory. GPUs, on the other hand, are autonomous special-purpose processors with their own instruction streams, datapaths, and dedicated memory. Trends in GPU design and configuration have given them larger dedicated memory, higher bandwidth to graphics memory, increased internal parallelism, and increasing degrees of programmability.
The memory directly available to a GPU is large but finite. Furthermore, the programmability of GPUs remains limited, both by the small number of instructions that can fit into a GPU computing element and by the special-purpose nature of the arithmetic units. This limitation forces programmers to break operations into smaller parts, which are executed in sequence, with intermediate results stored in memory. The finite size of graphics memory means that the efficient layout of those intermediate results is of critical importance to the programmer. Further complicating data layout are the peculiarities of the special-purpose graphics datapaths: clever data layout affects not only the programmer's ability to fit the results into graphics memory but also the efficiency of execution. Therefore, what is needed is a system and method for optimizing the performance of a GPU that enables operations to be executed by the GPU more efficiently than is possible using a CPU. Moreover, what is needed is a system and method for optimizing the performance of a GPU that provides an efficient layout of data in memory, in order to make maximum use of the memory and provide efficient execution by the GPU.
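The decomposition of a large operation into sequenced smaller parts can be sketched with a classic example, a multi-pass reduction. On actual hardware each pass would be a rendering pass writing its intermediate result to a buffer in graphics memory; Python is used here only for illustration, and the function names are hypothetical:

```python
def reduction_pass(data):
    """One 'pass': combine adjacent pairs of values.  On a GPU
    this would be a single draw writing to a half-size buffer."""
    return [data[i] + data[i + 1] if i + 1 < len(data) else data[i]
            for i in range(0, len(data), 2)]

def multipass_sum(data):
    """Sum a buffer by repeated passes; each pass's output is an
    intermediate result that must be laid out in (graphics)
    memory before the next pass can consume it."""
    while len(data) > 1:
        data = reduction_pass(data)  # intermediate buffer
    return data[0]

print(multipass_sum([1, 2, 3, 4, 5]))  # 15, after three passes
```

Because every pass allocates an intermediate buffer, the layout and reuse of that memory is exactly the kind of problem the preceding paragraph identifies.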