A Graphics Processing Unit (GPU) is typically implemented as a graphics pipeline. Conventionally, one or more stages perform vertex and geometry processing to generate graphical primitives mapped to a two-dimensional screen space. The graphical primitives are typically triangles, although other primitive shapes may also be used. A setup stage is typically included to calculate the edges of graphical primitives (e.g., triangles) and perform other calculations to prepare the primitives for rasterization. A raster stage rasterizes the primitives by determining the coverage of the primitives across sample locations of pixels and generating a set of fragments and pixel locations. Conventionally, the fragments are provided to downstream components as a stream of pixels, typically in groups of four pixels known as “quads.” A pixel processing stage, such as a shader stage, processes the fragments and colors the pixels. A raster operations (ROP) stage performs various functions such as frame buffer blending, antialiasing, and Z operations such as Z-culling. Modern GPUs also typically implement many stages with parallel processing units, such as parallel pixel shading elements working in parallel on different pixels.
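By way of illustration only, the setup and raster stages described above can be sketched in software as follows. The sketch assumes a single sample at each pixel center, 2x2 quads aligned to even pixel coordinates, and a simple inclusive edge test in place of a hardware fill rule. The names `edge` and `rasterize_to_quads` are hypothetical and are not taken from any particular GPU.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Pixel = Tuple[int, int]


def edge(a: Point, b: Point, p: Point) -> float:
    """Edge function: twice the signed area of triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize_to_quads(v0: Point, v1: Point, v2: Point,
                       width: int, height: int) -> Dict[Pixel, List[Pixel]]:
    """Map each 2x2 quad origin to the pixels of that quad the triangle covers."""
    quads: Dict[Pixel, List[Pixel]] = {}
    area = edge(v0, v1, v2)
    if area == 0:
        return quads  # degenerate triangle covers nothing
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # assumed sample location: pixel center
            w0 = edge(v1, v2, p)
            w1 = edge(v2, v0, p)
            w2 = edge(v0, v1, p)
            # Accept either winding; samples exactly on an edge count as
            # covered (a real rasterizer would apply a fill rule here).
            if area > 0:
                inside = w0 >= 0 and w1 >= 0 and w2 >= 0
            else:
                inside = w0 <= 0 and w1 <= 0 and w2 <= 0
            if inside:
                origin = (x & ~1, y & ~1)  # clear low bits to find the quad
                quads.setdefault(origin, []).append((x, y))
    return quads


quads = rasterize_to_quads((0.0, 0.0), (4.0, 0.0), (0.0, 4.0), 4, 4)
```

In this sketch a quad that is only partially covered still appears as a single entry, which mirrors the way a partially covered quad is conventionally passed downstream as one unit.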
Pixel shading typically requires a comparatively large amount of processing resources. In addition to performing shading operations for each quad group of pixels, a pixel shader may also have to perform a hazard avoidance check for each group of pixels that it receives. That is, each group of pixels must be checked for conflicts that would arise from reading or writing data at the same (x,y) pixel locations as other quads.
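A minimal sketch of such a per-quad hazard avoidance check is given below. It assumes that a hazard exists whenever an incoming quad shares any (x,y) location with a quad still being processed, and that a conflicting quad is simply held back until the earlier one retires. The class `HazardTracker` and its methods are hypothetical names introduced for illustration.

```python
from typing import Iterable, Set, Tuple

Pixel = Tuple[int, int]


class HazardTracker:
    """Tracks (x, y) locations owned by quads that are still in flight."""

    def __init__(self) -> None:
        self._in_flight: Set[Pixel] = set()

    def try_issue(self, quad: Iterable[Pixel]) -> bool:
        """Issue the quad unless it overlaps any in-flight pixel location."""
        pixels = set(quad)
        if pixels & self._in_flight:
            return False  # conflict: the caller must stall or retry later
        self._in_flight |= pixels
        return True

    def retire(self, quad: Iterable[Pixel]) -> None:
        """Release the locations of a quad whose processing has completed."""
        self._in_flight -= set(quad)


tracker = HazardTracker()
first = [(0, 0), (1, 0), (0, 1), (1, 1)]
second = [(2, 0), (3, 0), (2, 1), (3, 1)]
overlapping = [(0, 0), (1, 0), (0, 1), (1, 1)]  # same locations as `first`

issued_first = tracker.try_issue(first)
issued_second = tracker.try_issue(second)
issued_overlap = tracker.try_issue(overlapping)  # blocked by `first`
tracker.retire(first)
issued_after_retire = tracker.try_issue(overlapping)
```

Because the check is repeated for every quad received, its cost scales with the number of quads in the stream, which is the overhead the passage above refers to.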
Moreover, pixel shading typically requires memory accesses to read and manipulate data. The memory access operations may be inefficient unless there is a high degree of memory coherency. That is, memory requests are ideally issued simultaneously to contiguous regions of memory in a manner that fully utilizes the capacity of the memory interface. For example, a pixel shading architecture with a high degree of memory coherency may issue memory requests in parallel to a contiguous region of memory on a memory page to improve performance. However, in some pixel shader architectures the stream-like manner in which pixel quads are processed makes it difficult to organize memory accesses with a high degree of memory coherency.
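The effect of memory coherency can be illustrated with a rough sketch that counts how many memory pages a batch of pixel accesses touches. The linear frame buffer layout, the 4 bytes per pixel, the 4096-byte page size, and the function names `page_of` and `pages_touched` are all assumptions chosen for illustration and do not describe any particular memory system.

```python
from typing import Iterable, Set, Tuple

Pixel = Tuple[int, int]


def page_of(x: int, y: int, width: int,
            bytes_per_pixel: int = 4, page_size: int = 4096) -> int:
    """Page index of pixel (x, y) in an assumed row-major linear frame buffer."""
    address = (y * width + x) * bytes_per_pixel
    return address // page_size


def pages_touched(pixels: Iterable[Pixel], width: int) -> Set[int]:
    """Set of distinct pages accessed by a batch of pixel requests."""
    return {page_of(x, y, width) for (x, y) in pixels}


WIDTH = 1024  # with 4 bytes per pixel, one row fills exactly one 4096-byte page

# Coherent batch: eight neighboring pixels in the same row.
coherent = [(x, 0) for x in range(8)]
# Incoherent batch: eight pixels spread down a column, one per row.
scattered = [(0, y) for y in range(8)]

coherent_pages = len(pages_touched(coherent, WIDTH))
scattered_pages = len(pages_touched(scattered, WIDTH))
```

Under these assumptions the coherent batch stays within a single page while the scattered batch touches a different page for every request, which is the kind of access pattern that a stream of unrelated quads may produce.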
In light of the above-described problems, the apparatus and method of the present invention were developed.