3D graphics rendering has been implemented extensively in a variety of hardware (HW) architectures over the past few decades. With the advent of standardized rendering application programming interfaces (APIs) such as OpenGL and DirectX, a similar macro architectural structure has begun to emerge. The details and performance of any particular graphics HW architecture often hinge upon how quickly that architecture can retrieve and filter two or more textures when shading a given pixel or block of pixel fragments. For instance, some modern graphics architectures include eight or more pixel processing units to handle pixel shading and thus may require processing large numbers of texture memory access and filtering operations in a substantially parallel manner.
One way to minimize bandwidth usage during texture sampling is to process multiple, contiguous pixel fragments simultaneously as pixel fragment blocks. Because contiguous pixel fragments are more likely to sample the same texture coordinates, the use of fragment block processing may significantly reduce bandwidth requirements by avoiding redundant texture memory accesses. However, typical fragment block processing still requires significant bandwidth capacity for conveying the coordinates of all the block's fragments to the texture sampling unit.
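The bandwidth savings described above can be illustrated with a minimal sketch. The function names, block layout, and texel coordinates below are hypothetical, chosen only to show the principle: when contiguous fragments in a block resolve to overlapping texel coordinates, fetching each unique texel once per block avoids the redundant texture memory accesses that per-fragment fetching would incur.

```python
# Hypothetical sketch of fragment-block texture fetch deduplication.
# Each "block" is a list of the nearest-neighbor texel coordinates (u, v)
# sampled by four contiguous pixel fragments (e.g., a 2x2 quad).

def naive_fetch_count(blocks):
    """One texture memory access per fragment, ignoring overlap."""
    return sum(len(block) for block in blocks)

def deduplicated_fetch_count(blocks):
    """One texture memory access per unique texel within each block."""
    return sum(len(set(block)) for block in blocks)

# Contiguous fragments frequently map to the same texel, so each
# example block below contains only two unique coordinates.
blocks = [
    [(10, 4), (10, 4), (10, 5), (10, 5)],
    [(7, 7), (7, 7), (7, 7), (8, 7)],
]

print(naive_fetch_count(blocks))         # 8 fetches without deduplication
print(deduplicated_fetch_count(blocks))  # 4 fetches with per-block dedup
```

Note that even with this deduplication, the coordinates of every fragment in the block must still be conveyed to the sampling logic, which is the residual bandwidth cost the passage above identifies.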