1. Field of the Invention
Embodiments of the present invention relate generally to a parallel and pipelined graphics architecture and, more specifically, to a high-performance crossbar in a graphics pipeline.
2. Description of the Related Art
Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
A graphics system generally adopts a highly parallel and pipelined architecture to meet the ever-increasing demands for realism, quality, and real-time interactivity of displayed images and videos. FIG. 1A is a conceptual diagram of a graphics rendering pipeline 100. Geometry processing block 102 receives geometric primitives, typically triangles, from a graphics application and conducts geometric transforms as specified by the graphics application. The output of geometry processing block 102 includes triangles transformed and projected onto a two-dimensional surface, referred to as “screen space,” corresponding to a window on the viewer's screen. The geometric primitives in screen space emitted by geometry processing block 102 are decomposed by rasterization block 104 into fragments, corresponding to screen space pixels that are at least partially covered by the geometric primitives. Additionally, rasterization block 104 determines the screen space coverage and alignment of each geometric primitive with respect to memory tiles, each of which refers to a contiguous span of memory within a certain partition of frame buffer 110. Shader 106 receives fragments from rasterization block 104 and processes the fragments according to shading instructions specified by the graphics application or otherwise. The processed fragments are transmitted to Raster OPerations (“ROP”) block 108 for further processing. ROP block 108 conducts any depth and stencil testing on the shaded pixels, as specified by the graphics application. Pixels surviving depth and stencil testing are written to frame buffer 110. Video refresh block 112 then scans out the data stored in frame buffer 110 to a display device.
To determine the final surface properties of an object or image, shader 106 performs functions that include texture mapping and texture blending. In one implementation, shader 106 may include multiple texture processing clusters (“TPC”) operating in parallel, and ROP block 108 may also include multiple ROP units operating in parallel. Each of the TPCs generally retrieves and combines appropriate texels with interpolated color values and directs transaction requests corresponding to the shaded pixels to the ROP units. Each ROP unit corresponds to a particular partition in frame buffer 110. For M TPCs to transfer data to N ROP units efficiently, one approach is to use the crossbar architecture of FIG. 1B to route data from any one of the M TPCs to any one of the N ROP units. As an illustration, suppose TPC 1 intends to send a processed fragment to ROP unit 1, because the memory tiles associated with this fragment reside in the frame buffer partition that corresponds to ROP unit 1. TPC 1 sends the transaction request corresponding to the processed fragment to crossbar 150, and crossbar 150 arbitrates among this request and the transaction requests generated by other TPCs and routes the data to ROP unit 1.
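By way of a non-limiting illustration, the arbitration described above may be modeled for a single clock cycle with the following Python sketch. The function name `arbitrate`, the representation of requests as a mapping from TPC index to ROP unit index, and the fixed lowest-index priority rule are assumptions made for illustration only and are not taken from FIG. 1B.

```python
def arbitrate(requests):
    """Model one clock cycle of an M-by-N crossbar (illustrative only).

    requests maps a TPC index to the ROP unit index it wants to reach.
    Returns (granted, blocked): granted maps each requested ROP unit to
    the single TPC it serves this cycle; blocked lists TPCs that must wait.
    Assumption: the lowest-numbered TPC wins any conflict.
    """
    granted = {}
    blocked = []
    for tpc in sorted(requests):
        rop = requests[tpc]
        if rop in granted:
            # Another TPC already owns this ROP unit for the cycle.
            blocked.append(tpc)
        else:
            granted[rop] = tpc
    return granted, blocked


# TPC 1 and TPC 2 both target ROP unit 1; TPC 3 targets ROP unit 2.
granted, blocked = arbitrate({1: 1, 2: 1, 3: 2})
print(granted)  # {1: 1, 2: 3}
print(blocked)  # [2]
```

In this sketch, a request is routed only if its destination ROP unit has not already been claimed in the same cycle; any other policy, such as round-robin priority, could be substituted without changing the overall structure.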
One problem occurs when two or more TPCs transmit requests to send data to the same ROP unit. Suppose TPC 1 and TPC 2 both transmit requests to send data to ROP unit 1. Crossbar 150 is configured to service only one of these two requests and block the other. This act of blocking in effect generates a stall at the input of crossbar 150 and consequently impedes the processing of the subsequent stages of graphics rendering pipeline 100 of FIG. 1A. Another problem occurs when work is unevenly distributed among the ROP units. For example, if all of the M TPCs attempt to send data to ROP unit 1 only, then ROP unit 2 to ROP unit N are potentially idle for M clock cycles. Each idling clock cycle for a ROP unit is also referred to as a “bubble.” The more bubbles that exist in a parallel system, the less fully the resources in the system are utilized, which degrades the overall performance of the system.
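The effect of uneven distribution may be illustrated with the following Python sketch, which counts bubbles while a set of requests drains through the crossbar. The function name `count_bubbles`, the assumption that each TPC holds exactly one pending request, and the assumption that each ROP unit accepts at most one request per clock cycle are hypothetical simplifications introduced for illustration.

```python
from collections import Counter


def count_bubbles(destinations, num_rops):
    """Count idle ROP-unit cycles until all requests are delivered.

    destinations[i] is the ROP unit (1-based) targeted by the single
    request of TPC i+1. Assumption: each ROP unit accepts at most one
    request per clock cycle. Returns (cycles, bubbles).
    """
    pending = Counter(destinations)
    cycles = 0
    bubbles = 0
    while sum(pending.values()) > 0:
        busy = 0
        for rop in range(1, num_rops + 1):
            if pending[rop] > 0:
                pending[rop] -= 1  # this ROP unit accepts one request
                busy += 1
        bubbles += num_rops - busy  # every ROP unit without work idles
        cycles += 1
    return cycles, bubbles


# Four TPCs all target ROP unit 1 in a system with four ROP units.
print(count_bubbles([1, 1, 1, 1], 4))  # (4, 12)
# The same four requests spread evenly across the four ROP units.
print(count_bubbles([1, 2, 3, 4], 4))  # (1, 0)
```

Under these assumptions, the all-to-one case with M = N = 4 takes four clock cycles and leaves ROP unit 2 through ROP unit 4 idle in each of them, whereas the evenly distributed case completes in one clock cycle with no bubbles.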
As the foregoing illustrates, what is needed is an improved crossbar architecture that addresses one or more of the aforementioned performance problems.