1. Field of the Invention
Embodiments of the present invention relate generally to a parallel rendering system and more specifically to a method and system for improving data coherency in such a system.
2. Description of the Related Art
Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
A graphics system generally adopts a highly parallel architecture to meet the ever-increasing processing demands for realism, quality, and real-time interactivity of displayed images and videos. FIG. 1A is a simplified conceptual diagram of a parallel rendering system 100. Specifically, a setup unit 101 receives geometric primitives from a geometry pipeline. Setup unit 101 determines the interpolation parameters for each geometric primitive that it receives. Then setup unit 101 directs a stream of the transformed primitives to coarse raster 102, which determines which pixel tiles may have contributions from any of these transformed primitives. Coarse raster 102 also directs these identified pixel tiles, containing 16 by 16 pixels in one implementation, via distributor 104 to any of fine raster 0 to fine raster n. Each fine raster is associated with a shader engine and is responsible for computing a fine raster transaction for a subset of pixels (e.g., 4 by 4 samples in one implementation) within each pixel tile at least partially covered by a geometric primitive. Each of the n shader engines is screen mapped. In other words, each shader engine is assigned a rectangular pixel tile in the screen space (16 by 16 pixels in one implementation) in an interleaved pattern. All geometric primitives that hit a particular pixel tile are processed by the shader engine assigned to that pixel tile.
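By way of illustration only, the screen mapping described above can be sketched as follows. The 16 by 16 tile size comes from the implementation mentioned in the text; the diagonal interleave formula and the names `tile_coords` and `shader_engine_for_pixel` are assumptions made for this sketch, since the text states only that tiles are assigned in an interleaved pattern.

```python
# Illustrative sketch only; the interleave formula is an assumption.
TILE_SIZE = 16  # pixels per tile edge, per the implementation described above


def tile_coords(x, y):
    """Return the (column, row) of the pixel tile containing pixel (x, y)."""
    return x // TILE_SIZE, y // TILE_SIZE


def shader_engine_for_pixel(x, y, num_engines):
    """Return the index of the shader engine assumed to own pixel (x, y).

    Hypothetical interleave: engines are assigned along diagonals of the
    tile grid, so with two or more engines, horizontally and vertically
    adjacent tiles go to different engines.
    """
    tile_x, tile_y = tile_coords(x, y)
    return (tile_x + tile_y) % num_engines
```

Under this assumed pattern, every pixel of a given tile resolves to the same engine, so all primitives that hit that tile are handled by one shader engine, as the text requires.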
In addition, each of the n shader engines receives the fine raster transactions from its associated fine raster (e.g., shader engine 0 is associated with fine raster 0). Then each shader engine computes the color values for each of the covered pixels described by the fine raster transaction, performs texture fetches via a texture crossbar (T-crossbar), such as T-crossbar 110, and directs the shaded pixels to a crossbar for Z data (Z-crossbar), such as Z-crossbar 106, and a color crossbar (C-crossbar), such as C-crossbar 108. More specifically, T-crossbar 110 arbitrates among the parallel streams of texture requests from the multiple shader engines and directs the selected requests to a frame buffer, such as FB0. Conversely, T-crossbar 110 may also direct the retrieved texels from FB0 to the requesting shader engine, such as shader engine 0. After performing functions such as texture mapping and texture blending to determine the final surface properties, the shader engines direct the shaded pixel fragments associated with the fine raster transactions to Raster OPerations (“ROP”) units (e.g., ROP0, ROP1, and ROPn) via Z-crossbar 106 and C-crossbar 108 for further processing. Each ROP unit conducts depth and stencil testing on the shaded pixels, as specified by the graphics application, and writes the pixels surviving depth and stencil testing to its corresponding frame buffer (e.g., FB0 corresponds to ROP0). It should be noted that each of the aforementioned frame buffers maps to a contiguous span of memory. Also, the various crossbars discussed above (e.g., T-crossbar 110, Z-crossbar 106, and C-crossbar 108) could be implemented in a unified and shared crossbar.
The overall performance of parallel rendering system 100 described above degrades significantly if the frame buffers are accessed inefficiently. For instance, if the shader engines simultaneously issue requests to access locations in a particular memory bank of a frame buffer that map to different pages in that memory bank, then multiple pages need to be accessed to fulfill such requests. Because each page needs to be pre-charged before it is used, fulfilling these simultaneous requests to access the multiple pages usually introduces undesirable delays, each of which may equal the amount of time needed to activate and pre-charge a page. Furthermore, because fetched texels are typically cached in parallel rendering system 100, retrieving texels from different pages often leads to cache misses and results in excessive cache thrashing.
Inefficient accesses to the frame buffers can be attributed to having output streams from the parallel shader engines with poor spatial and temporal locality. In particular, shader engines issue memory requests for texture reads, Z reads and/or writes, and color reads and/or writes as processing proceeds. Statistically, when a shader engine begins to operate out of phase relative to the other shader engines, meaning that the shader engine starts finishing work at a different time than the other shader engines, the shader engine is more likely to operate on areas of the screen space that are far from the other shader engines. Such screen space “divergence” can cause the different shader engines to access different frame buffer pages at the same time, leading to the type of inefficient frame buffer accesses described above. In addition, multiple shader engines may request the same frame buffer page but at different times. The time lapse between the requests may cause the page to be activated and pre-charged again, also resulting in inefficient frame buffer accesses.
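The effect of such divergence can be illustrated with a small model. The sketch below is not drawn from the system described above; it assumes a single memory bank that keeps one page open at a time, two shader engines that each touch pages 0, 1, 2, ... in order at one request per time step, and a hypothetical `skew` parameter giving how many steps the second engine lags the first. The function names are likewise hypothetical.

```python
# Illustrative model only: one bank, one open page, two shader engines.


def merged_page_sequence(num_pages, skew):
    """Return the page order seen by the bank when two engines interleave.

    Each engine requests pages 0..num_pages-1 in order, one per time step.
    Engine 1 starts `skew` steps after engine 0 (assumed parameter).
    """
    events = []
    for engine, start in ((0, 0), (1, skew)):
        for page in range(num_pages):
            events.append((start + page, engine, page))  # (time, engine, page)
    events.sort()  # the bank services requests in time order
    return [page for _, _, page in events]


def count_activations(pages):
    """Count how many times the bank must open a page it does not have open."""
    open_page = None
    activations = 0
    for page in pages:
        if page != open_page:
            activations += 1
            open_page = page
    return activations


in_phase = count_activations(merged_page_sequence(4, skew=0))   # 4 activations
diverged = count_activations(merged_page_sequence(4, skew=2))   # 8 activations
```

In this model, two engines running in phase share each page activation, while a skew of two steps causes every request to reopen a page, doubling the number of activations for the same amount of work.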
As an illustration, FIG. 1B shows a display screen divided into pixel tiles that map to shader engines and pixel tiles that map to frame buffer partitions. In one implementation, each of the shader engines processes 16 by 16 pixel tiles of display screen 130, and each of the frame buffer partitions in memory system 160 stores 16 by 4 pixel tiles of display screen 130. Also, a page boundary 162 in memory system 160 separates page 164 from page 166. Here, pixel tile 132 maps to four different frame buffer partitions, namely FB0, FB1, FB2, and FB3 in page 164, and pixel tile 138 maps to the same four frame buffer partitions, FB0, FB1, FB2, and FB3, but in page 166. Suppose shader engine 0 is responsible for pixel tiles 132 and 136, and shader engine 1 is responsible for pixel tiles 134 and 138. Because successive tiles processed by each shader engine are spatially far apart in display screen 130, they are likely to access different pages in the respective FBs, such as pages 164 and 166. Each of the FBs, such as FB0, receives requests from multiple shader engines, which can have various skews relative to each other. For example, if shader engine 0 and shader engine 1 finish their respective operations at different times for pixel tiles 140 and 148, they may intermingle accesses to different pages in FB0. As mentioned above, intermingling accesses to different pages in a single frame buffer partition results in inefficient frame buffer accesses. Without a corrective mechanism, shader engine 0 may move even further out of synchronization with the other shader engines and worsen the aforementioned problem of inefficient accesses to the frame buffers.
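The relationship between a 16 by 16 shader tile and the 16 by 4 frame buffer tiles can be sketched as follows. The tile sizes and the count of four partitions come from the example above; the assumptions that each frame buffer tile is 16 pixels wide and 4 pixels high, that these strips are stacked vertically, and that they are assigned to FB0 through FB3 in round-robin order are made only for this sketch, as is the name `partitions_for_tile`.

```python
# Illustrative sketch only; the strip orientation and round-robin
# assignment to partitions are assumptions, not taken from FIG. 1B.
SHADER_TILE = 16    # shader tile edge in pixels
STRIP_HEIGHT = 4    # assumed height of a 16 by 4 frame buffer tile
NUM_PARTITIONS = 4  # FB0 through FB3 in the example


def partitions_for_tile(tile_row):
    """Return the sorted partition indices touched by a shader tile.

    `tile_row` is the row of the shader tile in the tile grid. Each pixel
    row of the tile falls into one 4-pixel-high strip, and strips are
    assumed to cycle through the partitions from top to bottom.
    """
    first_pixel_row = tile_row * SHADER_TILE
    touched = set()
    for dy in range(SHADER_TILE):
        strip = (first_pixel_row + dy) // STRIP_HEIGHT
        touched.add(strip % NUM_PARTITIONS)
    return sorted(touched)
```

Under these assumptions every shader tile touches all four partitions, which is why skewed shader engines working on distant tiles can each send requests for different pages to the same partition, such as FB0.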
As the foregoing illustrates, what is needed in the art is a technique to counter the possible divergence among the shader engines to increase the likelihood of efficient accesses of the frame buffers in a parallel rendering system.