Field of the Invention
Embodiments of the present invention relate generally to computer processing and, more specifically, to controlling multi-pass rendering sequences in a cache tiling architecture.
Description of the Related Art
Some parallel processing systems implement a tiling architecture in which a render target is divided into cache tiles. Work received by such a parallel processing subsystem is rearranged such that the work is processed in cache tile order, as opposed to primitive order. More specifically, work associated with a first cache tile is processed first, then work associated with a second cache tile is processed, then work associated with a third cache tile, and so forth. In some implementations, data associated with cache tiles is maintained in an on-chip cache memory while the cache tiles are being processed, which reduces the amount of traffic between the on-chip cache memory and the off-chip frame buffer. Such an approach reduces memory bandwidth utilization and the associated power consumption.
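The reordering from primitive order into cache tile order can be illustrated with a minimal sketch. The function name, the use of axis-aligned bounding boxes to decide which tiles a primitive touches, and the row-major tile ordering are all assumptions made for illustration and are not taken from any particular hardware implementation.

```python
from collections import defaultdict


def bin_by_cache_tile(primitives, tile_w, tile_h):
    """Group primitive indices by the cache tiles they overlap.

    primitives: list of (x_min, y_min, x_max, y_max) bounding boxes in
    pixels, with inclusive bounds (a hypothetical input format).
    Returns a list of ((tile_x, tile_y), [primitive indices]) pairs in
    row-major tile order, which is an assumed processing order.
    """
    bins = defaultdict(list)
    for index, (x0, y0, x1, y1) in enumerate(primitives):
        # A primitive is replayed once for every tile its bounds touch.
        for ty in range(y0 // tile_h, y1 // tile_h + 1):
            for tx in range(x0 // tile_w, x1 // tile_w + 1):
                bins[(tx, ty)].append(index)
    # Sort tiles row by row; within a tile, primitive order is preserved.
    return sorted(bins.items(), key=lambda item: (item[0][1], item[0][0]))
```

In this sketch, all work for one tile is emitted before any work for the next tile, so the data for the tile being processed can stay resident in the on-chip cache memory until the tile is finished.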
Such parallel processing systems typically execute work one “pass” at a time, where a pass includes the work involved in rendering to one render target, and a change in the render target designates a boundary between passes. For example, in a first pass, the parallel processing subsystem would render to a first render target—in cache tile order—using the cache memory and then write the final contents of the first render target to the frame buffer. Subsequently, in a second pass, the parallel processing subsystem would render to a second render target, again, in cache tile order and using the cache memory, and then write the final contents of the second render target to the frame buffer.
While parallel processing systems that implement such a “single-pass” tiling technique may optimally execute single-pass rendering tasks, such parallel processing systems do not effectively execute “multi-pass” rendering sequences across multiple render targets. More specifically, when performing a multi-pass rendering sequence, intermediate data can be generated during one execution pass that is necessary for at least one subsequent execution pass in the multi-pass rendering sequence. Such intermediate data is typically written to an intermediate render target. Consequently, in parallel processing systems that implement single-pass tiling, the intermediate data is written to the frame buffer before the next pass in the multi-pass rendering sequence executes. However, because this data is needed for at least one subsequent execution pass in the multi-pass rendering sequence, maintaining the data in the cache memory, rather than writing the data to the frame buffer, would increase performance and reduce power consumption.
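The cost of writing intermediate data to the frame buffer can be made concrete with a rough, hypothetical traffic model for a two-pass sequence. The model assumes that the intermediate and final render targets have the same dimensions, that flushed intermediate data is read back exactly once by the second pass, and that no compression is applied; none of these assumptions come from the systems described above.

```python
def frame_buffer_traffic(width, height, bytes_per_pixel,
                         keep_intermediate_in_cache):
    """Estimate frame buffer traffic, in bytes, for a two-pass sequence.

    Hypothetical model: one intermediate render target and one final
    render target of identical size, with no compression.
    """
    target_bytes = width * height * bytes_per_pixel
    # The final render target is always written to the frame buffer.
    traffic = target_bytes
    if not keep_intermediate_in_cache:
        # Single-pass tiling: the intermediate target is written out after
        # the first pass and read back during the second pass.
        traffic += 2 * target_bytes
    return traffic
```

Under these assumptions, a 1920 by 1080 target at 4 bytes per pixel generates three times as much frame buffer traffic when the intermediate data is flushed between passes as when the intermediate data remains in the cache memory.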
For these reasons, some parallel processing systems optimize the timing involved with storing data in local cache memory and storing data in the frame buffer for limited types of multi-pass rendering sequences based on a pre-defined set of render target operations. In particular, because the data dependencies between passes in certain multi-pass rendering sequences can be pre-determined, the parallel processing system can implement specific techniques to ensure that the data required to perform the rendering operations for any given rendering pass in the multi-pass sequence is available in the cache memory instead of the frame buffer.
For example, to implement a multi-sample resolve operation, the parallel processing system could render to a multi-sample surface in the cache memory and to a single-sample surface in the frame buffer. For a given tile, the parallel processing system could render to the multi-sample surface in a region associated with the given tile. The parallel processing system would then copy rendered data from the cache memory to the frame buffer to resolve to the single-sample surface in the region. Finally, the parallel processing system could invalidate the multi-sample surface data in the region, thereby ensuring that the invalidated data is not written to the frame buffer between passes.
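The per-tile sequence in this example (render, resolve, invalidate) can be sketched as follows. The dictionaries standing in for the cache memory and the frame buffer, the function and parameter names, and the use of a simple average as the resolve filter are assumptions for illustration only.

```python
def render_and_resolve(tiles, render_multisample_tile):
    """Render, resolve, and invalidate one cache tile at a time.

    tiles: iterable of hashable tile identifiers.
    render_multisample_tile: hypothetical callback that returns, for a
    tile, a list of pixels where each pixel is a list of sample values.
    Returns (frame_buffer, cache) so the caller can inspect both.
    """
    frame_buffer = {}  # off-chip: holds only the single-sample surface
    cache = {}         # on-chip: holds multi-sample data for the current tile
    for tile in tiles:
        # 1. Render to the multi-sample surface in the region for this tile.
        cache[tile] = render_multisample_tile(tile)
        # 2. Resolve: copy from the cache to the single-sample surface
        #    (averaging the samples is an assumed resolve filter).
        frame_buffer[tile] = [sum(samples) / len(samples)
                              for samples in cache[tile]]
        # 3. Invalidate the multi-sample data so that the data is never
        #    written to the frame buffer.
        del cache[tile]
    return frame_buffer, cache
```

In this sketch the multi-sample data exists only while its tile is being processed, so the only data that reaches the frame buffer is the resolved single-sample surface.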
One drawback to implementing such pre-defined sets of render target operations is that software applications often create multi-pass rendering sequences during execution to implement a variety of rendering tasks, such as texture mapping tasks. Because such multi-pass sequences are unknown prior to execution, properly capturing such a sequence with a pre-defined set of render target operations is quite difficult, and attempts to do so usually result in errors. Notably, if the pre-defined set of render target operations does not properly reflect the data dependencies, then the parallel processing system typically resorts to writing intermediate data out to the frame buffer after the first pass, unnecessarily consuming memory bandwidth and power.
As the foregoing illustrates, what is needed in the art is a more effective approach to performing multi-pass rendering sequences in a cache tiling architecture.