Some background on the fundamental concepts underlying the following ideas is beneficial.
Overdraw
Rendering synthetic 3D images commonly employs an algorithm for removing occluded or hidden surfaces. In the example illustrated in FIG. 1A, the yellow sphere obscures a portion of the blue sphere. When using a depth buffer style algorithm, the pixels that are nearest to the viewer, at the time the geometry is processed, update the final image. For example, if the geometry for the blue sphere is delivered prior to the yellow sphere, all of the pixels associated with the blue sphere will be written to the image, a portion of which will subsequently be overwritten by the yellow sphere. This extra work, which does not contribute to the final image, is called overdraw.
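The effect of submission order on overdraw can be illustrated with a minimal sketch. It assumes a one-dimensional scanline in place of a full image, and the names `render` and `span` are hypothetical helpers introduced only for this illustration. The blue span is farther from the viewer and the yellow span is nearer, mirroring the spheres of FIG. 1A.

```python
def render(fragments, width):
    """Depth-buffer a 1D scanline; return (image, writes, overdraw)."""
    depth = [float("inf")] * width   # nearest depth seen so far per pixel
    image = [None] * width           # color currently stored per pixel
    writes = 0
    for x, z, color in fragments:
        if z < depth[x]:             # fragment is nearer than the stored one
            depth[x] = z
            image[x] = color
            writes += 1
    covered = sum(1 for c in image if c is not None)
    # Every write beyond one per covered pixel did not reach the final image.
    return image, writes, writes - covered


def span(x0, x1, z, color):
    """Hypothetical helper: fragments for pixels x0..x1-1 at constant depth z."""
    return [(x, z, color) for x in range(x0, x1)]


blue = span(0, 6, 2.0, "blue")        # farther object
yellow = span(4, 10, 1.0, "yellow")   # nearer object, overlaps pixels 4 and 5

image_a, writes_a, overdraw_a = render(blue + yellow, 10)   # blue delivered first
image_b, writes_b, overdraw_b = render(yellow + blue, 10)   # yellow delivered first
```

Both orders produce the same image, but delivering the blue span first writes two pixels that the yellow span later overwrites, whereas delivering the yellow span first writes each visible pixel exactly once.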
Rendering Intermediate Images
The process of rendering synthetic images also involves producing intermediate images, which are commonly used to represent various dynamic illumination effects, for example shadows, reflections, specular lighting, and other contributions from global illumination models (see FIGS. 1A-1C). These effects are typically approximated using preliminary rendering passes, which capture illumination properties. The process of generating each of these intermediate images is known as a “render pass”. These images are then used as texture maps in subsequent render passes, as indicated in FIGS. 1B and 1C. This is accomplished by rendering each such pass in the order presented by the application. This sequence is called Forward Rendering.
For example, consider shadow maps or projective shadowing, whereby shadows are added to a 3D synthetic image. If one looks out from a source of light, all of the objects seen will appear in light. Anything behind those objects, however, will be in shadow. This is the basic principle used to create a shadow map. The light's view is rendered into an intermediate image (the shadow map), which stores the distance between the light source and every surface it sees. Later, the scene is rendered normally, comparing the depth of every point drawn (as if it were being seen by the light, rather than the eye) to this shadow map. The images may also be referenced in a recursive manner, as seen by the Processed Shadow Map example in the previous figure.
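The shadow map comparison described above can be sketched as follows. This is a simplified illustration under stated assumptions: the light's view is reduced to a one-dimensional row of texels, each surface is given directly as a (texel, depth from light) pair, and the function names and the `bias` parameter are hypothetical.

```python
def build_shadow_map(surfaces, size):
    """Light pass: keep the nearest depth seen by the light at each texel."""
    shadow_map = [float("inf")] * size
    for texel, depth in surfaces:
        shadow_map[texel] = min(shadow_map[texel], depth)
    return shadow_map


def in_shadow(shadow_map, texel, depth, bias=1e-3):
    """Eye pass: a point is shadowed if something nearer to the light covers it.

    The small bias (an assumption of this sketch) keeps a surface from
    shadowing itself because of rounding in the stored depth.
    """
    return depth > shadow_map[texel] + bias


# A floor at depth 5.0 across three texels, with an occluder at depth 2.0
# in front of the middle texel, all measured from the light.
surfaces = [(0, 5.0), (1, 2.0), (1, 5.0), (2, 5.0)]
shadow_map = build_shadow_map(surfaces, 3)

floor_lit = in_shadow(shadow_map, 0, 5.0)        # nothing between light and floor
floor_shadowed = in_shadow(shadow_map, 1, 5.0)   # occluder is nearer to the light
occluder_lit = in_shadow(shadow_map, 1, 2.0)     # occluder itself is lit
```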
Planar reflections are also produced using an intermediate image. The entire scene is rendered inverted, generally with a wide field of view. During the final render pass, the inverted reflection map is blended with the shiny planar surfaces, as shown on the floor of the previous figure.
A tree or directed acyclic graph (DAG) of the dependencies between these images can be formulated, an example of which is shown in FIG. 2. Each of the intermediate images must be rendered in an order that meets these dependency requirements. Because these dynamically generated images will be sampled as texture maps in subsequent render passes at unpredictable locations, the entire image is rendered. Often a significant portion of the computation and power consumption required to render the final image is spent producing these intermediate images.
However, typically only a portion of these intermediate images contribute to the final image.
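An ordering of render passes that satisfies such a dependency graph can be obtained with a topological sort. The sketch below uses the `graphlib` module from the Python standard library (Python 3.9 or later). The pass names and their dependencies are illustrative assumptions and are not taken from FIG. 2.

```python
from graphlib import TopologicalSorter

# Each render pass maps to the set of intermediate images it samples.
# These names and edges are assumed for illustration only.
dependencies = {
    "final_image": {"processed_shadow_map", "reflection_map"},
    "processed_shadow_map": {"shadow_map"},
    "shadow_map": set(),
    "reflection_map": set(),
}

# static_order() yields every pass after all of the passes it depends on.
render_order = list(TopologicalSorter(dependencies).static_order())
```

Any order returned this way renders the shadow map before the processed shadow map, and renders the final image last. Note that the sort alone still renders every intermediate image in full.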
Computation and Power Saving Opportunities
Stream Sorting
An opportunity exists to reduce computation and power by eliminating the overdraw work associated with occluded pixels. Analysis indicates that 15-20% of the computation in a common benchmark is extraneous overdraw. Traditional approaches to addressing this issue involve performing a depth sort prior to processing the pixels, so that only the visible pixels are actually processed. However, relatively large structures are required to store the unsorted pixels until all of the geometry that may influence the pixels is processed. When hardware is accelerating the depth sort process, these special purpose structures are integrated into the architecture and require significant area on the chip. The size of these structures is directly related to the quantity of geometry used to describe the scene. Hardware designers study the expected workloads for their device in order to select the optimal size of the structures.
Various forms of tile-based rendering are typically used in order to further reduce the size of these structures and provide the added benefit of significantly reducing memory bandwidth. Because the image is spatially partitioned into small rectangular regions, known as tiles, and further because only the geometry that may affect this portion of the image is processed, the size of the depth sorting structures may be reduced. When possible, the contents of the tile sized depth buffer and sorting structures are never written into memory, thus saving substantial memory activity.
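The spatial partitioning described above can be sketched as a binning step that assigns each primitive to the tiles its bounding box overlaps. This is a minimal illustration, not a description of any particular hardware: the function name, the inclusive pixel bounding boxes, and the 32-pixel tile size are all assumptions.

```python
def bin_primitives(bboxes, width, height, tile):
    """Return {(tile_x, tile_y): [primitive ids]} for a width x height image."""
    cols = (width + tile - 1) // tile
    rows = (height + tile - 1) // tile
    bins = {(tx, ty): [] for ty in range(rows) for tx in range(cols)}
    for prim_id, (x0, y0, x1, y1) in enumerate(bboxes):
        # Clamp the inclusive bounding box to the image; a box that lies
        # entirely off-screen produces empty ranges below and is skipped.
        x0, y0 = max(x0, 0), max(y0, 0)
        x1, y1 = min(x1, width - 1), min(y1, height - 1)
        for ty in range(y0 // tile, y1 // tile + 1):
            for tx in range(x0 // tile, x1 // tile + 1):
                bins[(tx, ty)].append(prim_id)
    return bins


# Three primitives on a 64x64 image split into four 32x32 tiles.
bboxes = [(0, 0, 10, 10), (20, 20, 40, 40), (50, 50, 60, 60)]
bins = bin_primitives(bboxes, 64, 64, 32)
largest_bin = max(len(ids) for ids in bins.values())
```

Each tile then needs depth sorting storage only for the primitives in its own bin, which here is at most two of the three primitives in the scene.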
Reducing the size of these structures must be done with care, because under-sized structures result in significant performance and power penalties. If the capacity of the structures is exceeded, all of the processing that has been performed to that point is discarded and must be restarted using less efficient mechanisms. Attempts to avoid a catastrophic capacity failure, or to minimize the penalty of a failure, often involve synchronization and communication with software, which costs time and adds complexity. Because of a strong trend toward increasing geometry complexity, existing hardware solutions continue to grow these structures. They do not provide a graceful degradation of performance and power consumption under stressful conditions. A solution that attempts to gain the benefit of eliminating overdraw must also alleviate the existing penalties associated with increasingly complex content.
The “Stream Sorting” mechanism presented below attains the power saving benefits of minimized overdraw, while eliminating the special purpose depth sorting structures and providing a graceful degradation of performance under stressful geometry loads.
Reverse Rendering
Opportunity also exists for significant power savings in the rendering of intermediate images. Rarely does all of the information generated in the intermediate images contribute to the final image. This is primarily due to the fact that the intermediate images are sparsely sampled. Consider the sampling charts of three shadow maps used in a popular benchmark, shown in FIGS. 3a-3c. When sampling is quantized to 4×4 blocks of pixels, 13%, 55%, and 32% of the respective shadow maps contribute to the final image. Eliminating the extraneous work required to render the unused portions of the intermediate images provides an opportunity to reduce, and potentially minimize, the overall computation and power consumption required to render the final image. In some cases, this excess work comprises 50% of the total computation.
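The block-quantized measurement described above can be sketched as follows. The function name and the sample coordinates are hypothetical; the values are chosen only to show the calculation and are not the benchmark data behind FIGS. 3a-3c.

```python
def block_coverage(samples, width, height, block=4):
    """Fraction of block x block regions of a map touched by at least one sample."""
    touched = {
        (x // block, y // block)
        for x, y in samples
        if 0 <= x < width and 0 <= y < height   # ignore out-of-range lookups
    }
    total_blocks = ((width + block - 1) // block) * ((height + block - 1) // block)
    return len(touched) / total_blocks


# A 16x16 map holds sixteen 4x4 blocks. The first two samples fall in the
# same block, so three distinct blocks are touched in total.
samples = [(0, 0), (1, 1), (5, 5), (15, 15)]
fraction = block_coverage(samples, 16, 16)
```

In this illustration, only the three touched blocks would need to be rendered, and the remaining thirteen blocks represent work that does not contribute to the final image.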
The mechanisms proposed below capitalize on this opportunity and are collectively referred to as “Reverse Rendering”. Reverse Rendering, built upon Stream Sorting, significantly reduces computation and power consumption in a device.
Graph Execution
Stream Sorting and Reverse Rendering both depend on the ability of the architecture to configure the connectivity of its operations in a manner that differs from a traditional rendering pipeline. These configurations are more generally termed a graph and apply to many more usage models than just those discussed here; for example, computer vision algorithms are particularly amenable to this capability, as can be seen in FIGS. 4A and 4B. An understanding of the approach employed by present architectures is helpful in understanding the significance of this invention.
Existing fixed rendering pipeline architectures simulate arbitrary graph execution by treating their existing fragment shader unit(s) as a single node in the graph. The device is programmed to perform a single operation. The input and output data associated with the processing node are stored in memory. In this manner, the entire graph can be simulated by programming the first node, processing all of the input data from memory, and writing the output data for that node to memory. The device is then reprogrammed for the second node's operations, the output data from the previous node's buffer is read back from memory, computations are performed for the second node, and the results are written to memory. This procedure continues until all nodes in the graph have been processed. This approach has two key disadvantages. First, the often-temporary data communicated between interior nodes is stored in memory, requiring significant power. Second, the CPU often must synchronize with the device in order to reprogram it for the next node's operations, costing time and power.
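The node-at-a-time procedure described above can be modeled with a short sketch that counts memory traffic and reprogramming steps. It assumes a simple linear chain of nodes rather than an arbitrary graph, represents memory as a dictionary of buffers, and uses hypothetical node names and operations.

```python
def run_graph_via_memory(nodes, source):
    """Run a linear chain of nodes one at a time, passing data through memory."""
    memory = {"input": list(source)}
    traffic = {"reads": 0, "writes": 0, "reprograms": 0}
    previous = "input"
    for name, operation in nodes:
        traffic["reprograms"] += 1            # device is reprogrammed per node
        data = memory[previous]               # read the previous node's buffer
        traffic["reads"] += len(data)
        result = [operation(value) for value in data]
        memory[name] = result                 # write this node's buffer
        traffic["writes"] += len(result)
        previous = name
    return memory[previous], traffic


# Three hypothetical nodes applied to four input values.
nodes = [
    ("scale", lambda v: v * 2),
    ("offset", lambda v: v + 1),
    ("clamp", lambda v: min(v, 6)),
]
output, traffic = run_graph_via_memory(nodes, [1, 2, 3, 4])
```

In this model, two of the three buffers written are temporary data exchanged between interior nodes, and each node incurs one reprogramming step, which corresponds to the two disadvantages noted above.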