1. Technical Field
The invention relates to the rendering of graphics in a computer environment. More particularly, the invention relates to a rendering pipeline system that renders graphical primitives displayed in a computer environment.
2. Description of the Prior Art
Graphical representations and user interfaces are no longer an optional feature but rather a requirement for computer applications. There is a pressing need to produce high performance, high quality, and low cost 3D graphics rendering pipelines because of this demand.
Some geometry processing units (e.g. general-purpose host processors or specialized dedicated geometry engines) process geometries in model space into geometries in screen space. Screen space geometries are a collection of geometric primitives represented by screen space vertices and their connectivity information. A screen space vertex typically contains screen x, y, z coordinates, multiple sets of colors, and multiple sets of texture attributes (including the homogeneous components), and possibly vertex normals. Referring to FIG. 1, the connectivity information is conveyed using basic primitives such as points, lines, triangles 101, or strip 102, or fan 103 forms of these basic primitives.
In a traditional architecture, raster or rasterization refers to the following process:
Given screen x and y positions as well as all other parameter values for all vertices of a primitive, perform parameter setup computation in the form of plane equations; scan convert the primitive into fragments based on screen x and y positions; and compute parameter values at these fragment locations.
Referring to FIG. 2, a traditional rendering pipeline is shown. Screen geometries 201 are rasterized 202. The shading process 203 is then performed on the graphics primitives. The z/alpha blending process 204 places the final output into the color/z frame buffer 205, which is destined for the video output 206. There is a serious concern with the memory bandwidth between the z/alpha-blending/pixel-op process 204 and the frame buffer in the memory 205. Consider z-buffering 100 Mpixels/s, assuming 4 bytes/pixel for RGBA color, 2 bytes/pixel for z, and 50% of the pixels actually being written into the frame buffer on average due to z-buffering. Every pixel requires a z read, and each pixel that passes the depth test requires a color write and a z write. The memory bandwidth is computed as follows:
100 Mpixels/s * (2 bytes + 50% * (4 bytes + 2 bytes))/pixel = 500 Mbytes/s
The equation assumes a hypothetical perfect prefetch of pixels from frame buffer memory into a local pixel cache without either page miss penalty or wasteful pixels.
The actual memory bandwidth is substantially higher because the read-modify-write cycle required for z-buffering cannot be implemented efficiently without a complicated pipeline and long delay. Alpha blending increases the bandwidth requirement even further. The number increases dramatically if full-scene anti-aliasing is performed. For example, 4-subsample multi-sampling roughly quadruples the frame buffer memory access bandwidth required by the z/alpha-blending/pixel-op engine 204, i.e. at least 2 Gbytes/s of memory bandwidth is required to do 4-subsample multi-sampling at 100 Mpixels/s. Full-scene anti-aliasing is extremely desirable for improving rendering quality; however, it is impractical to implement under a traditional rendering pipeline architecture unless either massive memory bandwidth is applied (e.g. through interleaving multiple processors/memories), which leads to a rapid increase in hardware cost, or pixel fill performance is compromised. Full-scene anti-aliasing also requires the frame buffer size to increase significantly, e.g. to quadruple in the case of 4-subsample multi-sampling.
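The bandwidth arithmetic above can be checked with a short calculation. The following is only an illustrative sketch: the function name and its parameters are hypothetical, and it models the same idealized case as the equation (perfect prefetch, no page misses, no alpha blending), with multi-sampling treated as a simple multiplier.

```python
def zbuffer_bandwidth_bytes_per_s(pixel_rate, color_bytes=4, z_bytes=2,
                                  write_fraction=0.5, subsamples=1):
    """Idealized frame buffer traffic for z-buffered rendering (hypothetical helper)."""
    # Every pixel reads z once; only the fraction that passes the depth
    # test writes color and z back to the frame buffer.
    bytes_per_pixel = z_bytes + write_fraction * (color_bytes + z_bytes)
    # Assumption: each subsample costs the same traffic as a full pixel.
    return pixel_rate * bytes_per_pixel * subsamples

base = zbuffer_bandwidth_bytes_per_s(100e6)
msaa4 = zbuffer_bandwidth_bytes_per_s(100e6, subsamples=4)
print(base / 1e6, "Mbytes/s")
print(msaa4 / 1e9, "Gbytes/s")
```

Under these assumptions the sketch reproduces the 500 Mbytes/s figure and the 2 Gbytes/s lower bound for 4-subsample multi-sampling.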
Another drawback of the traditional rendering pipeline is that all primitives, regardless of whether they are visible, are completely rasterized and the corresponding fragments are shaded. Considering a pixel fill rate of 400 Mpixels/s for non-anti-aliased geometries and assuming a screen resolution of 1280×1024 with a 30 Hz frame rate, the average depth complexity is about 10. Even if there is anti-aliasing, the average depth complexity is still between 6 and 7 for an average triangle size of 50 pixels. The traditional pipeline therefore wastes a large amount of time rasterizing and shading geometries that do not contribute to final pixel colors.
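The non-anti-aliased depth complexity figure follows from dividing the fill rate by the number of screen pixels drawn per second. A minimal sketch of that division is given below; the function name is hypothetical, and the sketch assumes the fill rate is fully used at a resolution of 1280 by 1024 pixels. It does not attempt to derive the anti-aliased figure.

```python
def average_depth_complexity(fill_rate, width, height, frame_rate):
    """Average number of fragments generated per screen pixel per frame."""
    return fill_rate / (width * height * frame_rate)

dc = average_depth_complexity(400e6, 1280, 1024, 30)
print(round(dc, 1))
```

The result is slightly above 10, which matches the rounded value stated above.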
There are other approaches which attempt to resolve these problems. With respect to memory bandwidth, two solutions exist. One approach is to use a more specialized memory design by either placing sophisticated logic on Dynamic Random Access Memory (DRAM) (e.g. customized memory chips such as 3DRAM) or placing a large amount of DRAM on logic. While this can alleviate the memory bandwidth problem to a large extent, it is not currently cost-effective because it lacks economies of scale. In addition, the frame buffer size in the memory grows dramatically for full-scene anti-aliasing.
The other alternative is to cache the frame buffer on-chip, which is also called virtual buffering. Only a portion of the frame buffer can be cached at any time because on-chip memory is limited. One type of virtual buffering uses the on-chip memory as a general pixel cache, i.e. a window into the frame buffer memory. Pixel caching can take advantage of spatial coherence; however, the same location of the screen might be cached in and out of the on-chip memory many times during a frame. Therefore, it uses very little intra-frame temporal coherence (in the form of depth complexity).
The only way to take advantage of intra-frame temporal coherence reliably is through screen space tiling (SST). First, all geometries are binned into tiles (also called screen subdivisions, which are based on screen locations). For example, with respect to FIG. 3, the screen 301 is partitioned into 16 square, disjoint tiles, numbered 1 302, 2 303, 3 304, up to 16 312. Four triangles a 313, b 314, c 315, and d 316 are binned as follows:
tile 5 306: a 313
tile 6 307: a 313, b 314, c 315
tile 7 308: c 315, d 316
tile 9 309: a 313
tile 10 310: a 313, b 314, c 315, d 316
tile 11 311: c 315, d 316
Secondly, the screen tiles are swept through, processing one tile's worth of geometry at a time using an on-chip tile frame buffer, producing the final pixel colors corresponding to the tile, and outputting them to the frame buffer. Here, the external frame buffer access bandwidth is limited to the final pixel color output. There is no difference in external memory bandwidth between non-anti-aliased and full-scene anti-aliased rendering. The memory footprint in the external frame buffer is identical regardless of whether full-scene anti-aliasing is used. There is effectively no external depth-buffer memory bandwidth, and the depth buffer need not exist in the external memory. The disadvantage is that extra screen space binning is introduced, which implies an extra frame of latency.
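The two SST steps, binning and then sweeping tile by tile, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function name, the 128 by 128 screen, the triangle coordinates, and the row-major, 1-based tile numbering are hypothetical and are not taken from FIG. 3. Binning here uses each triangle's bounding box, which is conservative and may list a triangle in a tile that it does not actually cover.

```python
from collections import defaultdict

def bin_triangles(triangles, screen_w, screen_h, tiles_x, tiles_y):
    """Map tile numbers (1-based, row-major) to names of triangles whose
    bounding box touches the tile."""
    tile_w = screen_w / tiles_x
    tile_h = screen_h / tiles_y
    bins = defaultdict(list)
    for name, verts in triangles.items():
        xs = [x for x, _ in verts]
        ys = [y for _, y in verts]
        # Range of tile columns and rows covered by the bounding box,
        # clamped to the screen.
        tx0 = max(0, int(min(xs) // tile_w))
        tx1 = min(tiles_x - 1, int(max(xs) // tile_w))
        ty0 = max(0, int(min(ys) // tile_h))
        ty1 = min(tiles_y - 1, int(max(ys) // tile_h))
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[ty * tiles_x + tx + 1].append(name)
    return dict(bins)

# Hypothetical screen-space triangles on a 128x128 screen with 4x4 tiles.
triangles = {
    "a": [(40, 40), (60, 40), (40, 90)],
    "b": [(50, 50), (90, 50), (70, 60)],
}
bins = bin_triangles(triangles, 128, 128, 4, 4)

# Sweep: each non-empty tile is processed on its own, so only one tile's
# worth of color/depth storage is needed at a time.
for tile in sorted(bins):
    print("tile", tile, "->", bins[tile])
```

Because each triangle is referenced from every tile it touches rather than being clipped to tile boundaries, primitives are not broken up in this sketch.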
Two main approaches exist with respect to depth complexity. One requires that geometries be sorted from front to back and rendered in that order, so that invisible fragments are not shaded.
The disadvantages of this first approach are: 1) spatial sorting needs to be performed off-line and thus works reliably only for static scenes, since dynamic scenes dramatically reduce its effectiveness; 2) front-to-back sorting requires depth priorities to be adjusted per frame by the application programs, which places a significant burden on the host processors; and 3) front-to-back sorting tends to break other forms of coherence, such as texture access coherence or shading coherence. Without front-to-back sorting, one-pass shading-after-z for random applications gives some improvement over the traditional rendering pipeline; however, performance improvement is not assured.
The other approach is deferred shading where: 1) primitives are fully rasterized and their fragments are depth-buffered with their surface attributes; and 2) the (partially) visible fragments left in the depth-buffer are shaded using the associated surface attributes when all geometries are processed at the end of a frame. This guarantees that only visible fragments are shaded.
The main disadvantages with this approach are: 1) deferred shading breaks shading coherence; 2) deferred shading requires full rasterization of all primitives, including invisible primitives and invisible fragments; 3) deferred shading requires shading all subsamples when multi-sample anti-aliasing is applied; and 4) deferred shading does not scale well with a varying number of surface attributes (because it has to handle the worst case).
It would be advantageous to provide a rendering pipeline system that lowers the system cost by reducing the memory bandwidth consumed by the rendering system. It would further be advantageous to provide an efficient rendering pipeline system that writes visible fragments once into the color buffer and retains coherence.
The invention provides a rendering pipeline system for a computer environment. The invention uses a rendering pipeline design that efficiently renders visible fragments by decoupling the scan conversion/depth buffer processing from the rasterization/shading process. It further provides a rendering pipeline system that reduces the memory bandwidth consumed by frame buffer accesses through screen space tiling. In the invention, raster or rasterization refers to the following process:
For each visible primitive, parameter setup computation is performed to generate plane equations. For each visible fragment of said visible primitive, parameter values are computed. Scan conversion is excluded from the rasterization process.
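As an illustration of parameter setup, the sketch below derives the plane equation p(x, y) = A*x + B*y + C for one parameter from its values at the three vertices of a triangle, and then evaluates it at a fragment location. The function names are hypothetical, and the sketch assumes plain screen-space linear interpolation; perspective correction using the homogeneous component is omitted.

```python
def plane_setup(v0, v1, v2):
    """Each vertex is (x, y, p). Returns (A, B, C) with p = A*x + B*y + C."""
    (x0, y0, p0), (x1, y1, p1), (x2, y2, p2) = v0, v1, v2
    dx1, dy1, dp1 = x1 - x0, y1 - y0, p1 - p0
    dx2, dy2, dp2 = x2 - x0, y2 - y0, p2 - p0
    det = dx1 * dy2 - dx2 * dy1  # twice the signed triangle area
    if det == 0:
        raise ValueError("degenerate triangle")
    # Cramer's rule on the two edge equations.
    a = (dp1 * dy2 - dp2 * dy1) / det
    b = (dx1 * dp2 - dx2 * dp1) / det
    c = p0 - a * x0 - b * y0
    return a, b, c

def evaluate(plane, x, y):
    """Parameter value at fragment location (x, y)."""
    a, b, c = plane
    return a * x + b * y + c

plane = plane_setup((0, 0, 1.0), (4, 0, 5.0), (0, 4, 9.0))
value = evaluate(plane, 1, 1)
print(plane, value)
```

Setup is done once per primitive and evaluation once per fragment, so skipping invisible primitives and fragments saves both kinds of work.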
The invention uses screen space tiling (SST) to eliminate the memory bandwidth bottleneck due to frame buffer access. Quality is also improved by using full-scene anti-aliasing. This is possible under SST because only on-chip memory corresponding to a single tile of the screen, as opposed to the full screen, is needed. A 32×32 tile anti-aliased frame buffer is easily implemented on-chip, and a larger tile size can later be accommodated. Additionally, the invention performs screen space tiling efficiently while avoiding the breaking up of primitives. The invention also reduces the buffering size through the use of single+buffering.
The invention uses a double-z scheme that decouples the scan conversion/depth-buffer processing from the more general rasterization and shading processing. The core of double-z is the scan/z engine, which externally looks like a fragment generator but internally resolves visibility. It allows the rest of the rendering pipeline to rasterize only visible primitives and shade only visible fragments. Consequently, the raster/shading rate is decoupled from the scan/z rate. The invention also allows both opaque and transparent geometries to work seamlessly under this framework.
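One way to picture a fragment generator that resolves visibility internally is a two-pass loop over scan-converted fragments, sketched below. This is only an illustrative reading of the double-z idea, not the scan/z engine itself: the function name and the fragment tuple layout are hypothetical, only opaque geometry is handled, and smaller z is assumed to be nearer.

```python
def scan_z(fragments):
    """fragments: list of (primitive_id, x, y, z) from scan conversion.
    Returns only the fragments that remain visible."""
    # Pass 1: depth only. Find the nearest z at every covered pixel.
    depth = {}
    for _, x, y, z in fragments:
        if z < depth.get((x, y), float("inf")):
            depth[(x, y)] = z
    # Pass 2: walk the fragments again and emit the one that matches the
    # final depth. Removing the entry emits each pixel at most once, even
    # when two fragments tie in z.
    visible = []
    for prim, x, y, z in fragments:
        if depth.get((x, y)) == z:
            visible.append((prim, x, y, z))
            del depth[(x, y)]
    return visible

fragments = [
    ("far", 0, 0, 0.9),
    ("hidden", 0, 0, 0.5),
    ("near", 0, 0, 0.2),
    ("far", 1, 0, 0.9),
]
visible = scan_z(fragments)
visible_primitives = {prim for prim, *_ in visible}
print(visible)
print(sorted(visible_primitives))
```

In this sketch the primitive named "hidden" produces no visible fragment, so a downstream raster/shading stage would never perform parameter setup or shading for it.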
The raster/shading engine is alternatively modified to take advantage of the reduced raster/shading requirements. Instead of using dedicated parameter computing units, one can share a generic parameter computing unit to process all parameters.