1. Field of the Invention
The present invention generally relates to the field of computer graphics systems. More particularly, the present invention relates to rasterization and fill rate within computer graphics systems.
2. Description of the Related Art
Modern graphics systems have been rapidly increasing their performance as the result of ever increasing clock speeds and higher levels of integration. Smaller geometries and higher clock frequencies have led to significant improvements in the number of triangles that may be rendered per frame, and the number of frames that may be rendered per second.
However, new applications such as three-dimensional (3D) modeling, virtual reality, and 3D computer games continue to demand even greater performance from graphics systems. Thus, designers have continued to improve performance throughout the entire graphics system pipeline to try to meet the performance needs of these new applications.
FIG. 1 illustrates one example of a generic graphics system, but numerous variations are possible and contemplated. As shown in the figure, initially graphics data is read from a computer system's main memory into the graphics system. The graphics data may include polygons, NURBS (Non-Uniform Rational B-Splines), sub-division surfaces, and other types of data. The various types of data are typically converted into triangles (i.e., three vertices each having at least position and color information). Then, transform and lighting calculation units 50 receive and process the triangles. Transform calculations typically include changing a triangle's coordinate axis, while lighting calculations typically determine what effect, if any, lighting has on the color of the triangle's vertices. The transformed and lit triangles are then conveyed to a clip test/back face culling unit 52 that determines which triangles are outside the current parameters for visibility (e.g., triangles that are off screen).
Next, the triangles that pass the clip test and back-face culling are translated into screen space 54. The screen space triangles are then forwarded to the set-up and draw processor 56 for rasterization. Rasterization typically refers to the process of generating actual pixels by interpolation from the vertices. In some cases samples are generated by the rasterization process instead of pixels. A pixel typically has a one-to-one correlation with the hardware pixels present in a display device, while samples are typically more numerous than the hardware elements and need not have any direct correlation to the display device. Regardless of whether pixels or samples are used, once drawn they are stored into a frame buffer 58.
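As an illustration of generating a pixel value by interpolation from the vertices, the following is a minimal sketch using barycentric weights. It is only one common way to interpolate and is not taken from FIG. 1; the function names `edge` and `interpolate` and the `(x, y, value)` vertex layout are assumptions made for this example.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area term: cross product of (b - a) and (p - a)."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def interpolate(v0, v1, v2, px, py):
    """Interpolate a per-vertex value at the point (px, py).

    Each vertex is a hypothetical (x, y, value) tuple. Returns None when the
    point lies outside the triangle or the triangle is degenerate.
    """
    (x0, y0, c0), (x1, y1, c1), (x2, y2, c2) = v0, v1, v2
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return None  # zero-area triangle, nothing to rasterize
    # Each weight is the sub-triangle area opposite a vertex over the total.
    w0 = edge(x1, y1, x2, y2, px, py) / area
    w1 = edge(x2, y2, x0, y0, px, py) / area
    w2 = edge(x0, y0, x1, y1, px, py) / area
    if w0 < 0 or w1 < 0 or w2 < 0:
        return None  # point is outside the triangle
    return w0 * c0 + w1 * c1 + w2 * c2
```

A point for which `interpolate` returns None corresponds to a pixel or sample position that the triangle does not cover.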
Next, the pixels are read from frame buffer 58 and converted into an analog video signal by digital-to-analog converters 60. If samples are used, the samples are read out of frame buffer 58 and filtered to generate pixels, which are then conveyed to digital-to-analog converters 60. The video signal from converters 60 is conveyed to a display device 62 such as a computer monitor, LCD, or projector.
As noted above, many applications place great demands on graphics systems. In some graphics systems, the rasterization algorithm is configured to calculate multiple pixels/samples per clock cycle in groups called "tiles". Unfortunately, this can lead to less than ideal datapath utilization due to an effect called fragmentation. Fragmentation occurs when a portion of the rasterization hardware is assigned to areas outside of the geometry currently being rasterized. For example, a rasterization algorithm that calculates tiles of two horizontally adjacent pixels per cycle may experience fragmentation when the geometry being rasterized has an odd width in pixels. The last cycle of rasterization on an odd width will have only one pixel to calculate. The adjacent pixel, being outside of the current geometry, will not be rendered. This causes an inefficiency as subsequent hardware in the pipeline will be unused for this tile's missing or disabled pixel. For example, if the set-up and draw processor is configured to rasterize one tile having two pixels per clock cycle, and if the frame buffer memory is configured to store one tile per clock cycle, then only 50% of the frame buffer's memory bandwidth is used on cycles that write only one pixel. This inefficiency can cause a reduction in graphics system performance because frame buffer bandwidth (also called fill rate) is often a limiting factor in graphics systems. Thus, a system and method capable of improving fill rate performance with respect to fragmentation is desired.
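The arithmetic behind the two-pixel tile example can be sketched as follows. This is a simplified model that assumes a single horizontal span aligned to a tile boundary; the function names are hypothetical.

```python
def last_cycle_utilization(span_width, tile_width=2):
    """Fraction of pixel slots used on the final cycle of a span."""
    if span_width <= 0:
        raise ValueError("span_width must be positive")
    remainder = span_width % tile_width
    return 1.0 if remainder == 0 else remainder / tile_width


def span_utilization(span_width, tile_width=2):
    """Fraction of pixel slots used over all cycles needed for a span."""
    if span_width <= 0:
        raise ValueError("span_width must be positive")
    cycles = -(-span_width // tile_width)  # ceiling division
    return span_width / (cycles * tile_width)
```

Under these assumptions, a span five pixels wide takes three cycles with two-pixel tiles: the last cycle uses one of its two slots (50%), and the span as a whole uses five of six slots.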
The problems set forth above may at least in part be solved by a system and method that are capable of packing pixels together to provide a more efficient utilization of post-rasterization hardware in the graphics system.
In one embodiment, the graphics system is configured to receive and rasterize graphics data. The rasterization process may be performed at a faster cycle rate than the post-rasterization hardware in the graphics system. The output from the rasterization hardware is stored in a FIFO memory that is configured to shift pixels in order to improve fill rate performance. Advantageously, in some embodiments the requirements for spatial adjacency and pixel-enable matching between cycles may be reduced or eliminated. Furthermore, in some embodiments pixels may be packed from several different cycles into the current cycle, including skipping over cycles that do not contain a packing opportunity.
In one embodiment, the method includes storing tiles of potentially fragmented rasterization data (e.g., pixels) into a queue (e.g., in FIFO memory) in which writes to the frame buffer are assembled. For each unused pixel position in the tile at the head of the queue, a search is performed looking back into the queue for a tile that contains an enabled pixel in the same position (e.g., relative pixel position within the tile). Pixels meeting selected criteria (e.g., belonging to the same memory block and different interleave relative to the other pixels in the tile with the empty pixel position) are removed from their original tile and placed in a tile at the head of the queue. Any tiles in the queue that no longer contain any pixels may be dropped. By dropping empty tiles in a faster clock domain and sending more fully packed cycles to a slower clock domain, the utilization of the slower clock domain is improved. This may advantageously improve the percentage utilization of the frame buffer's fill bandwidth.
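The packing steps described above may be sketched in software as follows. This is an illustrative model only, not the hardware implementation: a tile is assumed to be a list of pixel slots in which `None` marks a disabled position, and the names `pack_and_drain`, `depth`, and `compatible` are hypothetical. The `compatible` predicate stands in for the selected criteria (e.g., memory block and interleave checks) and accepts every pixel by default.

```python
from collections import deque
from itertools import islice


def has_pixels(tile):
    """True when at least one slot of the tile holds an enabled pixel."""
    return any(pixel is not None for pixel in tile)


def pack_and_drain(tiles, depth=4, compatible=lambda head, pixel: True):
    """Return the tiles that would be written, in order, after packing.

    For each empty slot in the head tile, the next `depth` tiles are searched
    for an enabled pixel in the same slot that satisfies `compatible`.
    """
    queue = deque(list(tile) for tile in tiles if has_pixels(tile))
    written = []
    while queue:
        head = queue.popleft()
        for slot, occupant in enumerate(head):
            if occupant is not None:
                continue
            for later in islice(queue, depth):
                pixel = later[slot]
                if pixel is not None and compatible(head, pixel):
                    head[slot] = pixel   # move the pixel into the head tile
                    later[slot] = None   # and disable it in its original tile
                    break
        # Drop tiles that no longer contain any pixels.
        queue = deque(tile for tile in queue if has_pixels(tile))
        written.append(head)
    return written
```

For example, with two-pixel tiles, `pack_and_drain([["A", None], ["B", None], [None, "C"], ["D", "E"]])` yields three tiles rather than four, because the tile that held only "C" is emptied and dropped.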
Since the tile at the head of the queue may contain pixels from a different part of the screen relative to the tiles behind the head of the queue, the tile at the head of the queue may be configured to carry unique position information for each pixel. That information may be derived from information carried with each pixel's original tile.
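One way to picture deriving per-pixel position information from the original tile is sketched below. The layout is an assumption made for illustration: each tile is taken to carry a single origin `(tile_x, tile_y)`, and slots are taken to be arranged row by row, `tile_width` pixels per row. The `Pixel` type and `attach_positions` function are hypothetical names.

```python
from typing import List, NamedTuple, Optional


class Pixel(NamedTuple):
    x: int
    y: int
    color: int


def attach_positions(tile_x: int, tile_y: int,
                     colors: List[Optional[int]],
                     tile_width: int = 2) -> List[Optional[Pixel]]:
    """Expand a tile's shared origin into an explicit (x, y) for each pixel.

    `colors` holds one entry per slot; None marks a disabled slot.
    """
    pixels: List[Optional[Pixel]] = []
    for slot, color in enumerate(colors):
        if color is None:
            pixels.append(None)
        else:
            pixels.append(Pixel(tile_x + slot % tile_width,
                                tile_y + slot // tile_width,
                                color))
    return pixels
```

Once every pixel carries its own coordinates, a pixel moved into another tile can still be written to the correct screen location.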
Through selection of the depth of the queue (i.e., the number of tiles that are candidates for selecting pixels for packing), the utilization can be improved and a tradeoff of packing hardware versus utilization can be made.
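The effect of queue depth on utilization can be illustrated with a small model. Here each tile is assumed to be an integer bitmask of enabled pixel positions, and any selection criteria beyond matching position are ignored; `cycles_needed` is a hypothetical name.

```python
def cycles_needed(masks, depth):
    """Count write cycles when the head may take pixels from `depth` tiles.

    Each entry of `masks` is a bitmask of enabled pixel positions in a tile.
    """
    queue = [mask for mask in masks if mask]
    cycles = 0
    while queue:
        head = queue.pop(0)
        for i in range(min(depth, len(queue))):
            movable = queue[i] & ~head  # enabled there, empty in the head
            head |= movable
            queue[i] &= ~movable
        queue = [mask for mask in queue if mask]  # drop emptied tiles
        cycles += 1
    return cycles
```

For the four half-full tiles `[0b01, 0b01, 0b10, 0b10]`, this model needs 4 cycles at depth 0, 3 at depth 1, and 2 at depth 2 or 3. Past a certain depth, additional search hardware yields no further benefit for this input.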
Certain graphics memory systems place restrictions on the X and Y screen locations that can be stored to the frame buffer in a single cycle. For example, if a particular store cycle contains pixels from different DRAM pages, the frame buffer memory may not be able to process this store in a single clock cycle. Advantageously, however, the method may be configured to allow for restrictions on X and Y locations of the pixels that are candidates for packing together in a single tile/cycle. While inspecting subsequent tiles for possible pixel packing opportunities, a check of the subsequent tile's X and Y locations can be made. If that tile is in a screen location that is not compatible with the memory system's restrictions, then that subsequent cycle is not considered for pixel packing.
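A location check of this kind might look like the following sketch. The page geometry is purely an assumption for illustration (a DRAM page is taken to cover a 32 by 16 pixel screen region); actual memory systems may impose different restrictions, and `page_of` and `packing_candidates` are hypothetical names.

```python
PAGE_WIDTH = 32   # assumed page width in pixels, for illustration only
PAGE_HEIGHT = 16  # assumed page height in pixels, for illustration only


def page_of(x, y, page_width=PAGE_WIDTH, page_height=PAGE_HEIGHT):
    """Map a screen location to the (column, row) of its assumed page."""
    return (x // page_width, y // page_height)


def packing_candidates(head_xy, later_tiles_xy):
    """Indices of later tiles that fall in the same page as the head tile.

    Tiles in any other page are skipped rather than considered for packing.
    """
    head_page = page_of(*head_xy)
    return [index for index, xy in enumerate(later_tiles_xy)
            if page_of(*xy) == head_page]
```

With a head tile at (0, 0) and later tiles at (30, 15), (32, 0), (4, 16), and (8, 8), only the first and last tiles share the head's page under these assumptions, so only those two are searched for pixels to pack.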