Background: 3D Computer Graphics
One of the driving features in the performance of most single-user computers is computer graphics. This is particularly important in computer games and workstations, but is generally very important across the personal computer market.
For some years, the most critical area of graphics development has been in three-dimensional (“3D”) graphics. The peculiar demands of 3D graphics are driven by the need to present a realistic view, on a computer monitor, of a three-dimensional scene. The pattern written onto the two-dimensional screen must, therefore, be derived from the three-dimensional geometries in such a way that the user can easily “see” the three-dimensional scene (as if the screen were merely a window into a real three-dimensional scene). This requires extensive computation to obtain the correct image for display, taking account of surface textures, lighting, shadowing, and other characteristics.
The starting point (for the aspects of computer graphics considered in the present application) is a three-dimensional scene, with specified viewpoint and lighting (etc.). The elements of a 3D scene are normally defined by sets of polygons (typically triangles), each having attributes such as color, reflectivity, and spatial location. (For example, a walking human, at a given instant, might be translated into a few hundred triangles which map out the surface of the human's body.) Textures are “applied” onto the polygons, to provide detail in the scene. (For example, a flat, carpeted floor will look far more realistic if a simple repeating texture pattern is applied onto it.) Designers use specialized modelling software tools, such as 3D Studio, to build textured polygonal models.
The 3D graphics pipeline consists of two major stages, or subsystems, referred to as geometry and rendering. The geometry stage is responsible for managing all polygon activities and for converting three-dimensional spatial data into a two-dimensional representation of the viewed scene, with properly-transformed polygons. The polygons in the three-dimensional scene, with their applied textures, must then be transformed to obtain their correct appearance from the viewpoint of the moment; this transformation requires calculation of lighting (and apparent brightness), foreshortening, obstruction, etc.
However, even after these transformations and extensive calculations have been done, there is still a large amount of data manipulation to be done: the correct values for EACH PIXEL of the transformed polygons must be derived from the two-dimensional representation. (This requires not only interpolation of pixel values within a polygon, but also correct application of properly oriented texture maps.) The rendering stage is responsible for these activities: it “renders” the two-dimensional data from the geometry stage to produce correct values for all pixels of each frame of the image sequence.
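The per-pixel interpolation mentioned above can be illustrated with a minimal sketch. It assumes barycentric weighting in screen space, which is one common way to interpolate values across a triangle and is not necessarily the method used by any particular renderer. The function names `edge` and `interpolate_in_triangle` are hypothetical, and the sketch ignores perspective correction and texture lookup.

```python
def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p) in 2D screen space."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def interpolate_in_triangle(v0, v1, v2, c0, c1, c2, p):
    """Interpolate per-vertex colors c0..c2 at pixel p.

    Returns None when p lies outside the triangle or the triangle
    is degenerate (zero area). This is an illustrative assumption,
    not a description of a specific hardware rasterizer.
    """
    area = edge(v0, v1, v2)
    if area == 0:
        return None
    # Barycentric weights: each weight is the sub-triangle area opposite
    # the corresponding vertex, divided by the full triangle area.
    w0 = edge(v1, v2, p) / area
    w1 = edge(v2, v0, p) / area
    w2 = edge(v0, v1, p) / area
    if w0 < 0 or w1 < 0 or w2 < 0:
        return None
    return tuple(w0 * a + w1 * b + w2 * c for a, b, c in zip(c0, c1, c2))
```

A renderer would evaluate something like this for every pixel covered by every polygon in every frame, which gives a sense of why the rendering stage is so demanding.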
The most challenging 3D graphics applications are dynamic rather than static. In addition to changing objects in the scene, many applications also seek to convey an illusion of movement by changing the scene in response to the user's input. Whenever a change in the orientation or position of the camera is desired, every object in a scene must be recalculated relative to the new view. As can be imagined, a fast-paced game needing to maintain a high frame rate will require many calculations and many memory accesses.
Background: Texturing
There are different ways to add complexity to a 3D scene. Creating more and more detailed models, consisting of a greater number of polygons, is one way to add visual interest to a scene. However, adding polygons necessitates paying the price of having to manipulate more geometry. 3D systems have what is known as a “polygon budget,” an approximate number of polygons that can be manipulated without unacceptable performance degradation. In general, fewer polygons yield higher frame rates.
The visual appeal of computer graphics rendering is greatly enhanced by the use of “textures”. A texture is a two-dimensional image which is mapped into the data to be rendered. Textures provide a very efficient way to generate the level of minor surface detail which makes synthetic images realistic, without requiring transfer of immense amounts of data. Texture patterns provide realistic detail at the sub-polygon level, so the higher-level tasks of polygon-processing are not overloaded. See Foley et al., Computer Graphics: Principles and Practice (2.ed. 1990, corr. 1995), especially at pages 741-744; Paul S. Heckbert, “Fundamentals of Texture Mapping and Image Warping,” Thesis submitted to Dept. of EE and Computer Science, University of California, Berkeley, Jun. 17, 1994; Heckbert, “Survey of Computer Graphics,” IEEE Computer Graphics, November 1986, pp. 56; all of which are hereby incorporated by reference. Game programmers have also found that texture mapping is generally a very efficient way to achieve very dynamic images without requiring a hugely increased memory bandwidth for data handling.
A typical graphics system reads data from a texture map, processes it, and writes color data to display memory. The processing may include mipmap filtering which requires access to several maps. The texture map need not be limited to colors, but can hold other information that can be applied to a surface to affect its appearance; this could include height perturbation to give the effect of roughness. The individual elements of a texture map are called “texels”.
Awkward side-effects of texture mapping occur unless the renderer can apply texture maps with correct perspective. Perspective-corrected texture mapping involves an algorithm that translates “texels” (pixels from the bitmap texture image) into display pixels in accordance with the spatial orientation of the surface. Since the surfaces are transformed (by the host or geometry engine) to produce a 2D view, the textures will need to be similarly transformed by a linear transform (normally projective or “affine”). (In conventional terminology, the coordinates of the object surface, i.e. the primitive being rendered, are referred to as an (s,t) coordinate space, and the map of the stored texture is referred to as a (u,v) coordinate space.) The transformation in the resulting mapping means that a horizontal line in the (x,y) display space is very likely to correspond to a slanted line in the (u,v) space of the texture map, and hence many additional reads will occur, due to the texturing operation, as rendering walks along a horizontal line of pixels.
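The difference between affine and perspective-corrected texture coordinates can be sketched for a single texture coordinate interpolated along one span of pixels. The sketch assumes the widely used approach of interpolating u/w and 1/w linearly in screen space and dividing per pixel, where w is the depth-like homogeneous coordinate at each endpoint. The function names are hypothetical.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def affine_u(u0, u1, t):
    """Screen-space linear interpolation of u (ignores perspective)."""
    return lerp(u0, u1, t)


def perspective_u(u0, w0, u1, w1, t):
    """Perspective-corrected u at screen-space fraction t.

    u/w and 1/w vary linearly in screen space, so both are interpolated
    linearly and then divided to recover u at the pixel.
    """
    inv_w = lerp(1.0 / w0, 1.0 / w1, t)
    u_over_w = lerp(u0 / w0, u1 / w1, t)
    return u_over_w / inv_w
```

For a span running from u = 0 at w = 1 to u = 1 at w = 3, the screen-space midpoint maps to u = 0.5 under affine interpolation but to u = 0.25 with perspective correction, because the nearer half of the surface occupies more of the screen.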
One of the requirements of many 3-D graphics applications (especially gaming applications) is a high fill rate and a high texturing rate. Gaming and DCC (digital content creation) applications use complex textures, and may often use multiple textures with a single primitive. (CAD and similar workstation applications, by contrast, make much less use of textures, and typically use smaller polygons but more of them.) Achieving an adequately high rate of texturing and fill operations requires a very large memory bandwidth.
Background: Binning
In a tiled, binning, chunking, or bucket rendering architecture, the primitives are sorted into screen regions before they are rendered. This architecture allows all the primitives within a screen region to be rendered together to exploit the higher locality of reference to the z and color buffers, thereby allowing more efficient memory usage, typically by using only on-chip memory. This also enables other whole-scene rendering opportunities such as deferred rendering, order-independent transparency, and new types of antialiasing. In the present application, “transparent” is used generally to designate anything with alpha < 1.
The primitives and state are recorded in a spatial database in memory that represents the frame being rendered. This is done after any T&L processing so everything is in screen coordinates. Ideally, no rendering occurs until the frame is complete; however, it will be done early on a user flush, if the amount of binned data exceeds a programmable threshold, or if the memory set aside to hold the database is exhausted. While the database for one frame is being constructed, the database for an earlier frame will be rendered.
The screen is divided up into rectangular regions called bins, and each bin heads a linked list of bin records that hold the state and primitives that overlap with this bin region. A primitive and its associated state may be repeated across several bins. Vertex data is held separately and is not replicated when a primitive overlaps multiple bins, which allows more efficient storage mechanisms to be used. Primitives are maintained in temporal order within a bin.
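The sorting of primitives into bins can be sketched as follows. This is a minimal illustration under several assumptions: overlap is decided conservatively from each triangle's screen-space bounding box, Python lists stand in for the linked lists of bin records, and bins store only an index into a shared primitive array so that vertex data is not replicated. The names `bins_overlapped` and `bin_primitives` and the default sizes are hypothetical.

```python
from collections import defaultdict


def bins_overlapped(bbox, bin_w, bin_h, screen_w, screen_h):
    """Yield (column, row) of every bin touched by a bounding box."""
    x0, y0, x1, y1 = bbox
    cols = -(-screen_w // bin_w)  # ceiling division
    rows = -(-screen_h // bin_h)
    # Clamp to the screen; a fully off-screen box yields an empty range.
    col0 = max(0, int(x0 // bin_w))
    col1 = min(cols - 1, int(x1 // bin_w))
    row0 = max(0, int(y0 // bin_h))
    row1 = min(rows - 1, int(y1 // bin_h))
    for row in range(row0, row1 + 1):
        for col in range(col0, col1 + 1):
            yield (col, row)


def bin_primitives(triangles, bin_w=32, bin_h=32, screen_w=128, screen_h=64):
    """Map each bin to the indices of the triangles overlapping it.

    Triangles are visited in submission order, so each bin's list stays
    in temporal order. Only the index is stored per bin.
    """
    bins = defaultdict(list)
    for index, verts in enumerate(triangles):
        xs = [v[0] for v in verts]
        ys = [v[1] for v in verts]
        bbox = (min(xs), min(ys), max(xs), max(ys))
        for key in bins_overlapped(bbox, bin_w, bin_h, screen_w, screen_h):
            bins[key].append(index)
    return bins
```

A bounding-box test may place a triangle in a bin it does not actually touch; a more exact overlap test would reduce that cost at the price of more work per primitive.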
Opaque primitives can be rendered in any order and are usually rendered in the order the primitives are submitted. Generally, the depth test ensures that the final result is the same. However, different rendering orders of co-planar polygons will give different results.
To render transparent primitives correctly, they need to be drawn either in a front-to-back or back-to-front order after all the opaque primitives have been rendered. The application sorts the transparent primitives into order before submitting them for rendering, and there are two basic algorithms used:
The application can sort the transparent primitives in a manner similar to the Painter's algorithm (an early method for hidden surface removal). There may be no correct rendering order when transparent primitives are cyclically interleaved or penetrated, and in these cases, the application would need to clip the primitives against each other to generate a definitive order.
The application can submit the transparent primitives multiple times with a dual depth test to render the transparent surfaces one layer at a time. A layer is the set of farthest transparent primitives (or parts thereof) that are in front of the nearest opaque primitives. After each layer is rendered, it is incorporated into the opaque primitives for the next pass. Subsequent layers move closer to the eye position. This technique is called depth peeling. Alternatively, it can be implemented with subsequent layers moving farther away from the eye; however, this requires a triple depth test and is more expensive to render, but has the advantage of terminating early once a certain number of layers has been rendered (extra layers add very little to the fidelity of the image).
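The back-to-front variant of depth peeling can be sketched for a single pixel. The sketch is an illustration under stated assumptions: smaller depth values are closer to the eye, blending is conventional "source over" alpha blending, and fragments are plain (depth, color, alpha) tuples. The function name `peel_back_to_front` is hypothetical, and real hardware performs the two depth comparisons per fragment against depth buffers rather than by searching a list.

```python
def peel_back_to_front(opaque_color, opaque_depth, fragments):
    """Composite transparent fragments over one opaque pixel by peeling.

    fragments: iterable of (depth, (r, g, b), alpha) in any order.
    Returns (final_color, number_of_layers_rendered).
    """
    color = opaque_color
    limit = opaque_depth  # depth of the current "opaque" surface
    layers = 0
    while True:
        # Dual depth test: a fragment must be in front of the current
        # opaque surface, and the pass keeps the farthest such fragment.
        # The strict comparison means fragments at exactly equal depth
        # are not separated in this simplified sketch.
        candidates = [f for f in fragments if f[0] < limit]
        if not candidates:
            break
        depth, frag_color, alpha = max(candidates, key=lambda f: f[0])
        color = tuple(alpha * fc + (1.0 - alpha) * c
                      for fc, c in zip(frag_color, color))
        # The rendered layer is incorporated into the opaque surface.
        limit = depth
        layers += 1
    return color, layers
```

Because each pass selects its layer by depth, the result does not depend on the order in which the application submits the transparent fragments.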
Binning has the following benefits:
Reduces the rendering bandwidth by keeping all the depth and color data on-chip except for the final write to memory once a bin has been processed. For aliased rendering, the frame buffer bandwidth is, therefore, a constant one-pixel write per frame irrespective of overdraw or the amount of alpha-blending or depth read-modify-write operations. Also, note that in many cases, there is no need to save the depth buffer to memory, thereby halving the bandwidth. For full scene antialiasing (FSAA), this is even more dramatic as approximately 4× more reads and writes occur while rendering (assuming 4-sample FSAA). The down-sampling also is done from on-chip memory so the bandwidth demand remains the same as in the non-FSAA case. Some of these bandwidth savings are lost due to the bandwidth needed to build and parse the bin data structures, and this will be exacerbated with FSAA as the caches will cover a smaller area of screen (the database will be traversed more times). The overall bandwidth saving is scene and triangle-size dependent.
Fragment computations or texturing are saved by using deferred rendering. A bin is traversed twice: on the first (simpler) pass, the visibility buffer is set up, and no color calculations are done. On the second pass, only those fragments determined to be visible are rendered, effectively reducing the opaque depth complexity to 1. As most games have an average depth complexity >3, this can give up to a 3× or more boost to the apparent fill rate (depending on the original primitive submission order).
Less FSAA work. During the first pass of the deferred rendering operation, the location of edges (geometric and inferred due to penetrating faces) can be ascertained, and only those sub-tiles holding edges need to have the multi-sample depth values calculated and the color replicated to the covered sample points. This saves cycles to update the multi-sample buffers and any program cost for alpha-blending.
Stochastic super sampling FSAA. The contents of a bin are rendered multiple times with the post-transformed primitives being jittered per pass. This is similar to accumulation buffering at the application level but occurs without any application involvement (motion blur and depth of field effects cannot be done). It has superior quality and a smaller memory footprint than multi-sample FSAA; however, it is slower as the color is computed at each sample point (unlike multi-sample where one color per fragment is calculated).
The T&L and rasterization work proceed in parallel with no fine-grain dependencies, so a bottleneck in one part will not stall the other. Stalling will still happen at frame granularity, but within a frame, the workflow will be much smoother.
Memory footprint can be reduced when the depth buffer does not need to be saved to memory. With FSAA, the depth and color sample buffers are rarely needed after the filtered color has been determined. Note that as all the memory is virtual, space can be allocated for these buffers (in case of a premature flush), but the demand will only be made on the working set if a flush occurs. Note that the semantics of OpenGL can make this hard to use.
State Tracking Methodology
Redundant changes of tracked state issued by the application are filtered out by comparing the new state value with the old value, and if they are the same, no update is made.
State changes are collected in on-chip memory and added to the bin if the state vector associated with the bin is out of date. State changes within a bin are done incrementally in temporal order, and a bin is only brought up to date prior to adding in a new primitive if the state has changed since the last primitive was added to it.
To determine when the state in a bin needs to be updated, each item of state has a timestamp associated with it, and this is updated whenever that state is received. The state items are sorted in temporal order with the most recent state items first. Each bin has a timestamp to indicate when it was last updated. When a bin's state is to be updated, the bin's timestamp is compared against the state timestamp, working from most recent backwards, and the more recent state is copied to the bin. The state items will be added in reverse of the temporal order that they were updated in (which should not cause a problem), and once this has completed, the bin's timestamp is updated. The timestamp is reset at the start of every frame and incremented on the first primitive after a series of state changes. The sorting is done by a doubly linked list, and new state items are moved to the head.
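The state tracking scheme described above can be sketched as follows. This is one possible reading rather than a definitive implementation, and it makes several assumptions: an `OrderedDict` stands in for the doubly linked list, a state change is stamped with the value the frame timestamp will take at the next primitive, and bins start each frame with timestamp 0. The names `StateTracker`, `Bin`, and `add_primitive` are hypothetical.

```python
from collections import OrderedDict


class StateTracker:
    """Tracks state items for one frame, most recently changed first."""

    def __init__(self):
        self.clock = 0      # frame timestamp, reset for every frame
        self.dirty = False  # True if state changed since last primitive
        self.items = OrderedDict()  # name -> (value, stamp)

    def set_state(self, name, value):
        """Record a state change; return False if it was redundant."""
        current = self.items.get(name)
        if current is not None and current[0] == value:
            return False  # same value as before: filtered out
        self.items[name] = (value, self.clock + 1)
        self.items.move_to_end(name, last=False)  # move to the head
        self.dirty = True
        return True

    def begin_primitive(self):
        """Advance the timestamp on the first primitive after changes."""
        if self.dirty:
            self.clock += 1
            self.dirty = False


class Bin:
    def __init__(self):
        self.timestamp = 0  # when this bin's state was last updated
        self.records = []   # state and primitive records, temporal order


def add_primitive(tracker, bins, prim):
    """Add prim to each overlapped bin, updating stale bins first."""
    tracker.begin_primitive()
    for b in bins:
        if b.timestamp < tracker.clock:
            # Walk from the most recent state item backwards and stop
            # at the first item the bin has already seen.
            for name, (value, stamp) in tracker.items.items():
                if stamp <= b.timestamp:
                    break
                b.records.append(("state", name, value))
            b.timestamp = tracker.clock
        b.records.append(("prim", prim))
```

Because the walk stops at the first state item that is not newer than the bin, the cost of bringing a bin up to date is proportional to the number of items that changed since the bin was last touched, not to the total amount of tracked state.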
In addition to the above-listed advantages, the disclosed innovations, in various embodiments, also provide one or more of at least the following advantages:
Increased speed.
Increased efficiency.
Compatibility with OpenGL and similar APIs.