This invention relates to computing systems generally, to three-dimensional computer graphics, more particularly, and more most particularly to structure and method for a three-dimensional graphics processor implementing differed shading and other enhanced features.
Computer graphics is the art and science of generating pictures with a computer. Generation of pictures, or images, is commonly called rendering. Generally, in three-dimensional (3D) computer graphics, geometry that represents surfaces (or volumes) of objects in a scene is translated into pixels stored in a frame buffer, and then displayed on a display device. Real-time display devices, such as CRTs used as computer monitors, refresh the display by continuously displaying the image over and over. This refresh usually occurs row-by-row, where each row is called a raster line or scan line. In this document, raster lines are numbered from bottom to top, but are displayed in order from top to bottom.
In a 3D animation, a sequence of images is displayed, giving the illusion of motion in three-dimensional space. Interactive 3D computer graphics allows a user to change his viewpoint or change the geometry in real-time, thereby requiring the rendering system to create new images on-the-fly in real-time.
In 3D computer graphics, each renderable object generally has its own local object coordinate system, and therefore needs to be translated (or transformed) from object coordinates to pixel display coordinates. Conceptually, this is a 4-step process: 1) translation (including scaling for size enlargement or shrink) from object coordinates to world coordinates, which is the coordinate system for the entire scene; 2) translation from world coordinates to eye coordinates, based on the viewing point of the scene; 3) translation from eye coordinates to perspective translated eye coordinates, where perspective scaling (farther objects appear smaller) has been performed; and 4) translation from perspective translated eye coordinates to pixel coordinates, also called screen coordinates. Screen coordinates are points in three-dimensional space, and can be in either screen-precision (i.e., pixels) or object-precision (high precision numbers, usually floating-point), as described later. These translation steps can be compressed into one or two steps by precomputing appropriate translation matrices before any translation occurs. Once the geometry is in screen coordinates, it is broken into a set of pixel color values (that is xe2x80x9crasterizedxe2x80x9d) that are stored into the frame buffer. Many techniques are used for generating pixel color values, including Gouraud shading, Phong shading, and texture mapping.
A summary of the prior art rendering process can be found in: xe2x80x9cFundamentals of Three-dimensional Computer Graphicsxe2x80x9d, by Watt, Chapter 5: The Rendering Process, pages 97 to 113, published by Addison-Wesley Publishing Company, Reading, Mass., 1989, reprinted 1991, ISBN 0-201-15442-0 (hereinafter referred to as the Watt Reference).
FIG. 1 shows a three-dimensional object, a tetrahedron, with its own coordinate axes (xobj,yobj,zobj). The three-dimensional object is translated, scaled, and placed in the viewing point""s coordinate system based on (xeye,yeye,zeye). The object is projected onto the viewing plane, thereby correcting for perspective. At this point, the object appears to have become two-dimensional; however, the object""s z-coordinates are preserved so they can be used later by hidden surface removal techniques. The object is finally translated to screen coordinates, based on (xscreen,yscreenzscreen), where Zscreen is going perpendicularly into the page. Points on the object now have their x and y coordinates described by pixel location (and fractions thereof) within the display screen and their z coordinates in a scaled version of distance from the viewing point.
Because many different portions of geometry can affect the same pixel, the geometry representing the surfaces closest to the scene viewing point must be determined. Thus, for each pixel, the visible surfaces within the volume subtended by the pixel""s area determine the pixel color value, while hidden surfaces are prevented from affecting the pixel. Non-opaque surfaces closer to the viewing point than the closest opaque surface (or surfaces, if an edge of geometry crosses the pixel area) affect the pixel color value, while all other non-opaque surfaces are discarded. In this document, the term xe2x80x9coccludedxe2x80x9d is used to describe geometry which is hidden by other non-opaque geometry.
Many techniques have been developed to perform visible surface determination, and a survey of these techniques are incorporated herein by reference to: xe2x80x9cComputer Graphics: Principles and Practicexe2x80x9d, by Foley, van Dam, Feiner, and Hughes, Chapter 15: Visible-Surface Determination, pages 649 to 720, 2nd edition published by Addison-Wesley Publishing Company, Reading, Mass., 1990, reprinted with corrections 1991, ISBNO-201-12110-7 (hereinafter referred to as the Foley Reference). In the Foley Reference, on page 650, the terms xe2x80x9cimage-precisionxe2x80x9d and xe2x80x9cobject-precisionxe2x80x9d are defined: xe2x80x9cImage-precision algorithms are typically performed at the resolution of the display device, and determine the visibility at each pixel. Object-precision algorithms are performed at the precision with which each object is defined, and determine the visibility of each object.xe2x80x9d
As a rendering process proceeds, most prior art renderers must compute the color value of a given screen pixel multiple times because multiple surfaces intersect the volume subtended by the pixel. The average number of times a pixel needs to be rendered, for a particular scene, is called the depth complexity of the scene. Simple scenes have a depth complexity near unity, while complex scenes can have a depth complexity of ten or twenty. As scene models become more and more complicated, renderers will be required to process scenes of ever increasing depth complexity. Thus, for most renders, the depth complexity of a scene is a measure of the wasted processing. For example, for a scene with a depth complexity of ten, 90% of the computation is wasted on hidden pixels. This wasted computation is typical of hardware renderers that use the simple Z-buffer technique (discussed later herein), generally chosen because it is easily built in hardware. Methods more complicated than the Z Buffer technique have heretofore generally been too complex to build in a cost-effective manner. An important feature of the method and apparatus invention presented here is the avoidance of this wasted computation by eliminating hidden portions of geometry before they are rasterized, while still being simple enough to build in cost-effective hardware.
When a point on a surface (frequently a polygon vertex) is translated to screen coordinates, the point has three coordinates: 1) the x-coordinate in pixel units (generally including a fraction); 2) the y-coordinate in pixel units (generally including a fraction); and 3) the z-coordinate of the point in either eye coordinates, distance from the virtual screen, or some other coordinate system which preserves the relative distance of surfaces from the viewing point. In this document, positive z-coordinate values are used for the xe2x80x9clook directionxe2x80x9d from the viewing point, and smaller values indicate a position closer to the viewing point.
When a surface is approximated by a set of planar polygons, the vertices of each polygon are translated to screen coordinates. For points in or on the polygon (other than the vertices), the screen coordinates are interpolated from the coordinates of vertices, typically by the processes of edge walking and span interpolation. Thus, a z-coordinate value is generally included in each pixel value (along with the color value) as geometry is rendered.
Many hardware renderers have been developed, and an example is incorporated herein by reference: xe2x80x9cLeo: A System for Cost Effective 3D Shaded Graphicsxe2x80x9d, by Deering and Nelson, pages 101 to 108 of SIGGRAPH93 Proceedings, Aug. 1-6 1993, Computer Graphics Proceedings, Annual Conference Series, published by ACM SIGGRAPH, New York, 1993, Softcover ISBN 0-201-58889-7 and CD-ROM ISBN 0-201-56997-3 (hereinafter referred to as the Deering Reference). The Deering Reference includes a diagram of a generic 3D graphics pipeline (i.e., a renderer, or a rendering system) that it describes as xe2x80x9ctruly generic, as at the top level nearly every commercial 3D graphics accelerator fits this abstractionxe2x80x9d, and this pipeline diagram is reproduced here as FIG. 2. Such pipeline diagrams convey the process of rendering, but do not describe any particular hardware. This document presents a new graphics pipeline that shares some of the steps of the generic 3D graphics pipeline. Each of the steps in the generic 3D graphics pipeline will be briefly explained here, and are also shown in the method flow diagram of FIG. 3. Processing of polygons is assumed throughout this document, but other methods for describing 3D geometry could be substituted. For simplicity of explanation, triangles are used as the type of polygon in the described methods.
As seen in FIG. 2, the first step within the floating-point intensive functions of the generic 3D graphics pipeline after the data input (Step 212) is the transformation step (Step 214), which was described above. The transformation step is also shown in FIG. 3 as the first step in the outer loop of the method flow diagram, and also includes xe2x80x9cget next polygonxe2x80x9d. The second step, the clip test, checks the polygon to see if it is at least partially contained in the view volume (sometimes shaped as a frustum) (Step 216). If the polygon is not in the view volume, it is discarded; otherwise processing continues. The third step is face determination, where polygons facing away from the viewing point are discarded (Step 218). Generally, face determination is applied only to objects that are closed volumes. The fourth step, lighting computation, generally includes the set up for Gouraud shading and/or texture mapping with multiple light sources of various types, but could also be set up for Phong shading or one of many other choices (Step 222). The fifth step, clipping, deletes any portion of the polygon that is outside of the view volume because that portion would not project within the rectangular area of the viewing plane (Step 224). Generally, polygon clipping is done by splitting the polygon into two smaller polygons that both project within the area of the viewing plane. Polygon clipping is computationally expensive. The sixth step, perspective divide, does perspective correction for the projection of objects onto the viewing plane (Step 226). At this point, the points representing vertices of polygons are converted to pixel space coordinates by step seven, the screen space conversion step (Step 228). The eighth step (Step 230), set up for incremental render, computes the various begin, end, and increment values needed for edge walking and span interpolation (e.g.: x, y, and z-coordinates; RGB color; texture map space u and v-coordinates; and the like).
Within the drawing intensive functions, edge walking (Step 232) incrementally generates horizontal spans for each raster line of the display device by incrementing values from the previously generated span (in the same polygon), thereby xe2x80x9cwalkingxe2x80x9d vertically along opposite edges of the polygon. Similarly, span interpolation (Step 234) xe2x80x9cwalksxe2x80x9d horizontally along a span to generate pixel values, including a z-coordinate value indicating the pixel""s distance from the viewing point. Finally, the z-buffered blending also referred to as Testing and Blending (Step 236) generates a final pixel color value. The pixel values also include color values, which can be generated by simple Gouraud shading (i.e., interpolation of vertex color values) or by more computationally expensive techniques such as texture mapping (possibly using multiple texture maps blended together), Phong shading (i.e., per-fragment lighting), and/or bump mapping (perturbing the interpolated surface normal). After drawing intensive functions are completed, a double-buffered MUX output look-up table operation is performed (Step 238). In this figure the blocks with rounded corners typically represent functions or process operations, while sharp cornered rectangles typically represent stored data or memory.
By comparing the generated z-coordinate value to the corresponding value stored in the Z Buffer, the z-buffered blend either keeps the new pixel values (if it is closer to the viewing point than previously stored value for that pixel location) by writing it into the frame buffer, or discards the new pixel values (if it is farther). At this step, antialiasing methods can blend the new pixel color with the old pixel color. The z-buffered blend generally includes most of the per-fragment operations, described below.
The generic 3D graphics pipeline includes a double buffered frame buffer, so a double buffered MUX is also included. An output lookup table is included for translating color map values. Finally, digital to analog conversion makes an analog signal for input to the display device.
A major drawback to the generic 3D graphics pipeline is its drawing intensive functions are not deterministic at the pixel level given a fixed number of polygons. That is, given a fixed number of polygons, more pixel-level computation is required as the average polygon size increases. However, the floating-point intensive functions are proportional to the number of polygons, and independent of the average polygon size. Therefore, it is difficult to balance the amount of computational power between the floating-point intensive functions and the drawing intensive functions because this balance depends on the average polygon size.
Prior art Z Buffers are based on conventional Random Access Memory (RAM or DRAM), Video RAM (VRAM), or special purpose DRAMs. One example of a special purpose DRAM is presented in xe2x80x9cFBRAM: A new Form of Memory Optimized for 3D Graphicsxe2x80x9d, by Deering, Schlapp, and Lavelle, pages 167 to 174 of SIGGRAPH94 Proceedings, Jul. 24-29, 1994, Computer Graphics Proceedings, Annual Conference Series, published by ACM SIGGRAPH, New York, 1994, Softcover ISBN 0201607956.
OpenGL is a software interface to graphics hardware which consists of several hundred functions and procedures that allow a programmer to specify objects and operations to produce graphical images. The objects and operations include appropriate characteristics to produce color images of three-dimensional objects. Most of OpenGL (Version 1.2) assumes or requires a that the graphics hardware include a frame buffer even though the object may be a point, line, polygon, or bitmap, and the operation may be an operation on that object. The general features of OpenGL Oust one example of a graphical interface) are described in the reference xe2x80x9cThe OpenGL(copyright) Graphics System: A Specification (Version 1.2) edited by Mark Segal and Kurt Akeley, Version 1.2, March 1998; and hereby incorporated by reference. Although reference is made to OpenGL, the invention is not limited to structures, procedures, or methods which are compatible or consistent with OpenGL, or with any other standard or non-standard graphical interface. Desirably, the inventive structure and method may be implemented in a manner that is consistent with the OpenGL, or other standard graphical interface, so that a data set prepared for one of the standard interfaces may be processed by the inventive structure and method without modification. However, the inventive structure and method provides some features not provided by OpenGL, and even when such generic input/output is provided, the implementation is provided in a different manner.
The phrase xe2x80x9cpipeline statexe2x80x9d does not have a single definition in the prior-art. The OpenGL specification, for example, sets forth the type and amount of the graphics rendering machine or pipeline state in terms of items of state and the number of bits and bytes required to store that state information. In the OpenGL definition, pipeline state tends to include object vertex pertinent information including for example, the vertices themselves the vertex normals, and color as well as xe2x80x9cnon-vertexxe2x80x9d information.
When information is sent into a graphics renderer, at least some object geometry information is provided to describe the scene. Typically, the object or objects are specified in terms of vertex information, where an object is modeled, defined, or otherwise specified by points, lines, or polygons (object primitives) made up of one or more vertices. In simple terms, a vertex is a location in space and may be specified for example by a three-space (x,y,z) coordinate relative to some reference origin. Associated with each vertex is other information, such as a surface normal, color, texture, transparency, and the like information pertaining to the characteristics of the vertex. This information is essentially xe2x80x9cper-vertexxe2x80x9d information. Unfortunately, forcing a one-to-one relationship between incoming information and vertices as a requirement for per-vertex information is unnecessarily restrictive. For example, a color value may be specified in the data stream for a particular vertex and then not respecified in the data stream until the color changes for a subsequent vertex. The color value may still be characterized as per-vertex data even though a color value is not explicitly included in the incoming data stream for each vertex.
Texture mapping presents an interesting example of information or data which could be considered as either per-vertex information or pipeline state information. For each object, one or more texture maps may be specified, each texture map being identified in some manner, such as with a texture coordinate or coordinates. One may consider the texture map to which one is pointing with the texture coordinate as part of the pipeline state while others might argue that it is per-vertex information.
Other information, not related on a one-to-one basis to the geometry object primitives, used by the renderer such as lighting location and intensity, material settings, reflective properties, and other overall rules on which the renderer is operating may more accurately be referred to as pipeline state. One may consider that everything that does not or may not change on a per-vertex basis is pipeline state, but for the reasons described, this is not an entirely unambiguous definition. For example, one may define a particular depth test (See later description) to be applied to certain objects to be rendered, for example the depth test may require that the z-value be strictly xe2x80x9cgreater-thanxe2x80x9d for some objects and xe2x80x9cgreater-than-or-equal-toxe2x80x9d for other objects. These particular depth tests which change from time to time, may be considered to be pipeline state at that time.
Parameters considered to be renderer (pipeline) state in OpenGL are identified in Section 6.2 of the afore referenced OpenGL Specification (Version 1.2, at pages 193-217).
Essentially then, there are two types of data or information used by the renderer: (1) primitive data which may be thought of as per-vertex data, and (ii) pipeline state data (or simply pipeline state) which is everything else. This distinction should be thought of as a guideline rather than as a specific rule, as there are ways of implementing a graphics renderer treating certain information items as either pipeline state or non-pipeline state.
In the generic 3D graphics pipeline, the xe2x80x9cz-buffered blendxe2x80x9d step actually incorporates many smaller xe2x80x9cper-fragmentxe2x80x9d operational steps.
Application Program Interfaces (APIs), such as OpenGL (Open Graphics Library) and D3D, define a set of per-fragment operations (See Chapter 4 of Version 1.2 OpenGL Specification). We briefly review some exemplary OpenGL per-fragment operations so that any generic similarities and differences between the inventive structure and method and conventional structures and procedures can be more readily appreciated.
Under OpenGL, a frame buffer stores a set of pixels as a two-dimensional array. Each picture-element or pixel stored in the frame buffer is simply a set of some number of bits. The number of bits per pixel may vary depending on the particular GL implementation or context.
Corresponding bits from each pixel in the framebuffer are grouped together into a bitplane; each bitplane containing a single bit from each pixel. The bitplanes are grouped into several logical buffers referred to as the color, depth, stencil, and accumulation buffers. The color buffer in turn includes what is referred to under OpenGI as the front left buffer, the front right buffer, the back left buffer, the back right buffer, and some additional auxiliary buffers. The values stored in the front buffers are the values typically displayed on a display monitor while the contents of the back buffers and auxiliary buffers are invisible and not displayed. Stereoscopic contexts display both the front left and the front right buffers, while monoscopic contexts display only the front left buffer. In general, the color buffers must have the same number of bitplanes, but particular implementations of context may not provide right buffers, back buffers, or auxiliary buffers at all, and an implementation or context may additionally provide or not provide stencil, depth, or accumulation buffers.
Under OpenGL, the color buffers consist of either unsigned integer color indices or R, G, B, and, optionally, a number xe2x80x9cAxe2x80x9d of unsigned integer values; and the number of bitplanes in each of the color buffers, the depth buffer (if provided), the stencil buffer (if provided), and the accumulation buffer (if provided), is fixed and window dependent. If an accumulation buffer is provided, it should have at least as many bit planes per R, G, and B color component as do the color buffers.
A fragment produced by rasterization with window coordinates of (xw,yw) modifies the pixel in the framebuffer at that location based on a number of tests, parameters, and conditions. Noteworthy among the several tests that are typically performed sequentially beginning with a fragment and its associated data and finishing with the final output stream to the frame buffer are in the order performed (and with some variation among APIs): 1) pixel ownership test; 2) scissor test; 3) alpha test; 4) Color Test; 5) stencil test; 6) depth test; 7) blending; 8) dithering; and 9) logicop. Note that the OpenGL does not provide for an explicit xe2x80x9ccolor testxe2x80x9d between the alpha test and stencil test. Per-Fragment operations under OpenGL are applied after all the color computations. Each of these tests or operations is briefly described below.
Under OpenGL, the pixel ownership test determines if the pixel at location (x2, yw) in the framebuffer is currently owned by the GL context. If it is not, the window system decides the fate of the incoming fragment. Possible results are that the fragment is discarded or that some subset of the subsequent per-fragment operations are applied to the fragment. This pixel ownership test allows the window system to properly control the GL""s behavior.
Assume that in a computer having a display screen, one or several processes are running and that each process has a window on the display screen. For each process, the associated window defines the pixels the process wants to write or render to. When there are two or more windows, the window associated with one process may be in front of the window associated with another process, behind that window, or both windows may be entirely visible. Since there is only a single frame buffer for the entire display screen or desktop, the pixel ownership test involves determining which process and associated window owns each of the pixels. If a particular process does not xe2x80x9cownxe2x80x9d a pixel, it fails the pixel ownership test relative to the frame buffer and that pixel is thrown away. Note that under the typical paradigm, the pixel ownership test is run by each process, and that for a give pixel location in the frame buffer, that pixel may pass the pixel ownership test for one of the processes, and fail the pixel ownership test for the other process. Furthermore, in general, a particular pixel can pass the ownership test for only one process because only one process can own a particular frame buffer pixel at the same time.
In some rendering schemes the pixel ownership test may not be particularly relevant. For example, if the scene is being rendered to an off-screen buffer, and subsequently Block Transferred or xe2x80x9cblittedxe2x80x9d to the desktop, pixel ownership is not really even relevant. Each process automatically or necessarily passes the pixel ownership test (if it is executed) because each process effectively owns its own off-screen buffer and nothing is in front of that buffer.
If for a particular process, the pixel is not owned by that process, then there is no need to write a pixel value to that location, and all subsequent processing for that pixel may be ignored. In a typical workstation, all the data associated with a particular pixel on the screen is read during rasterization. All information for any polygon that feeds that pixel is read, including information as to the identity of the process that owns that frame buffer pixel, as well as the z-buffer, the color value, the old color value, the alpha value, stencil bits, and so forth. If a process owns the pixel, then the other downstream process are executed (for example, scissor test, alpha test, and the like) On the other hand, if the process does not own the pixel and fails the ownership test for that pixel, the process need not consider that pixel further and that pixel is skipped for subsequent tests.
Under OpenGL, the scissor test determines if (xw, yw) lies within a scissor rectangle defined by four coordinate values corresponding to a left bottom (left, bottom) coordinate, a width of the rectangle, and a height of the rectangle. The values are set with the procedure xe2x80x9cvoid Scissor( int left, int bottom, sizei width, sizei height)xe2x80x9d under OpenGL. If left xe2x89xa6xw less than left+width and bottom xe2x89xa6yw less than bottom+height, then the scissor test passes; otherwise the scissor test fails and the particular fragment being tested is discarded. Various initial states are provided and error conditions monitored and reported.
In simple terms, a rectangle defines a window which may be an on-screen or off-screen window. The window is defined by an x-left, x-right, y-top, and y-bottom coordinate (even though it may be expressed in terms of a point and height and width dimensions from that point). This scissor window is useful in that only pixels from a polygon fragment that fall in that screen aligned scissor window will change. In the event that a polygon straddles the scissor window, only those pixels that are inside the scissor window may change.
When a polygon in an OpenGL machine comes down the pipeline, the pipeline calculates everything it needs to in order to determine the z-value and color of that pixel. Once z-value and color are determined, that information is used to determine what information should be placed in the frame buffer (thereby determining what is displayed on the display screen).
Just as with the pixel ownership test, the scissor test provides means for discarding pixels and/or fragments before they actually get to the frame buffer to cause the output to change.
Color is defined by four values, red (R), green (G), blue (B), and alpha (A). The RGB values define the contribution from each of the primary colors, and alpha is related to the transparency. Typically, color is a 32-bit value, 8-bits for each component, though such representation is not limited to 32-bits. Alpha test compares the alpha value of a given pixel to an alpha reference value. The type of comparison may also be specified, so that for example the comparison may be a greater-than operation, a less-than operation, and so forth. If the comparison is a greater-than operation, then the pixel""s alpha value has to be greater than the reference to pass the alpha test. So if the pixel""s alpha value is 0.9, the reference alpha is 0.8, and the comparison is greater-than, then that pixel passes the alpha test. Any pixel not passing the alpha test is thrown away or discarded. The OpenGL Specification describes the manner in which alpha test is implemented in OpenGL, and we do not describe it further here.
Alpha test is a per-fragment operation and happens after all of the fragment coloring calculations and lighting and shading operations are completed. Each of these per-fragment operations may be though of as part of the conventional z-buffer blending operations.
Color test is similar to the alpha test described hereinbefore, except that rather than performing the magnitude or logical comparisons between the pixel alpha (A) value and a reference value, the color test performs a magnitude or logical comparison between one or a combination of the R, G, or B color components and reference value(s). The comparison test may be for example, greater-than, less-than, equal-to, greater-than-or-equal-to, xe2x80x9cgreater-than-c1 and less-than c2xe2x80x9d where c1 and c2 are sore predetermined reference values, and so forth. One might for example, specify a reference minimum R value, and a reference maximum R value, such that the color test would be passed only if the pixel R value is between that minimum and maximum. Color test might, for example, be useful to provide blue-screen functionality. The comparison test may also be performed on a single color component or on a combination of color components. Furthermore, although for the alpha test one typically has one value for each component, for the color test there are effectively two values per component, a maximum value and a minimum value.
Under OpenGL, stencil test conditionally discards a fragment based on the outcome of a comparison between a value stored in a stencil buffer at location (xw, yw) and a reference value. Several stencil comparison functions are permitted such that the stencil test passes never, always, if the reference value is less than, less than or equal to, equal to, greater than or equal to, greater than, or not equal to the masked stored value in the stencil buffer. The Under OpenGL, if the stencil test fails, the incoming fragment is discarded. The reference value and the comparison value can have multiple bits, typically 8 bits so that 256 different values may be represented. When an object is rendered into the frame buffer, a tag having the stencil bits is also written into the frame buffer. These stencil bits are part of the pipeline state. The type of stencil test to perform can be specified at the time the geometry is rendered.
The stencil bits are used to implement various filtering, masking or stenciling operations. For example, if a particular fragment ends up affecting a particular pixel in the frame buffer, then the stencil bits can be written to the frame buffer along with the pixel information.
Under OpenGL, the depth buffer test discards the incoming fragment if a depth comparison fails. The comparison is enabled or disabled with the generic Enable and Disable commands using the OpenGL symbolic constant DEPTH_TEST. When depth test is disabled, the depth comparison and subsequent possible updates to the depth buffer value are bypassed and a fragment is passed to the next operation. The stencil bits are also involved and are modified even if the test is bypassed. The stencil value is modified if the depth buffer test passed. If depth test is enabled, the depth comparison takes place and the depth buffer and stencil value may subsequently be modified. The manner in which the depth test is implemented in OpenGL is described in greater detail in the OpenGL specification at page 145.
Depth comparisons are implemented in which possible outcomes are as follows: the depth buffer test passes never, always, if the incoming fragment""s Z, value is less than, less than or equal to, equal to, greater than, greater than or equal to, or not equal to the depth value stored at the location given by the incoming fragment""s (xw, yw) coordinates. If the depth buffer test fails, the incoming fragment is discarded. The stencil value at the fragment""s(xw, yw) coordinate is updated according to the function currently in effect for depth buffer test failure. Otherwise, the fragment continues to the next operation and the value of the depth buffer at the fragment""s (xw, yw) location is set to the fragment""s z, value. In this case the stencil value is updated according to the function currently in effect for depth buffer test success. The necessary OpenGL state is an eight-valued integer and a single bit indicating whether depth buffering is enabled or disabled.
Under OpenGL, blending combines the incoming fragment""s R, G, B, and A values with the R, G, B, and A values stored in the framebuffer at the incoming fragment""s (Xw, Yw) location.
This blending is typically dependent on the incoming fragment""s alpha value (A) and that of the corresponding frame buffer stored pixel. In the following discussion, Cs refers to the source color for an incoming fragment, Cd refers to the destination color at the corresponding framebuffer location, and Cc refers to a constant color in-the GL state. Individual RGBA components of these colors are denoted by subscripts of s, d, and c respectively.
Blending is basically an operation that takes color in the frame buffer and the color in the fragment, and blends them together. The manner in which blending is achieved, that is the particular blending function, may be selected from various alternatives for both the source and destination.
Blending is described in the OpenGL specification at page 146-149 and is hereby incorporated by reference. Various blend equations are available under OpenGL. For example, an additive type blend is available wherein a blend result (C) is obtained by adding the product of a source color (Cs) by a source weighting factor quadruplet (S) to the product of a destination color (Cd) and a destination weighting factor (D) quadruplet, that is C=CsS+CdD. Alternatively, the blend equation may be a subtraction (C=CsSxe2x88x92CdD), a reverse subtraction (C=CdDxe2x88x92CsS), a minimum function (C=min(Cs, Cd)), a maximum function (C=max(Cs, Cd)). Under OpenGL, the blending equation is evaluated separately for each color component and its corresponding weighting coefficient. Each of the four R, G, B, A components has its own weighting factor.
The blending test (or blending equation) is part of pipeline state and can potentially change for every polygon, but more typically would chang only for the object made up or several polygons.
In generally, blending is only performed once other tests such as the pixel ownership test and stencil test have been passed so that it is clear that the pixel or fragment under consideration would or could have an effect in the output.
Under OpenGL, dithering selects between two color values or indices. In RGBA mode, consider the value of any of the color components as a fixed-point value with m bits to the left of the binary point, where m is the number of bits allocated to that component in the framebuffer; call each such value c. For each c, dithering selects a value c1 such that c1xcex5 {max{0,[c]-1, [c]}. This selection may depend on the xw and yw coordinates of the pixel. In color index mode, the same rule applies with c being a single color index. The value of c must not be larger than the maximum value representable in the framebuffer for either the component or the index.
Although many dithering algorithms are possible, a dithered value produced by any algorithm must generally depend only the incoming value and the fragment""s x and y window coordinates. When dithering is disabled, each color component is truncated to a fixed-point value with as many bits as there are in the corresponding framebuffer component, and the color index is rounded to the nearest integer representable in the color index portion of the framebuffer.
The OpenGL Specification of dithering is described more fully in the OpenGL specification, particularly at pages 149-150, which are incorporated by reference.
Under OpenGL, there is a final logical operation applied between the incoming fragment""s color or index values and the color or index values stored in the frame buffer at the corresponding location. The result of the logical operation replaces the values in the framebuffer at the fragment""s (x, y) coordinates. Various logical operations may be implemented between source (s) and destination (d), including for example: clear, set, and, noop, xor, or, nor, nand, invert, copy, inverted and, equivalence, reverse or, reverse and, inverted copy, and inverted or. The logicop arguments and corresponding operations, as well as additional details of the OpenGL logicop implementation, are set forth in the OpenGL specification at pates 150-151. Logical operations are performed independently for each color index buffer that is selected for writing, or for each red, green, blue, and alpha value of each color buffer that is selected for writing. The required state is an integer indicating the logical operation, and two bits indicating whether the logical operation is enabled or disabled.
In this document, pixels are referred to as the smallest individually controllable element of the display device. But, because images are quantized into discrete pixels, spatial aliasing occurs. A typical aliasing artifact is a xe2x80x9cstaircasexe2x80x9d effect caused when a straight line or edge cuts diagonally across rows of pixels.
Some rendering systems reduce aliasing effects by dividing pixels into subpixels, where each sub-pixel can be colored independently. When the image is to be displayed, the colors for all sub-pixels within each pixel are blended together to form an average color for the pixel. A renderer that uses up to 16 sub-pixels per pixel is described in xe2x80x9cRealityEngine Graphicsxe2x80x9d, by Akeley, pages 109 to 116 of SIGGRAPH93 Proceedings, Aug. 1-6, 1993, Computer Graphics Proceedings, Annual Conference Series, published by ACM SIGGRAPH, New York, 1993, Softcover ISBN 0-201-58889-7 and CD-ROM ISBN 0-201-56997-3 (hereinafter referred to as the Akeley Reference).
Another prior art antialiasing method is the A-Buffer used to perform blending (this technique is also included in the Akeley Reference), and is described in xe2x80x9cThe A-buffer, an Antialiased Hidden Surface Methodxe2x80x9d by L. Carpenter, SIGGRAPH 1984 Conference Proceedings, pp.103-108 (hereinafter referred to as the Carpenter Reference). The A-buffer is an antialiasing technique that reduces aliasing by keeping track of the percent coverage of a pixel by a rendered polygon. The main drawback to this technique is the need to sort polygons front-to-back (or back-to-front) at each pixel in order to get acceptable antialiased polygons.
Most Content Addressable Memories (CAM) perform a bit-for-bit equality test between an input vector and each of the data words stored in the CAM. This type of CAM frequently provides masking of bit positions in order to eliminate the corresponding bit in all words from affecting the equality test. It is inefficient to perform magnitude comparisons in a equality-testing CAM because a large number of clock cycles is required to do the task. CAMs are presently used in translation look-aside buffers within a virtual memory systems in some computers. CAMs are also used to match addresses in high speed computer networks.
Magnitude Comparison CAM (MCCAM) is defined here as any CAM where the stored data are treated as numbers, and arithmetic magnitude comparisons (i.e. less-than, greater-than, less-than-or-equal-to, and the like) are performed on the data in parallel. This is in contrast to ordinary CAM which treats stored data strictly as bit vectors, not as numbers. An MCCAM patent, included herein by reference, is U.S. Pat. No. 4,996,666, by Jerome F. Duluk Jr., entitled xe2x80x9cContent-Addressable Memory System Capable of Fully Parallel Magnitude Comparisonsxe2x80x9d, granted Feb. 26, 1991 (hereinafter referred to as the Duluk Patent). Structures within the Duluk Patent specifically referenced shall include the prefix xe2x80x9cDuluk Patentxe2x80x9d (for example, ""Duluk Patent MCCAM Bit Circuitxe2x80x9d).
The basic internal structure of an MCCAM is a set of memory bits organized into words, where each word can perform one or more arithmetic magnitude comparisons between the stored data and input data. In general, for an MCCAM, when a vector of numbers is applied in parallel to an array of words, all arithmetic comparisons in all words occur in parallel. Such a parallel search comparison operation is called a xe2x80x9cqueryxe2x80x9d of the stored data.
The invention described here augments the capability of the MCCAM by adding various features, including the ability to output all the query result bits every clock cycle and to logically xe2x80x9corxe2x80x9d together these output query result bits to form additional outputs.
Overview of Aspects of the Invention - Top Level Summary
Computer graphics is the art and science of generating pictures or images with a computer. This picture generation is commonly referred to as rendering. The appearance of motion, for example in a 3-Dimensional animation is achieved by displaying a sequence of images. Interactive 3-Dimensional (3D) computer graphics allows a user to change his or her viewpoint or to change the geometry in real-time, thereby requiring the rendering system to create new images on-the-fly in real-time. Therefore, real-time performance in color, with high quality imagery is becoming increasingly important.
The invention is directed to a new graphics processor and method and encompasses numerous substructures including specialized subsystems, subprocessors, devices, architectures, and corresponding procedures. Embodiments of the invention may include one or more of deferred shading, a tiled frame buffer, and multiple-stage hidden surface removal processing, as well as other structures and/or procedures. In this document, this graphics processor is hereinafter referred to as the DSGP (for Deferred Shading Graphics Processor), or the DSGP pipeline, but is sometimes referred to as the pipeline.
This present invention includes numerous embodiments of the DSGP pipeline. Embodiments of the present invention are designed to provide high-performance 3D graphics with Phong shading, subpixel anti-aliasing, and texture- and bump-mapping in hardware. The DSGP pipeline provides these sophisticated features without sacrificing performance.
The DSGP pipeline can be connected to a computer via a variety of possible interfaces, including but not limited to for example, an Advanced Graphics Port (AGP) and/or a PCI bus interface, amongst the possible interface choices. VGA and video output are generally also included. Embodiments of the invention supports both OpenGL and Direct3D APIs. The OpenGL specification, entitled xe2x80x9cThe OpenGL Graphics System: A Specification (Version 1.2)xe2x80x9d by Mark Segal and Kurt Akeley, edited by Jon Leech, is included incorporated by reference.
Several exemplary embodiments or versions of a Deferred Shading Graphics Pipeline are now described.
Several versions or embodiments of the Deferred Shading Graphics Pipeline are described here, and embodiments having various combinations of features may be implemented. Furthermore, features of the invention may be implemented independently of other features. Most of the important features described above can be applied to all versions of the DSGP pipeline.
Each frame (also called a scene or user frame) of 3D graphics primitives is rendered into a 3D window on the display screen. A window consists of a rectangular grid of pixels, and the window is divided into tiles (hereinafter tiles are assumed to be 16xc3x9716 pixels, but could be any size). If tiles are not used, then the window is considered to be one tile. Each tile is further divided into stamps (hereinafter stamps are assumed to be 2xc3x972 pixels, thereby resulting in 64 stamps per tile, but stamps could be any size within a tile). Each pixel includes one or more of samples, where each sample has its own color values and z-value (hereinafter, pixels are assumed to include four samples, but any number could be used). A fragment is the collection of samples covered by a primitive within a particular pixel. The term xe2x80x9cfragmentxe2x80x9d is also used to describe the collection of visible samples within a particular primitive and a particular pixel.
In ordinary Z-buffer rendering, the renderer calculates the color value (RGB or RGBA) and z value for each pixel of each primitive, then compares the z value of the new pixel with the current z value in the Z-buffer. If the z value comparison indicates the new pixel is xe2x80x9cin front ofxe2x80x9d the existing pixel in the frame buffer, the new pixel overwrites the old one; otherwise, the new pixel is thrown away.
Z-buffer rendering works well and requires no elaborate hardware. However, it typically results in a great deal of wasted processing effort if the scene contains many hidden surfaces. In complex scenes, the renderer may calculate color values for ten or twenty times as many pixels as are visible in the final picture. This means the computational cost of any per-pixel operationxe2x80x94such as Phong shading or texture-mappingxe2x80x94is multiplied by ten or twenty. The number of surfaces per pixel, averaged over an entire frame, is called the depth complexity of the frame. In conventional z-buffered renderers, the depth complexity is a measure of the renderer""s inefficiency when rendering a particular frame.
In a pipeline that performs deferred shading, hidden surface removal (HSR) is completed before any pixel coloring is done. The objective of a deferred shading pipeline is to generate pixel colors for only those primitives that appear in the final image (i.e., exact HSR). Deferred shading generally requires the primitives to be accumulated before HSR can begin. For a frame with only opaque primitives, the HSR process determines the single visible primitive at each sample within all the pixels. Once the visible primitive is determined for a sample, then the primitive""s color at that sample location is determined. Additional efficiency can be achieved by determining a single per-pixel color for all the samples within the same pixel, rather than computing per-sample colors.
For a frame with at least some alpha blending (as defined in the afore referenced OpenGL specification) of primitives (generally due to transparency), there are some samples that are colored by two or more primitives. This means the HSR process must determine a set of visible primitives per sample.
In some APIs, such as OpenGL, the HSR process can be complicated by other operations (that is by operation other than depth test) that can discard primitives. These other operations include: pixel ownership test, scissor test, alpha test, color test, and stencil test (as described elsewhere in this specification). Some of these operations discard a primitive based on its color (such as alpha test), which is not determined in a deferred shading pipeline until after the HSR process (this is because alpha values are often generated by the texturing process, included in pixel fragment coloring). For example, a primitive that would normally obscure a more distant primitive (generally at a greater z-value) can be discarded by alpha test, thereby causing it to not obscure the more distant primitive. A HSR process that does not take alpha test into account could mistakenly discard the more distant primitive. Hence, there may be an inconsistency between deferred shading and alpha test (similarly, with color test and stencil test); that is, pixel coloring is postponed until after hidden surface removal, but hidden surface removal can depend on pixel colors. Simple solutions to this problem include: 1) eliminating non-depth-dependent tests from the API, such as alpha test, color test, and stencil test, but this potential solution might prevent existing programs from executing properly on the deferred shading pipeline; and 2) having the HSR process do some color generation, only when, needed, but this potential solution would complicate the data flow considerably. Therefore, neither of these choices is attractive. A third alternative, called conservative hidden surface removal (CHSR), is one of the important innovations provided by the inventive structure and method. CHSR is described in great detail in subsequent sections of the specification.
Another complication in many APIs is their ability to change the depth test. The standard way of thinking about 3D rendering assumes visible objects are closer than obscured objects (i.e., at lesser z-values), and this is accomplished by selecting a xe2x80x9cless-thanxe2x80x9d depth test (i.e., an object is visible if its z-value is xe2x80x9cless-thanxe2x80x9d other geometry). However, most APIs support other depth tests such as: greater-than, less-than, greater-than-or-equal-to, equal, less-than-or-equal-to, less-than, not-equal, and the like algebraic, magnitude, and logical relationships. This essentially xe2x80x9cchanges the rulesxe2x80x9d for what is visible. This complication is compounded by an API allowing the application program to change the depth test within a frame. Different geometry may be subject to drastically different rules for visibility. Hence, the time order of primitives with different rendering rules must be taken into account. For example, in the embodiment illustrated in FIG. 4, three primitives are shown with their respective depth test (only the z dimension is shown in the figure, so this may be considered the case for one sample). If they are rendered in the order A, B, then C, primitive B will be the final visible surface. However, if the primitives are rendered in the order C, B, then A, primitive A will be the final visible surface. This illustrates how a deferred shading pipeline must preserve the time ordering of primitives, and correct pipeline state (for example, the depth test) must be associated with each primitive.
A conventional 3D graphics pipeline is illustrated in FIG. 2. We now describe a first embodiment of the inventive 3D Deferred Shading Graphics Pipeline Version 1 (hereinafter xe2x80x9cDSGPv1xe2x80x9d), relative to FIG. 4. It will be observed that the inventive pipeline (FIG. 4) has been obtained from the generic conventional pipeline (FIG. 2) by replacing the drawing intensive functions 231 with: (1) a scene memory 250 for storing the pipeline state and primitive data describing each primitive, called scene memory in the figure; (2) an exact hidden surface removal process 251; (3) a fragment coloring process 252; and (4) a blending process 253.
The scene memory 250 stores the primitive data for a frame, along with their attributes, and also stores the various settings of pipeline state throughout the frame. Primitive data includes vertex coordinates, texture coordinates, vertex colors, vertex normals, and the like In DSGPv1, primitive data also includes the data generated by the setup for incremental render, which includes spatial, color, and edge derivatives.
When all the primitives in a frame have been processed by the floating-point intensive functions 213 and stored into the scene memory 250, then the HSR process commences. The scene memory 250 can be double buffered, thereby allowing the HSR process to perform computations on one frame while the floating-point intensive functions perform computations on the next frame. The scene memory can also be triple buffered. The scene memory could also be a scratchpad for the HSR process, storing intermediate results for the HSR process, allowing the HSR process to start before all primitive have been stored into the scene memory.
In the scene memory, every primitive is associated with the pipeline state information that was valid when the primitive was input to the pipeline. The simplest way to associate the pipeline state with each primitive is to include the entire pipeline state within each primitive. However, this would introduce a very large amount of redundant information because much of the pipeline state does not change between most primitives (especially when the primitives are in the same object). The preferred way to store information in the scene memory is to keep separate lists: one list for pipeline state settings and one list for primitives. Furthermore, the pipeline state information can be split into a multiplicity of sub-lists, and additions to each sub-list occurs only when part of the sub-list changes. The preferred way to store primitives is done by storing a series of vertices, along with the connectivity information to re-create the primitives. This preferred way of storing primitives eliminates redundant vertices that would otherwise occur in polygon meshes and line strips.
The HSR process described relative to DSGPv1 is required to be an exact hidden surface removal (EHSR) because it is the only place in the DSGPv1 where hidden surface removal is done. The exact hidden surface removal (EHSR) process 251 determines precisely which primitives affect the final color of the pixels in the frame buffer. This process accounts for changes in the pipeline state, which introduces various complexities into the process. Most of these complications stem from the per-fragment operations (ownership test, scissor test, alpha test, and the like), as described above. These complications are solved by the innovative conservative hidden surface removal (CHSR) process, described later, so that exact hidden surface removal is not required.
The fragment coloring process generates colors for each sample or group of samples within a pixel. This can include: Gouraud shading, texture mapping, Phong shading, and various other techniques for generating pixel colors. This process is different from edged walk 232 and span interpolation 234 because this process must be able to efficiently generate colors for subsections of primitives. That is, a primitive may be partially visible, and therefore, colors need to be generated for only some of its pixels, and edge walk and span interpolation assume the entire primitive must be colored. Furthermore, the HSR process may generate a multiplicity of visible subsections of a primitive, and these may be interspersed in time amongst visible subsections of other primitives. Hence, the fragment coloring process 252 should be capable of generating color values at random locations within a primitive without needing to do incremental computations along primitive edges or along the x-axis or y-axis.
The blending process 253 of the inventive embodiment combines the fragment colors together to generate a single color per pixel. In contrast to the conventional z-buffered blend process 236, this blending process 253 does not include z-buffer operations because the exact hidden surface removal process 251 as already determined which primitives are visible at each sample. The blending process 253 may keep separate color values for each sample, or sample colors may be blended together to make a single color for the entire pixel. If separate color values are kept per sample and are stored separately into the Frame buffer 240, then final pixel colors are generated from sample colors during the scan out process as data is sent to the digital to analog converter 242.
As described above for DSGPv1, the scene memory 250 stores: (1) primitive data; and (2) pipeline state. In a second embodiment of the Deferred Shading Graphics Pipeline 260 (Version 2) (DSGPv2),illustrated in FIG. 5, this scene memory 250 is split into two parts: a spatial memory 261 part and polygon memory 262 part. The split of the data is not simply into primitive data and pipeline state data.
In DSGPv2, the part of the pipeline state data needed for HSR is stored into spatial memory 261, while the rest is stored into polygon memory 262. Examples of pipeline state needed for HSR include (as defined, for example, in the OpenGL Specification) are DepthFunc, DepthMask, StencilEnable, etc. Examples of pipeline state not needed for HSR include: BlendEquation, BlendFunc, stipple pattern, etc. While the choice or identification of a particular blending function (for example, choosing R=RsAs+R0(1-As)) is not needed for HSR, the HSR process must account for whether the primitive is subject to blending, which generally means the primitive is treated as not being able to fully occlude prior geometry. Similarly, the HSR process must account for whether the primitive is subject to scissor test, alpha test, color test, stencil test, and other per-fragment operations.
Primitive data is also split. The part of the primitive data needed for HSR is stored into spatial memory 261, and the rest of the primitive data is stored into polygon memory 262. The part of primitive data needed for HSR includes vertex locations and spatial derivatives (i.e., xcex4z/xcex4x, xcex4x/xcex4y, dx/dy for edges, etc.). The part of primitive data not needed for HSR includes vertex colors, texture coordinates, color derivatives, etc. If per-fragment lighting is performed in the pipeline, the entire lighting equation is applied to every fragment. But in a deferred shading pipeline, only visible fragments require lighting calculations. In this case, the polygon memory may also include vertex normals, vertex eye coordinates, vertex surface tangents, vertex binormals, spatial, derivatives of all these attributes, and other per-primitive lighting information.
During the HSR process, a primitive""s spatial attributes are accessed repeatedly, especially if the HSR process is done on a per-tile basis. Splitting the scene memory 250 into spatial memory 261 and polygon memory 262 has the advantage of reducing total memory bandwidth.
The output from setup for incremental render 230 is input to the spatial data separation process 263, which stores all the data needed for HSR into spatial memory 261 and the rest of the data into polygon memory 262. The EHSR process 264 receives primitive spatial data (e.g., vertex screen coordinates, spatial derivatives, etc.) and the part of the pipeline state needed for HSR (including all control bits for the per-fragment testing operations).
When visible fragments are output from the EHSR 264, the data matching process 265 matches the vertex state and pipeline state with visible fragments, and tile information is stored in tile buffers 266. The remainder of the pipeline is primarily concerned with the scan out process including sample to/from pixel conversion 267, reading and writing to the frame buffer, double buffered MUX output look-up, and digital to analog (D/A) conversion of the data stored in the frame buffer to the actual analog display device signal values.
In a third embodiment of the Deferred Shading Graphics Pipeline (Version 3) (DSGPv3), illustrated in FIG. 6, the scene memory 250 is still split into two parts (a spatial memory 261 and polygon memory 262) and in addition the setup for incremental render 230 is replaced by a spatial setup which occurs after data separation and prior to exact hidden surface removal. The remainder of the pipeline structure and processes are unchanged from those already described relative to the first embodiment.
In a fourth embodiment of the Deferred Shading Graphics Pipeline (Version 4) (DSGPv4), illustrated in FIG. 7, the exact hidden surface removal of the third embodiment (FIG. 6) is replace by a conservative hidden surface removal structure and procedure and a down-stream z-buffered blend replaces the blending procedure.
In a fifth embodiment of the Deferred Shading Graphics Pipeline (Version 5) (DSGPv5), illustrated in FIG. 8, exact hidden surface removal is used as in the third embodiment, however, the tiling is added, and a tile sorting procedure is added after data separation, and the read is by tile prior to spatial setup. In addition, the polygon memory of the first three embodiments is replaced with a state memory.
In a sixth embodiment of the Deferred Shading Graphics Pipeline (Version 6) (DSGPv6), illustrated in FIG. 9, the exact hidden surface removal of the fifth embodiment (FIG. 8) is replaced with a conservative hidden surface removal, and the downstream blending of the fifth embodiment is replaced with a z-buffered blending (Testing and Blending). This sixth embodiment is preferred because it incorporates several of the beneficial features provided by the inventive structure and method including: a two-part scene memory, primitive data splitting or separation, spatial setup, tiling and per tile processing, conservative hidden surface removal, and z-buffered blending (Testing and Blending), to name a few features.
It should be noted that although several exemplary embodiments of the inventive Graphics Pipeline have been shown and described relative to FIGS. 4-9, those workers having ordinary skill in the art in light of the description provided here will readily appreciate that the inventive structures and procedures may be implemented in different combinations and permutations to provide other embodiments of the invention, and that the invention is not limited to the particular combinations specifically identified here.
The pipeline renders primitives, and the invention is described relative to a set of renderable primitives that include: 1) triangles, 2) lines, and 3) points. Polygons with more than three vertices are divided into triangles in the Geometry block, but the DSGP pipeline could be easily modified to render quadrilaterals or polygons with more sides. Therefore, since the pipeline can render any polygon once it is broken up into triangles, the inventive renderer effectively renders any polygon primitive.
To identify what part of a 3D window on the display screen a given primitive may affect, the pipeline divides the 3D window being drawn into a series of smaller regions, called tiles and stamps. The pipeline performs deferred shading, in which pixel colors are not determined until after hidden-surface removal. The use of a Magnitude Comparison Content Addressable Memory (MCCAM) allows the pipeline to perform hidden geometry culling efficiently.
One of the central ideas or inventive concepts provided by the invention pertains to Conservative Hidden Surface Removal (CHSR). The CHSR processes each primitive in time order and, for each sample that a primitive touches, makes conservative decision based on the various API state variables, such at depth test and alpha test. One of the important features of the CHSR process is that color computation does not need to be done during hidden surface removal even though non-depth-dependent tests from the API, such as alpha test, color test, and stencil test can be performed by the DSGP pipeline. The CHSR process can be considered a finite state machine (FSM) per sample. Hereinafter, each per-sample FSM is called a sample finite state machine (SFSM). Each SFSM maintains per-sample data including: (1) z-coordinate information; (2) primitive information (any information needed to generate the primitive""s color at that sample or pixel); and (3) one or more sample state bits (for example, these bits could designate the z-value or z-values to be accurate or conservative). While multiple z-values per sample can be easily used, multiple sets of primitive information per sample would be expensive. Hereinafter, it is assumed that the SFSM maintains primitive information for one primitive. The SFSM may also maintain transparency information, which is used for sorted transparencies, described in the next section.
As an example of the CHSR process dealing with alpha test, consider the diagrammatic illustration of FIGS. 10-14, particularly FIG. 11. This diagram illustrates the rendering of six primitives (Primitives A, B, C, D, E, and F) at different z-coordinate locations for a particular sample, rendered in the following order (starting with a xe2x80x9cdepth clearxe2x80x9d and with xe2x80x9cdepth testxe2x80x9d set to less-than): primitives A, B, and C (with xe2x80x9calpha testxe2x80x9d disabled); primitive D (with xe2x80x9calpha testxe2x80x9denabled); and primitives E and F (with xe2x80x9calpha testxe2x80x9d disabled). We note from the illustration that zA greater than zC greater than zB greater than zE greater than zD greater than zF, such that primitive A is at the greatest z-coordinate distance. We also note that alpha test is enabled for primitive D, but disabled for each of the other primitives.
Recall from the earlier description of CHSR, that the CHSR process may be considered to be a sample finite state machine (SFSM). The steps for rendering these six primitives under the conservative hidden surface removal process with alpha test are as follows:
Step 1: The depth clear causes the following result in each sample finite state machine (SFSM): 1) z-values are initialized to the maximum value; 2) primitive information is cleared; and 3) sample state bits are set to indicate the z-value is accurate.
Step 2: When primitive A is processed by the SFSM, the primitive is kept (i.e., it becomes the current best guess for the visible surface), and this causes the SFSM to store: 1) the z-value zA as the xe2x80x9cnearxe2x80x9d z-value; 2) primitive information needed to color primitive A; and 3) the z-value zZA) is labeled as accurate.
Step 3: When primitive B is processed by the SFSM, the primitive is kept (its z-value is less-than that of primitive A), and this causes the SFSM to store: 1) the z-value zB as the xe2x80x9cnearxe2x80x9d z-value (zA is discarded); 2) primitive information needed to color primitive B (primitive A""s information is discarded); and 3) the z-value (zB) is labeled as accurate.
Step 4: When primitive C is processed by the SFSM the primitive is discarded (i.e., it is obscured by the current best guess for the visible surface, primitive B), and the SFSM data is not changed.
Step 5: When primitive D (which has alpha test enabled) is processed by the SFSM, the primitive""s visibility can not be determined because it is closer than primitive B and because its alpha value is unknown at the time the SFSM operates. Because a decision can not be made as to which primitive would end up being visible (either primitive B or primitive D) primitive B is sent down the pipeline (to have its colors generated) and primitive D is kept. Hereinafter, this is called xe2x80x9cearly dispatchxe2x80x9d of primitive B. When processing of primitive D has been completed, the SFSM stores: 1) the xe2x80x9cnearxe2x80x9d z-value is zD and the xe2x80x9cfarxe2x80x9d z-value is ZB; 2) primitive information needed to color primitive D (primitive B""s information has undergone early dispatch); and 3) the z-values are labeled as conservative (because both a near and far are being maintained). In this condition, the SFSM can determine that a piece of geometry closer than zD obscures previous geometry, geometry farther than zB is obscured, and geometry between zD and zB is indeterminate and must be assumed to be visible (hence a conservative assumption is made). When an SFSM is in the conservative state and it contains valid primitive information, the SFSM method considers the depth value of the stored primitive information to be the near depth value.
Step 6: When primitive E (which has alpha test disabled) is processed by the SFSM, the primitive""s visibility can not be determined because it is between the near and far z-values (i.e., between zD and zB). However, primitive E is not sent down the pipeline at this time because it could result in the primitives reaching the z-buffered blend (later described as part of the Pixel Block in the preferred embodiment) out of correct time order. Therefore, primitive D is sent down the pipeline to preserve the time ordering. When processing of primitive E has been completed, the SFSM stores: 1) the xe2x80x9cnearxe2x80x9d z-value is zD and the xe2x80x9cfarxe2x80x9d z-value is zD. (note these have not changed, and zE is not kept); 2) primitive information needed to color primitive E (primitive D""s information has undergone early dispatch); and 3) the z-values are labeled as conservative (because both a near and far are being maintained).
Step 7: When primitive F is processed by the SFSM, the primitive is kept (its z-value is less-than that of the near z-value), and this causes the SFSM to store: 1) the z-value zF as the xe2x80x9cnearxe2x80x9d z-value (zD and zB are discarded); 2) primitive information needed to color primitive F (primitive E""s information is discarded); and 3) the z-value (zF) is labeled as accurate.
Step 8: When all the geometry that touches the tile has been processed (or, in the case there are no tiles, when all the geometry in the frame has been processed), any valid primitive information is sent down the pipeline. In this case, primitive F""s information is sent. This is the end-of-tile (or end-of-frame) dispatch, and not an early dispatch.
In summary of this exemplary CHSR process, primitives A through F have been processed, and primitives B, D, and F have been sent down the pipeline. To resolve the visibility of B, D, and F, a z-buffered blend (in the Pixel Block in the preferred embodiment) is included near the end of the pipeline. In this example, only the color primitive F is used for the sample.
In the preferred embodiment of the CHSR process, all stencil operations are done near the end of the pipeline (in the z-buffered blend, called the Pixel Block in the preferred embodiment), and therefore, stencil values are not available to the CSHR method (that takes place in the Cull Block of the preferred embodiment) because they are kept in the frame buffer. While it is possible for the stencil values to be transmitted from the frame buffer for use in the CHSR process, this would generally require a long latency path that would reduce performance. The stencil values can not be accurately maintained within the CHSR process because, in APIs such as OpenGL, the stencil test is performed after alpha test, and the results of alpha test are not known to the CHSR process, which means input to the stencil test can not be accurately modeled. Furthermore, renderers maintain stencil values over many frames (as opposed to depth values that are generally cleared at the start of each frame), and these stencil values are stored in the frame buffer. Because of all this, the CHSR process utilizes a conservative approach to dealing with stencil operations. If a primitive can affect the stencil values in the frame buffer, then the primitive is always sent down the pipeline (hereinafter, this is called a xe2x80x9cCullFlushOverlapxe2x80x9d, and is indicated by the assertion of the signal CullFlushOverlap in the Cull Block) because stencil operations occur before the depth test (see OpenGL specification). A CullFlushOverlap condition sets the SFSM to its most conservative state.
As another possibility, if the stencil reference value (see OpenGL specification) is changed and the stencil test is enabled and configured to discard sample values based on the stencil values in the frame buffer, then all the valid primitive information in the SFSMs are sent down the pipeline (hereinafter, this is called a xe2x80x9cCullFlushAllxe2x80x9d, and is indicated by the assertion of the signal CullFlushAll in the Cull Block) and the z-values are set to their maximum value. This xe2x80x9cflushingxe2x80x9d is needed because changing the stencil reference value effectively changes the xe2x80x9cvisibility rulesxe2x80x9d in the z-buffered blend (or Pixel Block)
As an example of the CHSR process dealing with stencil test (see OpenGL specification), consider the diagrammatic illustration of FIG. 12, which has two primitives (primitives A and C) covering four particular samples (with corresponding SFSMs, labeled SFSMO through SFSM3) and an additional primitive (primitive B) covering two of those four samples. The three primitives are rendered in the following order (starting with a depth clear and with depth test set to less-than): primitive A (with stencil test disabled); primitive B (with stencil test enabled and StencilOp set to xe2x80x9cREPLACExe2x80x9d, see OpenGL specification); and primitive C (with stencil test disabled). The steps are as follows:
Step 1: The depth clear causes the following in each of the four SFSMs in this example: 1) z-values are initialized to the maximum value; 2) primitive information is cleared; and 3) sample state bits are set to indicate the z-value is accurate.
Step 2: When primitive A is processed by each SFSM, the primitive is kept (i.e., it becomes the current best guess for the visible surface), and this causes the four SFSMs to store: 1) their corresponding z-values (either zA0, zA1, zA2, or zA3 respectively) as the xe2x80x9cnearxe2x80x9d z-value; 2) primitive information needed to color primitive A; and 3) the z-values in each SFSM are labeled as accurate.
Step 3: When primitive B is processed by the SFSMs, only samples 1 and 2 are affected, causing SFSM0 and SFSM3 to be unaffected and causing SFSM1 and SFSM2 to be updated as follows: 1) the far z-values are set to the maximum value and the near z-values are set to the minimum value; 2) primitive information for primitives A and B are sent down the pipeline; and 3) sample state bits are set to indicate the z-values are conservative.
Step 4: When primitive C is processed by each SFSM, the primitive is kept, but the SFSMs do not all handle the primitive the same way. In SFSM0 and SFSM3, the state is updated as: 1) zC0 and zC3 become the xe2x80x9cnearxe2x80x9d z-values (zA0 and zA3 are discarded); 2) primitive information needed to color primitive C (primitive A""s information is discarded); and 3) the z-values are labeled as accurate. In SFSM1 and SFSM2, the state is updated as: 1) zC1 and zC2 become the xe2x80x9cfarxe2x80x9d z-values (the near z-values are kept); 2) primitive information needed to color primitive C; and 3) the z-values remain labeled as conservative.
In summary of this example CHSR process, primitives A through C have been processed, and all the primitives were sent down the pipeline, but not in all the samples. To resolve the visibility, a z-buffered blend (in the Pixel Block in the preferred embodiment) is included near the end of the pipeline. Multiple samples were shown in this example to illustrate that CullFlushOverlap xe2x80x9cflushesxe2x80x9d selected samples while leaving others unaffected.
Alpha blending is used to combine the colors of two primitives into one color. However, the primitives are still subject to the depth test for the updating of the z-values.
As an example of the CHSR process dealing with alpha blending, consider FIG. 13, which has four primitives (primitives A, B, C, and D) for a particular sample, rendered in the following order (starting with a depth clear and with depth test set to less-than): primitive A (with alpha blending disabled); primitives B and C (with alpha blending enabled); and primitive D (with alpha blending disabled). The steps are as follows:
Step 1: The depth clear causes the following in each CHSR SFSM: 1) z-values are initialized to the maximum value; 2) primitive information is cleared; and 3) sample state bits are set to indicate the z-value is accurate.
Step 2: When primitive A is processed by the SFSM, the primitive is kept (i.e., it becomes the current best guess for the visible surface), and this causes the SFSM to store: 1) the z-value zA as the xe2x80x9cnearxe2x80x9d z-value; 2) primitive information needed to color primitive A; and 3) the z-value is labeled as accurate.
Step 3: When primitive B is processed by the SFSM, the primitive is kept (because its z-value is less-than that of primitive A), and this causes the SFSM to store: 1) the z-value zB as the xe2x80x9cnearxe2x80x9d z-value (zA is discarded); 2) primitive information needed to color primitive B (primitive A""s information is sent down the pipeline); and 3) the z-value (zB) is labeled as accurate. Primitive A is sent down the pipeline because, at this point in the rendering process, the color of primitive B is to be blended with primitive A. This preserves the time order of the primitives as they are sent down the pipeline.
Step 4: When primitive C is processed by the SFSM, the primitive is discarded (i.e., it is obscured by the current best guess for the visible surface, primitive B), and the SFSM data is not changed. Note that if primitives B and C need to be rendered as transparent surfaces, then primitive C should not be hidden by primitive B. This could be accomplished by turning off the depth mask while primitive B is being rendered, but for transparency blending to be correct, the surfaces should be blended in either front-to-back or back-to-front order.
If the depth mask (see OpenGL specification) is disabled, writing to the depth buffer (i.e., saving z-values) is not performed; however, the depth test is still performed. In this example, if the depth mask is disabled for primitive B, then the value zB is not saved in the SFSM. Subsequently, primitive C would then be considered visible because its z-value would be compared to zA.
In summary of this example CHSR process, primitives A through D have been processed, and all the primitives were sent down the pipeline, but not in all the samples. To resolve the visibility, a z-buffered blend (in the Pixel Block in the preferred embodiment) is included near the end of the pipeline. Multiple samples were shown in this example to illustrate that CullFlushOverlap xe2x80x9cflushesxe2x80x9d selected samples while leaving others unaffected.
Implementation of the Conservative Hidden Surface Removal procedure, advantageously maintains compatibility with other standard APIs, such as OpenGL. Recall that one complication of many APIs is their ability to change the depth test. Recall that the standard way of thinking about 3D rendering assumes visible objects are closer than obscured objects (i.e., at lesser z-values), and this is accomplished by selecting a xe2x80x9cless-thanxe2x80x9d depth test (i.e., an object is visible if its z-value is xe2x80x9cless-thanxe2x80x9d other geometry). Recall also, however, that most APIs support other depth tests, which may change within a frame, such as: greater-than, less-than, greater-than-or-equal-to, equal, less-than-or-equal-to, less-than, not-equal, and the like algebraic, magnitude, and logical relationships. This essentially dynamically xe2x80x9cchanges the rulesxe2x80x9d for what is visible, and as a result, the time order of primitives with different rendering rules must be taken into account.
In the case of the inventive conservative hidden surface removal, different or additional procedures are advantageously implemented for reasons described below, to maintain compatibility with other standard APIs when a xe2x80x9cgreater-thanxe2x80x9d depth test is used. Those workers having ordinary skill in the art will also realize that analogous changes may advantageously be employed if the depth test is greater-than-or-equal-to, or other functional relationship that would otherwise result in the anomalies described.
We note further that with a conventional non-deferred shader, one executes a sequence of rules for every geometry item and then look to see the final rendered result. By comparison, in embodiments of the inventive deferred shader, that conventional paradigm is broken. The inventive structure and method anticipate or predict what geometry will actually affect the final values in the frame buffer without having to make or generate all the colors for every pixel inside of every piece of geometry. In principle, the spatial position of the geometry is examined, and a determination is made for any particular sample, the one geometry item that affects the final color in the z-buffer, and then generate only that color.
Samples are done in parallel, and generally all the samples in all the pixels within a stamp are done in parallel. Hence, if one stamp can be processed per clock cycle (and there are 4 pixels per stamp and 4 samples per pixel), then 16 samples are processed per clock cycle. A xe2x80x9cstampxe2x80x9d defines the number of pixels and samples processed at one time. This per-stamp processing is generally pipelined, with pipeline stalls injected if a stamp needs to be processed again before the same stamp (from a previous primitive) has completed (that is, unless out-of-order stamp processing can be handled).
If there are no early dispatches are needed, then only end-of-tile dispatches are needed. This is the case when all the geometry in a tile is opaque and there are no stencil tests or operations and there are no alpha tested primitives that could be visible.
The primitive information in each SFSM can be replaced by a pointer into a memory where all the primitive information is stored. As described in later in the preferred embodiment, the Color Pointer is used to point to a primitive""s information in Polygon Memory.
As an alternative, only the far z-value could be kept (the near z-value is not kept), thereby reducing data storage, but requiring the sample state bits to remain xe2x80x9cconservativexe2x80x9d after primitive F and also causing primitive E to be sent down the pipeline because it would not be known whether primitive E is in front or behind primitive F.
As an alternative to maintaining both a near z-value and a far z-value, only the far z-value could be kept, thereby reducing data storage, but requiring the sample state bits to remain xe2x80x9cconservativexe2x80x9d when they could have been labeled xe2x80x9caccuratexe2x80x9d, and also causing additional samples to be dispatched down the pipeline. In the first CHSR example above (the one including alpha test), the sample state bits would remain xe2x80x9cconservativexe2x80x9d after primitive F, and also, primitive E would be sent down the pipeline because it would not be known whether primitive E is in front or behind primitive F due to the lack of the near z-value.
Processing stamps has greater efficiency than simply allowing for SFSMs to operate in parallel on a stamp-by-stamp basis. Stamps are also used to reduce the number of data packets transmitted down the pipeline. That is, when one sample within a stamp is dispatched (either early dispatch or end-of-tile dispatch), other samples within the same stamp and the same primitive are also dispatched (such a joint dispatch is hereinafter called a Visible Stamp Portion, or VSP). In the second CHSR example above (the one including stencil test), if all four samples were in the same stamp, then the early dispatching of samples 1 and 2 would cause early dispatching of samples 0 and 3. While this causes more samples to be sent down the pipeline and appear to increase the amount of color computation, it does not (in general) cause a net increase, but rather a net decrease in color computation. This is due to the spatial coherence within a pixel (i.e., samples within the same pixel tend to be either visible together or hidden together) and a tendency for the edges of polygons with alpha test, color test, stencil test, and/or alpha blending to potentially split otherwise spatially coherent stamps. That is, sending additional samples down the pipeline when they do not appreciably increase the computational load is more than offset by reducing the total number of VSPs that need to be sent. In the second CHSR example above, if all the samples are in the same stamp, then the same number of VSPs would be generated.
In the case of alpha test, if alpha values for a primitive arise only from the alpha values at the vertices (not from other places such as texturing), then a simplified alpha test can be done for entire primitives. That is, the vertex processing block (called GEO in later sections) can determine when any interpolation of the vertex alpha values would be guaranteed to pass the alpha test, and for that primitive, disable the alpha test. This can not be done if the alpha values can not be determined before CHSR is performed.
If a frame does not start with depth clear, then the SFSMs are set to their most conservative state (with near z-values at the minimum and far z-values at the maximum).
In the preferred embodiment, the CHSR process is performed in the Cull Block.
In the inventive structure and method, we note that time-order is preserved within each tile, including preserving time-order of pipeline state information. Clear packets are also used. In embodiments of the invention, the sorting is performed in hardware and RAMBUS memories advantageously permit dualoct storage of one vertex. For sorted transparency mode, guaranteed opaque geometry (that is, geometry that is known to obscure more distant geometry) is read out of Sort Memory in the first pass. In subsequent passes, the rest of the geometry is read once in each subsequent pass. In the preferred embodiment, the tile sorting method is performed in the Sort Block.
All vertices and relevant mode packets or state information packets are stored as a time order linear list. For each tile that""s touched by a primitive, a pointer is added to the vertex in that linear list that completes the primitive. For example, a triangle primitive is defined by 3 vertices, and a pointer would be added to the (third) vertex in the linear list to complete the triangle primitive. Other schemes that use the first vertex rather than the third vertex may alternatively be implemented.
In essence, a pointer is used to point to one of the vertices in the primitive, with adequate information for finding the other vertices in the primitive. When it""s time to read these primitives out, the entire primitive can be reconstructed from the vertices and pointers. Each tile is a list of pointers that point to vertices and permit recreation of the primitive from the list. This approach permits all of the primitives to be stored, even those sharing a vertex with another primitive, yet only storing each vertex once.
In one embodiment of the inventive procedure, one list per tile is maintained. We do not store the primitive in the list, but instead the list stores pointers to the primitives. These pointers are actually pointing to one of the primitives, and is a pointer into one of the vertices in the primitive, and the pointer also includes information adequate to find the other vertices in the same primitive. This sorting structure is advantageously implemented in hardware using the structure comprising three storage structures, a data storage, a tile pointer storage, and a mode pointer storage. For a given tile, the goal is to recreate the time-order sequence-of primitives that touch the particular tile being processed, but ignore the primitives that don""t touch the tile. We earlier extracted the modes and stored them separately, now we want to inject the mode packets into this stream of primitives at the right place. We note further that it is not enough to simply extract the mode packet at one stage and then reinject it at another stage, because the mode packet will be needed for processing the primitive, which may overly more than one tile. Therefore, the mode packets must be reassociated with all of the relevant tiles at the appropriate times.
One simple approach would be to write a pointer to the mode packet into every tile list. During subsequent reads of this list, it would be easy to access the mode packet address and read the appropriate mode data. However, this approach is disadvantageous because of the cost associated with writing the pointer to all or the tiles. In the inventive procedure, during processing of each tile, we read an entry from the appropriate tile pointer list and if we have read (fetched) the mode data for that vertex and sent it along, we merely retrieve the vertex from the data storage and send it down the pipeline; however, in the even that the mode data has changed between the last vertex retrieved and the next sequential vertex in the tile pointer list, then the mode data is fetched from the data storage and sent down the pipeline before the next vertex is sent so that the appropriate mode data is available when the vertex arrives. We note that entries in the mode pointer list identify at which vertex the mode changes. In one embodiment, entries. In the mode pointer store the first vertex for which the mode data pertains, however, alternative procedures, such as storing the last vertex for which the mode data applies could be used so long as consistent rules are followed.
The DSGP can operate in two distinct modes: 1) Time Order Mode, and 2) Sorted Transparency Mode. Time Order Mode is described above, and is designed to preserve, within any particular tile, the same temporal sequence of primitives. The Sorted Transparency mode is described immediately below. In the preferred embodiment, the control of the pipeline operating mode is done in the Sort Block.
The Sort Block is located in the pipeline between a Mode Extraction Unit (MEX) and Setup (STP) unit. Sort Block operates primarily to take geometry scattered around the display window and sort it into tiles. Sort Block also manages the Sort Memory, which stores all the geometry from the entire scene before it is rasterized, along with some mode information. Sort memory comprises a double-buffered list of vertices and modes. One page collects a scene""s geometry (vertex by vertex and mode by mode), while the other page is sending its geometry (primitive by primitive and mode by mode) down the rest of the pipeline.
When a page in sort memory is being written, vertices and modes are written sequentially into the sort memory as they are received by the sort block. When a page is read from sort memory, the read is done on a tile-by-tile basis, and the read process operates in two modes: (1) time order mode, and (2) sorted transparency mode.
In time ordered mode, time order of vertices and modes are preserved within each tile, where a tile is a portion of the display window bounded horizontally and vertically. By time order preserved, we mean that for a given tile, vertices and modes are read in the same order as they are written.
In sorted transparency mode, reading of each tile is divided into multiple passes, where, in the first pass, guaranteed opaque geometry is output from the sort block, and in subsequent passes, potentially transparent geometry is output from the sort block. Within each sorted transparency mode pass, the time ordering is preserved, and mode date is inserted in its correct time-order location. Sorted transparency mode by be performed in either back-to-front or front-to-back order. In the preferred embodiment, the sorted transparency method is performed jointly by the Sort Block and the Cull Block.
Conventionally hidden surfaces are removed using either an xe2x80x9cexactxe2x80x9d hidden surface removal procedure, or using z-buffers. In one embodiment of the inventive structure and method, a two-step approach is implemented wherein a (i) xe2x80x9cconservativexe2x80x9d hidden surface removal is followed by (ii) a z-buffer based procedure. In a different embodiment, a three-step approach is implemented: (i) a particular spatial Cull procedure, (ii) conservative hidden surface removal, and (iii) z-buffer. Various embodiments of conservative hidden surface removal (CHSR) has already been described elsewhere in this disclosure.
Each vertex includes a color pointer, and as vertices are received, the vertices including the color pointer are stored in sort memory data storage. The color pointer is a pointer to a location in the polygon memory vertex storage that includes a color portion of the vertex data. Associated with all of the vertices, of either a strip or a fan, is an Material-Lighting-Mode (MLM) pointer set. MLM includes six main pointers plus two other pointers as described below. Each of the six main pointers comprises an address to the polygon memory state storage, which is a sequential storage of all of the state that has changed in the pipeline, for example, changes in the texture, the pixel, lighting and so forth, so that as a need arises any time in the future, one can recreate the state needed to render a vertex (or the object formed from one or more vertices) from the MLM pointer associated with the vertex, by looking up the MLM pointers and going back into the polygon memory state storage and finding the state that existed at the time.
The Mode Extraction Block (MEX) is a logic block between Geometry and Sort that collects temporally ordered state change data, stores the state in Polygon memory, and attaches appropriate pointers to the vertex data it passes to Sort Memory. In the normal OpenGL pipeline, and in embodiments of the inventive pipeline up to the Sort block, geometry and state data is processed in the order in which it was sent down the pipeline. State changes for material type, lighting, texture, modes, and stipple affect the primitives that follow them. For example, each new object will be preceded by a state change to set the material parameters for that object.
In the inventive pipeline, on the other hand, fragments are sent down the pipeline in Tile order after the Cull block. The Mode Injection Block figures out how to preserve state in the portion of the pipeline that processes data in spatial (Tile) order instead of time order. In addition to geometry data, Mode Extraction Block sends a subset of the Mode data (cull_mode) down the pipeline for use by Cull. Cull_mode packets are produced in Geometry Block. Mode Extraction Block inserts the appropriate color pointer in the Geometry packets.
Pipeline state is broken down into several categories to minimize storage as follows: (1) Spatial pipeline state includes data headed for Sort that changes every vertex; (2) Cull_mode state includes data headed for Cull (via Sort) that changes infrequently; (3) Color includes data headed for Polygon memory that changes every vertex; (4) Material includes data that changes for each object; (5) TextureA includes a first set of state for the Texture Block for textures 0and1; (6) TextureB includes a second set of state for the Texture Block for textures 2 through 7; (7) Mode includes data that hardly ever changes; (8) Light includes data for Phong; (9) Stipple includes data for polygon stipple patterns. Material, Texture, Mode, Light, and Stipple data are collectively referred to as MLM data (for Material, Light and Mode). We are particularly concerned with the MLM pointers fir state preservation.
State change information is accumulated in the MEX until a primitive (Spatial and Color packets) appears. At that time, any MLM data that has changed since the last primitive, is written to Polygon Memory. The Color data, along with the appropriate pointers to MLM data, is also written to Polygon Memory. The spatial data is sent to Sort, along with a pointer into Polygon Memory (the color pointer). Color and MLM data are all stored in Polygon memory. Allocation of space for these records can be optimized in the micro-architecture definition to improve performance.
All of these records are accessed via pointers. Each primitive entry in Sort Memory contains a Color Pointer to the corresponding Color entry in Polygon Memory. The Color Pointer includes a Color Address, Color Offset and Color Type that allows us to construct a point, line, or triangle and locate the MLM pointers. The Color Address points to the final vertex in the primitive. Vertices are stored in order, so the vertices in a primitive are adjacent, except in the case of triangle fans. The Color Offset points back from the Color Address to the first dualoct for this vertex list. (We will refer to a point list, line strip, triangle strip, or triangle fan as a vertex list.) This first dualoct contains pointers to the MLM data for the points, lines, strip, or fan in the vertex list. The subsequent dualocts in the vertex list contain Color data entries. For triangle fans, the three vertices for the triangle are at Color Address, (Color Address-1), and (Color Address-Color Offset +1). Note that this is not quite the same as the way pointers are stored in Sort memory.
State is a time varying entity, and MEX accumulates changes in state so that state can be recreated for any vertex or set of vertices. The MIJ block is responsible for matching state with vertices down stream. Whenever a vertex comes into MEX and certain indicator bits are set, then a subset of the pipeline state information needs to be saved. Only the states that have changed are stored, not all states, since the complete state can be created from the cumulative changes to state. The six MLM pointers for Material, TextureA, TextureB, Mode, Light, and Stipple identify address locations where the most recent changes to the respective state information is stored. Each change in one of these state is identified by an additional entry at the end of a sequentially ordered state storage list stored in a memory. Effectively, all state changes are stored and when particular state corresponding to a point in time (or receipt of a vertex) is needed, the state is reconstructed from the pointers.
This packet of mode that are saved are referred to as mode packets, although the phrase is used to refer to the mode data changes that are stored, as well as to larger sets of mode data that are retrieved or reconstructed by MIJ prior to rendering.
We particularly note that the entire state can be recreated from the information kept in the relatively small color pointer.
Polygon memory vertex storage stores just the color portion. Polygon memory stores the part of pipeline stat that is not needed for hidden surface removal, and it also stores the part of the vertex data which is not needed for hidden surface removal (predominantly the items needed to make colors.)
The inventive structure and method may advantageously make use of trilinear mapping of multiple layers (resolutions) of texture maps.
Texture maps are stored in a Texture Memory which may generally comprise a single-buffered memory loaded from the host computer""s memory using the AGP interface. In the exemplary embodiment, a single polygon can use up to four textures. Textures are MIP-mapped. That is, each texture comprises a series of texture maps at different levels of detail or resolution, each map representing the appearance of the texture at a given distance from the eye point. To produce a texture value for a given pixel fragment, the Texture block performs tri-linear interpolation from the texture maps, to approximate the correct level of detail. The Texture block can alternatively performs other interpolation methods, such as anisotropic interpolation.
The Texture block supplies interpolated texture values (generally as RGBA color values) to the Phong block on a per-fragment basis. Bump maps represent a special kind of texture map. Instead of a color, each texel of a bump map contains a height field gradient.
The multiple layers are MIP layers, and interpolation is within and between the MIP layers. The first interpolation ii within each layer, then you interpolate between the two adjacent layers, one nominally having resolution greater than required and the other layer having less resolution than required, so that it is done 3-dimensionally to generate an optimum resolution.
The inventive pipeline includes a texture memory which includes a texture cache really a textured reuse register because the structure and operation are different from conventional caches. The host also includes storage for texture, which may typically be very large, but in order to render a texture, it must be loaded into the texture cache which is also referred to as texture memory. Associated with each VSP are S and T""s. In order to perform trilinear MIP mapping, we necessarily blend eight (8) samples, so the inventive structure provides a set of eight content addressable (memory) caches running in parallel. n one embodiment, the cache identifier is one of the content addressable tags, and that""s the reason the tag part of the cache and the data part of the cache is located are located separate from the tag or index. Conventionally, the tag and data are co-located so that a query on the tag gives the data. In the inventive structure and method, the tags and data are split up and indices are sent down the pipeline.
The data and tags are stored in different blocks and the content addressable lookup is a lookup or query of an address, and even the xe2x80x9cdataxe2x80x9d stored at that address in itself and index that references the actual data which is stored in a different block. The indices are determined, and sent down the pipeline so that the data referenced by the index can be determined. In other words, the tag is in one location, the texture data is in a second location, and the indices provide a link between the two storage structures.
In one embodiment of the invention Texel Reuse Detection Registers (TRDR) comprise a multiplicity of associate memories, generally located on the same integrated circuit as the texel interpolator. In the preferred embodiment, the texel reuse detection method is performed in the Texture Block.
In conventional 3-D graphics pipelines, an object in some orientation in space is rendered. The object has a texture map on it, and its represented by many triangle primitives. The procedure implemented in software, will instruct the hardware to load the particular object texture into a DRAM. Then all of the triangles that are common to the particular object and therefore have the same texture map are fed into the unit and texture interpolation is performed to generate all of the colored pixels need to represent that particular object. When that object has been colored, the texture map in DRAM can be destroyed since the object has been rendered. If there are more than one object that have the same texture map, such as a plurality of identical objects (possibly at different orientations or locations), then all of that type of object may desirably be textured before the texture map in DRAM is discarded. Different geometry may be fed in, but the same texture map could be used for all, thereby eliminating any need to repeatedly retrieve the texture map from host memory and place it temporarily in one or more pipeline structures.
In more sophisticated conventional schemes, more than one texture map may be retrieved and stored in the memory, for example two or several maps may be stored depending on the available memory, the size of the texture maps, the need to store or retain multiple texture maps, and the sophistication of the management scheme. Each of these conventional texture mapping schemes, spatial object coherence is of primary importance. At least for an entire single object, and typically for groups of objects using the same texture map, all of the triangles making up the object are processed together. The phrase spatial coherency is applied to such a scheme because the triangles form the object and are connected in space, and therefore spatially coherent.
In the inventive deferred shader structure and method we do not necessarily rely on or derive appreciable benefit from this type of spatial object coherence. Embodiments of the inventive deferred shader operate on tiles instead. Any given tile might have an entire object, a plurality of objects, some entire objects, or portions of several objects, so that spatial object coherence over the entire tile is typically absent.
Well we break that conventional concept completely because the inventive structure and method are directed to a deferred shader. Even if a tile should happen to have an entire object there will typically be different background, and the inventive Cull Block and Cull procedure will typically generate and send VSPs in a completely jumbled and spatially incoherent order, even if the tile might support some degree of spatial coherency. As a result, the pipeline and texture block are advantageously capable of changing the texture map on the fly in real-time and in response to the texture required for the object primitive (e.g. triangle) received. Any requirement to repeatedly retrieve the texture map from the host to process the particular object primitive (for example, single triangle) just received and then dispose of that texture when the next different object primitive needing a different texture map would be problematic to say the least and would preclude fast operation.
In the inventive structure and method, a sizable memory is supported on the card. In one implementation 128 megabytes are provided, but more or fewer megabytes may be provided. For example, 34 Mb, 64 Mb, 256 Mb, 512 Mb, or more may be provided, depending upon the needs of the user, the real estate available on the card for memory, and the density of memory available.
Rather that reading the 8 textels for every visible fragment, using them, and throwing them away so that the 8 textels for the next fragment can be retrieved and stored, the inventive structure and method stores and reuses them when there is a reasonable chance they will be needed again.
It would be impractical to read and throw away the eight textels every time a visible fragment is received. Rather, it is desirable to make reuse of these textels, because if you""re marching along in tile space, your pixel grid within the tile (typically processed along sequential rows in the rectangular tile pixel grid) could come such that while the same texture map is not needed for sequential pixels, the same texture map might be needed for several pixels clustered in a n area of the tile, and hence needed only a few process steps after the first use. Desirably, the invention uses the textels that have been read over and over, so when we need one, we read it, and we know that chances are good that once we have seem one fragment requiring a particular texture map, chances are good that for some period of time afterward while we are in the same tile, we will encounter another fragment from the same object that will need the same texture. So we save those things in this cache, and then on the fly we look up from the cache (texture reuse register) which ones we need. If there is a cache miss, for example, when a fragment and texture map are encountered for the first time, that texture map is retrieved and stored in the cache.
Texture Map retrieval latency is another concern, but is handled through the use of First-In-First-Out (FIFO) data structures and a look-ahead or predictive retrieval procedure. The FIFO""s are large and work in association with the CAM. When an item is needed, a determination is made as to whether it is already stored, and a designator is also placed in the FIFO so that if there is a cache miss, it is still possible to go out to the relatively slow memory to retrieve the information and store it. In either event, that is if the data was in the cache or it was retrieved from the host memory, it is placed in the unit memory (and also into the cache if newly retrieved).
Effectively, the FIFO acts as a sort of delay so that once the need for the texture is identified (prior to its actual use) the data can be retrieved and reassociated, before it is needed, such that the retrieval does not typically slow down the processing. The FIFO queues provide and take up the slack in the pipeline so that it always predicts and looks ahead. By examining the FIFO, non-cached texture can be identified, retrieved from host memory, placed in the cache and in a special unit memory, so that it is ready for use when a read is executed.
The FIFO and other structures that provide the look-ahead and predictive retrieval are provided in some sense to get around the problem created when the spatial object coherence typically used in per-object processing is lost in our per-tile processing. One also notes that the inventive structure and method makes use of any spatial coherence within an object, so that if all the pixels in one object are done sequentially, the invention does take advantage of the fact that there""s temporal and spatial coherence.
The inventive structure and method advantageously transfer information (such as data and control) from block to block in packets. We refer to this packetized communication as packetized data transfer and the format and/or content of the packetized data as the packetized data transfer protocol (PDTP). The protocol includes a header portion and a data portion.
One benefit of the PDTP is that all of the data can be sent over one bus from block to block thereby alleviating any need for separate busses for different data types. Another advantage of PDTP is that packetizing the information assists in keeping the ordering, which is important for proper rendering. Recall that rendering is sensitive to changes in pipeline state and the like so that maintaining the time order sequence is important generally, and with respect to the MIJ cache for example, management of the flow of packets down the pipeline is especially important.
The transfer of packets is sequential, since the bus is effectively a sequential link wherein packets arrive sequentially in some time order. If for example, a xe2x80x9cfill packetxe2x80x9d arrives in a block, it goes into the block""s FIFO, and if a VSP arrives, it also goes into the block""s FIFO. Each processor block waits for packets to arrive at its input, and when a packet arrives looks at the packet header to determine what action to take if any. The action may be to send the packet to the output (that is just pass it on without any other action or processing) or to do something with it. The packetized data structure and use of the packetized data structure alone and in conjunction with a bus, FIFO or other buffer or register scheme have applications broader than 3D graphics systems and may be applied to any pipeline structure where a plurality of functional or processing blocks or units are interconnected and communicate with each other. Use of packetized transfer is particularly beneficial where maintain sequential or time order is important.
In one embodiment of the PDTP each packet has a packet identifier or ID and other information. There are many different types of packets, and every different packet type has a standard length, and includes a header that identifies the type of packet. The different packets have different forms and variable lengths, but each particular packet type has a standard length.
Advantageously, each block includes a FIFO at the input, and the packets flow through the FIFOs where relevant information is accumulated in the FIFO by the block. The packet continues to flow through other or all of the blocks so that information relevant to that blocks function may be extracted.
In one embodiment of the inventive structure and method, the storage cells or registers within the FIFO""s has some predetermined width such that small packets may require only one FIFO register and bigger packets require a larger number of registers, for example 2, 3, 5, 10, 20, 50 or more registers. The variable packet length and the possibility that a single packet may consume several FIFO storage registers do not present any problem as the first portion of the packet identifies the type of packet and either directly, or indirectly by virtue of knowing the packet type, the size of the packet and the number of FIFO entries it consumes. The inventive structure and method provide and support numerous packet types which are described in other sections of this document.
Fragment coloring is performed for two-dimensional display space and involves an interpolation of the color from for example the three vertices of a triangle primitive, to the sampled coordinate of the displayed pixel. Essentially, fragment coloring involves applying an interpolation function to the colors at the three fragment vertices to determine a color for a location spatially located between or among the three vertices. Typically, but optionally, some account will be taken of the perspective correctness in performing the interpolation. The interpolation coefficients are cached as are the perspective correction coefficients.
Various compromises have conventionally be accepted relative to the computation of surface normals, particularly a surface normal that is interpolated between or among other surface normals, in the 3D graphics environment. The compromises have typically traded-off accuracy for computational ease or efficiency. Ideally, surface normals should be interpolated angularly, that is based on the actual angular differences in the angles of the surface normals on which the interpolation is based. In fact such angular computation is not well suited to 3D graphics applications.
Therefore, more typically, surface normals are interpolated based on linear interpolation of the two input normals. For low to moderate quality rendering, linear interpolation of the composite surface normals may provide adequate accuracy; however, considering a two-dimensional interpolation example, when one vector (surface normal) has for example a larger magnitude that the other vector, but comparable angular change to the first vector, the resultant vector will be overly influenced by the larger magnitude vector in spite of the comparable angular difference between the two vectors. This may result in objectionable error, for example, some surface shading or lighting calculation may provide an anomalous result and detract from the output scene.
While some of these problems could be minimized even if a linear interpolation was performed on a normalized set of vectors, this is not always practical, because some APIs support non-normalized vectors and various interpolation schemes, including, for example, three-coordinate interpolation, independent x, y, and z interpolations, and other schemes.
In the inventive structure and method the magnitude is interpolated separately from the direction or angle. The interpolated magnitude are computed then the direction vectors which are equal size. The separately interpreted magnitudes and directions are then recombined, and the direction is normalized.
While the ideal angular interpretation would provide the greatest accuracy, however, the interpolation involves three points on the surface of a sphere and various great-circle calculations. This sort of mathematical complexity is not well suited for real-time fast pipeline processing. The single step linear interpolation is much easier but is susceptible to greater error. In comparison to each of these procedures, the inventive surface normal interpolation procedure has greater accuracy than conventional linear interpolation, and lower computational complexity that conventional angular interpolation.
In a preferred embodiment of the invention, spatial setup is performed in the Setup Block (STP). The Setup (STP) block receives a stream of packets from the Sort (SRT) block. These packets have spatial information about the primitives to be rendered. The output of the STP block goes to the Cull (CUL) block. The primitives received from SRT can be filled triangles, line triangles, lines, stippled lines, and points. Each of these primitives can be rendered in aliased or anti-aliased mode. The SRT block sends primitives to STP (and other pipeline stages downstream) in tile order. Within each tile the data is organized in time order or in sorted transparency order. The CUL block receives data from the STP block in tile order (in fact in the order that STP receives primitives from SRT), and culls out parts of the primitives that definitely do not contribute to the rendered images. This is accomplished in two stages. The first stage allows detection of those elements in a rectangular memory array whose content is greater than a given value. The second stage refines on this search by doing a sample by sample content comparison. The STP block prepares the incoming primitives for processing by the CUL block. STP produces a tight bounding box and minimum depth value Zmin for the part of the primitive intersecting the tile for first stage culling, which marks the stamps in the bounding box that may contain depth values less than Zmin. The Z cull stage takes these candidate stamps, and if they are a part of the primitive, computes the actual depth value for samples in that stamp. This more accurate depth value is then used for comparison and possible discard on a sample by sample basis. In addition to the bounding box and Zmin for first stage culling, STP also computes the depth gradients, line slopes, and other reference parameters such as depth and primitive intersection points with the tile edge for the Z cull stage. The CUL unit produces the VSPs used by the other pipeline stages.
In the preferred embodiment of the invention, the spatial setup procedure is performed in the Setup Block. Important aspects of the inventive spatial setup structure and method include: (1) support for and generation of a unified primitive, (2) procedure for calculating a Zmin within a tile for a primitive, (3) the use of tile-relative y-values and screen-relative x-values, and (4) performing a edge hop (actually performed in the Cull Block) in addition to a conventional edge walk which also simplifies the down-stream hardware,
Under the rubric of a unified primitive, we consider a line primitive to be a rectangle and a triangle to be a degenerate rectangle, and each is represented mathematically as such. Setup converts the line segments into parallelograms which consists of four vertices. A triangle has three vertices. Setup describes the each primitive with a set of four points. Note that not all values are needed for all primitives. For a triangle, Setup uses top, bottom, and either left or right corner, depending on the triangle""s orientation. A line segment is treated as a parallelogram, so Setup uses all four points. Note that while the triangle""s vertices are the same as the original vertices, Setup generates new vertices to represent the lines as quads. The unified representation of primitives uses primitive descriptors which are assigned to the original set of vertices in the window coordinates. In addition, there are flags which indicate which descriptors have valid and meaningful values.
For triangles, VtxYmin, VtxYmax, VtxLeftC, VtxRightC, LeftCorner, RightCorner descriptors are obtained by sorting the triangle vertices by their y coordinates. For line segments these descriptors are assigned when the line quad vertices are generated. VtxYmin is the vertex with the minimum y value. VtxYmax is the vertex with the maximum y value. VtxLeftC is the vertex that lies to the left of the long y-edge (the edge of the triangle formed by joining the vertices VtxYmin and VtxYmax) in the case of a triangle, and to the left of the diagonal formed by joining the vertices VtxYmin and VtxYmax for parallelograms. If the triangle is such that the long y-edge is also the left edge, then the flag LeftCorner is FALSE (0) indicating that the VtxLeftC is invalid. Similarly, VtxRightC is the vertex that lies to the right of the long y-edge in the case of a triangle, and to the right of the diagonal formed by joining the vertices VtxYmin and VtxYmax for parallelograms. If the triangle is such that the long edge is also the right edge, then the flag RightCorner is FALSE (0) indicating that the VtxRightC is invalid. These descriptors are used for clipping of primitives on top and bottom tile edge. Note that in practice VtxYmin, VtxYmax, VtxLeftC, and VtxRightC are indices into the original primitive vertices.
For triangles, VtxXmin, VtxXmax, VtxTopC, VtxBotC, TopCorner, BottomCorner descriptors are obtained by sorting the triangle vertices by their x coordinates. For line segments these descriptors are assigned when the line quad vertices are generated. VtxXmin is the vertex with the minimum x value. VtxXmax is the vertex with the maximum x value. VtxTopC is the vertex that lies above the long xedge (edge joining vertices VtxXmin and VtxXmax) in the case of a triangle, and above the diagonal formed by joining the vertices VtxXmin and VtxXmax for parallelograms. If the triangle is such that the long x-edge is also the top edge, then the flag TopCorner is FALSE (0) indicating that the VtxTopC is invalid. Similarly, VtxBotC is the vertex that lies below the long x-axis in the case of a triangle, and below the diagonal formed by joining the vertices VtxXmin and VtxXmax for parallelograms. If the triangle is such that the long x-edge is also the bottom edge, then the flag BottomCorner is FALSE (0) indicating that the VtxBotC is invalid. These descriptors are used for clipping of primitives on the left and right tile edges. Note that in practice VtxXmin, VtxXmax, VtxTopC, and VtxBotC are indices into the original primitive vertices. In addition, we use the slopes (xcex4x/xcex4y) of the four polygon edges and the inverse of slopes (xcex4xyxcex4x).
All of these descriptors have valid values for quadrilateral primitives, but all of them may not be valid for triangles. Initially, it seems like a lot of descriptors to describe simple primitives like triangles and quadrilaterals. However, as we shall see later, they can be obtained fairly easily, and they provide a nice uniform way to setup primitives.
Treating lines as rectangles (or equivalently interpreting rectangles as lines) involves specifying two end points in space and a width. Treating triangles as rectangles involves specifying four points, one of which typically y-left or y-right in one particular embodiment, is degenerate and not specified. The goal is to find Zmin inside the tile. The x-values can range over the entire window width while the y-values are tile relative, so that bits are saved in the calculations by making the y-values tile relative coordinates.
A directed acyclical graph representation of 3D scenes typically assigns an identifier to each node in the scene graph. This identifier (the object tag) can be useful in graphical operations such as picking an object in the scene, visibility determination, collision detection, and generation of other statistical parameters for rendering. The pixel pipeline in rendering permits a number of pixel tests such as alpha test, color test, stencil test, and depth test. Alpha and color test are useful in determining if an object has transparent pixels and discarding those values. Stencil test can be used for various special effects and for determination of object intersections in CSG. Depth test is typically used for hidden surface removal.
In this document, a method of tagging objects in the scene and getting feedback about which objects passed the predetermined set of visibility criteria is described.
A two level object assignment scheme is utilized. The object identifier consists if two parts a group (g) and a member tag (t). The group xe2x80x9cg xe2x80x9d is a 4 bit identifier (but, more bits could be used), and can be used to encode scene graph branch, node level, or any other parameter that may be used grouping the objects. The member tag (t) is a 5 bit value (once again, more bits could be used). In this scheme, each group can thus have up to 32 members. A 32-bit status word is used for each group. The bits of this status word indicate the member that passed the test criteria. The state thus consists of: Object group; Object Tag; and TagTestlD {DepthTest, AlphaTest, ColorTest, StencilTest}. The object tags are passed down the pipeline, and are used in the z-buffered blend (or Pixel Block in the preferred embodiment). If the sample is visible, then the object tag is used to set a particular bit in a particular CPU-readable register. This allows objects to be fed into the pipeline and, once rendering is completed, the host CPU (that CPU or CPUs which are running the application program) can determine which objects were at least partially visible.
As an alternative, only the member tag (t) could be used, implying only one group.
Object tags can be used for picking, transparency determination, early object discard, and collision detection. For early object discard, an object can be tested for visibility by having its bounding volume input into the rendering pipeline and tested for xe2x80x9cvisibilityxe2x80x9d as described above. However, to prevent the bounding volume from being rendered into the frame buffer, the color, depth, and stencil masks should be cleared (see OpenGL specification for a description of these mask bits).
As an alternative to the object tags described above, a single bit can be used as feedback to the host CPU. In this method, the object being tested for xe2x80x9cvisibilityxe2x80x9d (i.e., for picking, transparency determination, early object discard, collision detection, etc) is isolated in its own frame. Then, if anything in the frame is visible, the single xe2x80x9cvisibility bitxe2x80x9d is set, otherwise it is cleared. This bit is readable by the host CPU. The advantage of this method is its simplicity. The disadvantage is the need to use individual frames for each separate object (or set of objects) that needs to be tested, thereby possibly introducing latency into the xe2x80x9cvisibilityxe2x80x9d determination.
When rendering 3D images, there is often a xe2x80x9chorizon effectxe2x80x9d where a horizontal swath through the picture has much more complexity than the rest of the image. An example is a city skyline in the distance with a simple grass plane in the foreground and the sky above. The grass and sky have very few polygons (possibly one each) while the city has lots of polygons and a large depth complexity. Such horizon effects can also occur along non-horizontal swaths through a scene. If tiles are processed in a simple top-to-bottom and left-to-right order, then the complex tiles will be encountered back-to-back, resulting in a possible load imbalance within the pipeline. Therefore, it would be better to randomly xe2x80x9chopxe2x80x9d around the screen when going from tile to tile. However, this would result in a reduction in spatial coherency (because adjacent tiles are not processed sequentially), reducing the efficiency of the caches within the pipeline and reducing performance. As a compromise between spatially sequential tile processing and a totally random pattern, tiles are organized into xe2x80x9cSuperTilesxe2x80x9d, where each SuperTile is a multiplicity of spatially adjacent tiles, and a random pattern of SuperTiles is then processed. Thus, spatial coherency is preserved within a SuperTile, and the horizon effect is avoided. In the preferred embodiment, the SuperTile hop sequence method is performed in the Sort Block
Normalization during output is an inventive procedure in which either consideration is taken of the prior processing history to determine the values in the frame buffer, or the values in the frame buffer are otherwise determined, and the range of values in the screen are scaled or normalized to that the range of values can be displayed and provide the desired viewing characteristic. Linear and non-linear scalings may be applied, and clipping may also be permitted so that dynamic range is not unduly taken up by a few relatively bright or dark pixels, and the dynamic range fits the conversion range of the digital-to-analog converter.
Some knowledge of the manner in which output pixel values are generated provides greater insight into the advantages of this approach. Sometimes the output pixel values are referred to as intensity or brightness, since they ultimately are displayed in a manner to simulate or represent scene brightness or intensity in the real world.
Advantageously, pixel colors are represented by floating point number so that they can span a very large dynamic range. Integer values though suitable once scaled to the display may not provide sufficient range given the manner the output intensities are computed to permit resealing afterward. We note that under the standard APIs, including OpenGL, that the lights are represented as floating point values, as are the coordinate distances. Therefore, with conventional representations it is relatively easy for a scene to come out all black (dark) or all white (light) or skewed toward a particular brightness range with usable display dynamic range thrown away or wasted.
Under the inventive normalization procedure, the computations are desirable maintained in floating point representations throughout, and the final scene is mapped using some scaling routine to bring the pixel intensity values in line with the output display and D/A converter capability. Such scaling or normalization to the display device may involve operations such as an offset or shift of a range of values to a different range of values without compression or expansion of the range, a linear compress or expansion, a logarithmic compression, an exponential or power expansion, other algebraic or polynomial mapping functions, or combinations of these. Alternatively, a look-up table having arbitrary mapping transfer function may be implemented to perform the output value intensity transformation. When it""s time to buffer swap in order to display the picture when it""s done, one logarithmically (or otherwise) scale during scanout.
Desirably, the transformation is performed automatically under a set of predetermined rules. For example, a rule specifying pixel histogram based normalization may be implemented, or a rule specifying a Gaussian distribution of pixels, or a rule that linearly scales the output intensities with or without some optional intensity clipping. The variety of mapping functions provided here are merely examples, of the many input/output pixel intensity transformations known in the computer graphics and digital image processing arts.
This approach would also permit somewhat greater leeway in specifying lighting, object color, and the like and still render a final output that was visible. Even if the final result was not esthetically perfect, it would provide a basis for tuning the final mapping, and some interactive adjustment may desirably but optionally be provided as a debugging, fine-tuning, or set-up operation.
When a VSP is dispatched, it corresponds to a single primitive, and the z-buffered blend (i.e., the Pixel Block) needs separate z-values for every sample in the VSP. As an improvement over sending all the per-sample z-values within a VSP (which would take considerable bandwidth), the VSP could include a z-reference-value and the partial derivatives of z with respect to x and y (mathematically, a plane equation for the z-values of the primitive). Then, this information is used in the z-buffered blend (i.e., the Pixel Block) to reconstruct the per-sample z-values, thereby saving bandwidth. Care must be taken so that z-values computed for the CHSR process are the same as those computer in the z-buffered blend (i.e., the Pixel Block) because inconsistencies could cause rendering errors.
In the preferred embodiment, the stamp-based z-value description method is performed in the Cull Block, and per-sample z-values are generated from this description in the Pixel Block.
The Phong Lighting Block advantageously includes a plurality of processors or processing elements. During fragment color generation a lot of state is needed, fragments from a common object use the same state, and therefore desirably for at least reasons of efficiency a minimizing caching requirements, fragments from the same object should be processed by the same processor.
In the inventive structure and method, all fragments that originate from the same object are sent to the same processors (or if there is sever loading to the same plurality of processors). This reduces state caching in the Phong block.
Recall that preferred embodiments of the inventive structure and method implement per-tile processing, and that a single time may include multiple objects. The Phong block cache will therefore typically store state for more than one object, and send appropriate state to the processor which is handling fragments from a common object. Once state for a fragment from a particular object is sent to a particular processor, it is desirable that all other fragments from that object also be directed to that processor.
In this connection, the Mode Injection Unit (MIJ) assigns an object or material, and MIJ allocates cache in all down stream blocks. The Phong unit keeps track of which object data has been cached in which Phong unit processor, and attempts to funnel all fragments belonging that same object to the same processor. The only optional exception to this occurs if there is a local imbalance, in which case the fragments will be allocated to another processor.
This object-tag-based resource allocation (alternatively referred to as material-tag-based resource allocation in other portions of the description) occurs relative to the fragment processors or fragment engines in the Phong unit.
The Phong unit is responsible for performing texture environment calculations and for selecting a particular processing element for processing fragments from an object. As described earlier, attempts are made to direct fragments from a common object to the same phong processor or engine. Independent of the particular texture to be applied, properties of the surfaces, colors, or the like, there are a number of choices and as a result changes in the processing environment. While dynamic microcode generation is described here relative to the texture environment and lighting, the incentive structure and procedure may more widely be applied to other types of microcode, machine state, and processing generally.
In the inventive structure and method, each time processing of a triangle strip is initiated, a change material parameters occurs, or a change almost anything that touches the texture environment happens, a microcode engine in the phong unit generates microcode and this microcode is treats as a component of pipeline state. The microcode component of state is an attribute that gets cached just like other pipeline state. Treatment of microcode generated in this manner as machine state generally, and as pipeline state in a 3D graphics processor particularly, as substantial advantages.
For example, the Phong unit includes multiple processors or fragment engines. (Note that the term fragment engines here describes components in the Phong unit responsible for texture processing of the fragments, a different process than the interpolation occurring in the Fragment Block.) The microcode is downloaded into the fragment engines so that any other fragment that would come into the fragment engine and needs the same microcode (state) has it when needed.
Although embodiments of each of the fragment engines in the Phong Block are generically the same, the presence of the downloadable microcode provides a degree of specialization. Different microcode may be downloaded into each one dependent on how the MIJ caching mechanism is operating. Dynamic microcode generation is therefore provided for texture environment and lighting
Generating variable scale bump maps involves one or both of two separate procedures: automatic basis generation and automatic gradient field generation. Consider a gray scale image and its derivative in intensity space. Automatic gradient filed takes a derivative, relative to gray scale intensity, of a gray scale image, and uses that derivative as a surface normal perturbation to generate a bump for a bump map. Automatic basis generation saves computation, memory storage in polygon memory, and input bandwidth in the process.
For each triangle vertex, an s,t and surface normal are specified. But the s and t aren""t color, rather they are two-dimensional surface normal perturbations to the texture map, and therefore a texture bump map. The s and t are used to specify the directions in which to perturb the surface normals in order to create a usable bump map. The s,t give us an implied coordinate system and reference from which we can specify perturbation direction. Use of the s,t coordinate system at each pixel eliminates any need to specify the surface tangent and the bi-normal at the pixel location. As a result, the inventive structure and method save computation, memory storage and input bandwidth.
A set of per-pixel tile staging buffers exists between the PixelOut and the BKE block. Each of these buffers has three state bits Empty, BkeDoneForPix, and PixcDoneForBke associated with it. These bits regulate (or simulate) the handshake between the PixelOut and Backend for the usage of these buffer. Both the backend and the PixelOut unit maintain current InputBuffer and OutputBuffer pointers which indicate the staging buffer that the unit is reading from or writing to.
For preparing the tiles for rendering by PIX, the BKE block takes the next Empty buffer and reads in the data from the frame buffer memory (if needed, as determined by the RGBAClearMask, DepthMask, and StencilMaskxe2x80x94if a set of bit planes is not cleared it is read into). After Backend is done with reading in the tile, it sets the BkeDoneForPix bit. PixelOut looks at the BkeDoneForPix bit of the InputTile. If this bit is not set, then pixelOut stalls, else it clears the BkeDoneForPix bit, and the color, depth, and/or stencil bit planes (as needed) in the pixel tile buffer and transfers it to the tile sample buffers appropriately.
On output, the PixelOut unit resolves the samples in the rendered tile into pixels in the pixel tile buffers. The backend unit (BKE) block transfers these buffers to the frame buffer memory. The Pixel buffers are traversed in order by the PixelOut unit. PixelOut emits the rendered sample tile to the same pixel buffer that it came from. After the tile output to the pixel tile buffer is completed, the PixelOut unit sets the PixDoneForBke bit. The BKE block can then take the pixel tile buffer with PixDoneForBke set, clears that bit and transfer it to the frame buffer memory. After the transfer is complete, the Empty bit is set on the buffer.
The Backend Unit is responsible for sending data and or signals to the CRT or other display device and includes a Digital-to-Analog (D/A) converter for converting the digital information to analog signals suitable for driving the display. The backend also includes a bilinear interpolator, so that pixels from the frame buffer can be interpolated to change the spatial scale of the pixels as they are sent to the CRT display. The pixel zooming during scanout does not involve rerendering it just scales or zooms (in or out) resolution on the fly. In one embodiment, the pixel zooming is performed selectively on a per window basis, where a window is a portion of the overall desktop or display area.
Conventional structures and methods provide an on-screen memory storage and an off-screen memory storage, each having for example, a color buffer, a z-buffer, and some stencil. The 3D rendering process renders to these off-screen buffers. The one screen memory corresponds to the data that is shown on the display. When the rendering has completed to the off-screen memory, the content of the off-screen memory is copied to the on-screen memory in what is referred to as a block transfer (BLT).
In order to save memory bandwidth and realize other benefits described elsewhere in this description, the inventive structure and method perform a xe2x80x9cvirtualxe2x80x9d block transfer or virtual BLT by splicing the data in or reading the data from an alternate location.
A token in this context is an information item interposed between other items fed down the pipeline that tell the pipeline what the entries that follow correspond to. For example, if the x,y,z coordinates of a vertex are fed into the pipeline and they are 32-bit quantities, the tokens are inserted to inform the pipeline that the numbers that follow are vertex x,y,z values since there are no extra bits in the entry itself for identification. The tokens that tell the pipeline hardware how to interpret the data that""s being sent in.