A three-dimensional (3D) graphic processing device uses a description of an object such as a polygon, line, or triangle to generate the object's constituent fragments. A fragment is defined as all information required to render a single pixel that is within the boundaries of the object, for example, the x and y coordinates of the pixel, the red, green and blue color values used to modify the pixel, alpha transparency and Z depth values, texture coordinates, and the like. The graphics device must determine which fragments are contained within the object. Most prior art fragment generation methods fall into two categories: scanline and half-plane edge functions.
A scanline-based fragment generator renders trapezoids on a graphics rendering surface of an output device, such as a printer page or a display terminal screen. Without loss of generality, here a scanline is considered to be a (horizontal) row of pixels, and the top and bottom edges of the trapezoid are horizontal. Note that some fragment generators consider a scanline to be a (vertical) column of pixels and the right and left edges of the trapezoid are vertical.
The scanline fragment generator determines the inverse of the slope of the left and right edges of the trapezoid in order to determine how many pixels the left and right edges move horizontally when moving from one scanline to the next. At each scanline, the generator uses the inverse slope information to determine a starting pixel address and either a length or ending pixel address. This information is used to generate corresponding fragment information for each pixel position on the scanline within the object.
To render a non-trapezoidal object, such as an arbitrary triangle, the generator, in effect, renders two trapezoids while sharing some computation between the two. The generator first determines the inverse of the slope of all three edges of the triangle. The generator then vertically partitions the triangle into a top portion and a bottom portion, the point for partitioning being the y coordinate of the vertex that is between the top and bottom of the triangle.
The two portions are degenerate trapezoids. The top portion has a top edge with a length of zero; the bottom portion has a bottom edge with a length of zero. The fragments for the top trapezoid can then be generated, and one of the inverse slopes used to generate the top portion can later be used to generate fragments for the bottom trapezoid portion.
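The per-scanline stepping described above can be sketched as follows. This is a minimal illustration, not a description of any particular prior art generator. It assumes pixel centers at integer coordinates, y increasing from one scanline to the next, an inclusive left edge, and an exclusive right edge. The function name `trapezoid_spans` and its parameters are hypothetical.

```python
import math

def trapezoid_spans(y_top, y_bottom, x_left, x_right,
                    inv_slope_left, inv_slope_right):
    """Yield (y, x_start, x_end) for each scanline of a trapezoid.

    x_end is exclusive. inv_slope_* is the horizontal movement of an
    edge per scanline (dx/dy), as in a scanline fragment generator.
    """
    xl, xr = x_left, x_right
    for y in range(y_top, y_bottom):
        # First pixel at or right of the left edge, up to the right edge.
        yield (y, math.ceil(xl), math.ceil(xr))
        # Step both edges to the next scanline using the inverse slopes.
        xl += inv_slope_left
        xr += inv_slope_right

# Top portion of a triangle: a degenerate trapezoid whose top edge
# has zero length (x_left == x_right at the apex).
spans = list(trapezoid_spans(0, 3, 3.0, 3.0, -1.0, 1.0))
```

For the bottom portion of a triangle, the same function would be called again with one inverse slope carried over from the top portion and the other replaced by the inverse slope of the third edge.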
A half-plane edge function fragment generator uses planar (affine) edge functions of the x and y screen coordinates. The values of these edge functions at a given pixel determine directly if the pixel is inside or outside an object. As an advantage, the generator does not need to determine the inverse slopes of the edges of the objects. However, traversal of the object is less intuitive than with a scanline generator. Given the value of the edge functions at various points surrounding the current position, the generator decides where to go next.
An introduction to half-plane edge functions is given by J. Pineda in "A Parallel Algorithm for Polygon Rasterization," ACM Computer Graphics, Volume 22, Number 4, August 1988 (SIGGRAPH 1988 issue), which is hereby incorporated by reference as background information, though the basic traversal methods described by Pineda are less than optimal.
As a very brief summary, each directed edge of an object, such as a triangle with three edges or a line with four edges, is represented as a function that partitions the 2D (x, y) rendering plane into two portions: at points to the left of the parting edge with respect to its direction, the function is negative, and at points on the parting edge or to the right of the parting edge the function is nonnegative, that is, zero or positive.
By combining information from all edge functions at a given point, it can be determined whether the point is inside or outside the object. For example, if the three directed edges of a triangle connect in a clockwise fashion, then a point is inside the triangle if all three edge functions are nonnegative. If the three edges connect in a counterclockwise fashion, then a point is inside the triangle if all three edge functions are negative. Note that points along an edge or vertex that is shared between two or more objects should be assigned to exactly one object. The edge equations can be adjusted during setup to accomplish this.
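The inside test just described can be illustrated with a short sketch. It assumes conventional mathematical axes (y increasing upward), so that "right of the edge" and "clockwise" have their usual meaning; with y increasing downward the signs are mirrored. The helper names `edge_function` and `inside` are illustrative only, and the sketch omits the setup adjustment that assigns shared edges to exactly one object.

```python
def edge_function(p0, p1):
    """Return E(x, y) for the directed edge p0 -> p1.

    E is negative to the left of the edge, zero on it, and positive
    to the right (assuming y increases upward).
    """
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = x1 - x0, y1 - y0
    return lambda x, y: (x - x0) * dy - (y - y0) * dx

def inside(vertices, x, y, clockwise=True):
    """Combine all edge functions of a convex object at point (x, y)."""
    n = len(vertices)
    values = [edge_function(vertices[i], vertices[(i + 1) % n])(x, y)
              for i in range(n)]
    if clockwise:
        return all(v >= 0 for v in values)   # all nonnegative
    return all(v < 0 for v in values)        # all negative

cw_triangle = [(0, 0), (0, 4), (4, 0)]       # clockwise with y up
ccw_triangle = list(reversed(cw_triangle))
```

Note that a point lying exactly on an edge, such as (0, 2), is inside the clockwise triangle but outside the counterclockwise one under these rules, which is why the edge equations need adjusting when objects share edges.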
FIG. 2 shows a triangle 200 that can be described by three clockwise directed edges 201-203, which are shown as bold arrows. The half-plane where each corresponding edge function is nonnegative is shown by the several thin "shadow" lines 210. The shadow lines 210 have the same slope as the corresponding edge. The shaded portion of FIG. 2 shows the area where all edge functions are nonnegative, i.e., points within the triangle object 200.
One advantage of using half-plane edge functions is that parallel fragment generation is possible. For example, one can define a "fragment stamp" as a 2^m pixel wide by 2^n pixel high rectangle, and simultaneously determine all fragments that are within both the fragment stamp and the object.
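As an illustration of a fragment stamp, the sketch below tests every sample position of a small stamp against a set of edge functions given as coefficient triples (A, B, C) of E(x, y) = Ax + By + C. Hardware would evaluate the positions in parallel; the loop here is sequential and only shows which fragments result. The function name `stamp_coverage`, the choice of one sample per pixel at integer coordinates, and the example triangle are assumptions made for the example.

```python
def stamp_coverage(edges, stamp_x, stamp_y, width, height):
    """Return the (x, y) positions of a stamp that lie inside the object.

    edges: list of (A, B, C) for a clockwise convex object, so a
    position is inside when every A*x + B*y + C is nonnegative.
    (stamp_x, stamp_y) is the stamp's corner with the smallest x and y.
    """
    covered = []
    for y in range(stamp_y, stamp_y + height):
        for x in range(stamp_x, stamp_x + width):
            if all(A * x + B * y + C >= 0 for A, B, C in edges):
                covered.append((x, y))
    return covered

# Clockwise triangle (0,0) -> (0,4) -> (4,0), y increasing upward.
triangle_edges = [(4, 0, 0), (-4, -4, 16), (0, 4, 0)]
```

A 2 by 2 stamp placed at (0, 0) is fully covered, a stamp at (2, 2) yields a single fragment, and a stamp at (4, 4) yields none.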
Most known half-plane based fragment generators first move the stamp horizontally left, and then horizontally right across a row "stampline" before stepping up or down somewhere into the next stampline. A stampline is similar to a scanline, except that a row stampline has a height equal to the height (i.e., the vertical extent of the stamp, as measured in units of pixels) of the fragment stamp. Alternatively, the stamp can be moved vertically up and down in a column stampline, followed by stepping horizontally into the next column stampline. In this alternative, the column stampline has a width equal to the width of the fragment stamp.
Although Pineda does not describe stamp movement in any great detail, his most efficient implementation implies a method that starts at a vertex that lies on one of the four edges of a minimal horizontally and vertically aligned rectangular bounding box that encloses the object.
The best Pineda traversal method requires at least two stamp contexts. A stamp context is all the information needed to place the stamp at a given position within the object. The context information includes the x and y position of the stamp, the value of all four half-plane edge evaluators, as well as the value of all channel data being interpolated from values provided at the object's vertices. The channel data includes, for example, color, transparency, Z depth, and texture coordinates.
Unfortunately, the Pineda implementation frequently allows the stamp to move outside of the object. This means that the stamp has to somehow find its way back into the object. This increases the amount of time taken to traverse the object completely.
One way to fix this straying problem is to start at a vertex of the triangle that is at one corner of the minimal bounding box. However, usually no vertex of a wide line or an antialiased line will be in the corner of the bounding box, so this solution is of limited usefulness. A more general solution, which works for "four-sided lines" as well as three-sided triangles, adds a third stamp context. If no restrictions are placed upon the starting vertex, then four stamp contexts are required.
Typically, it takes approximately 600 bits or more to store a stamp context. With so many bits, the amount of chip "real estate" required to store stamp contexts becomes significant. Furthermore, as more contexts are used, the decision logic to compute and multiplex the next stamp position becomes more complex and slower. Because stamp movement computations cannot be pipelined, this decision and multiplexing logic may determine the minimum cycle time of the fragment generation logic. Thus, it is desirable that movement methods be implemented with a minimum number of such stamp contexts.
Regardless of the number of contexts used, the stamp movement methods implied by Pineda, and other known scanline fragment generators, traverse an object in a similar manner. They generate all fragments on a stampline, and then proceed to the next stampline.
Consequently, none of these approaches generate fragments in an order that is most efficient for a frame buffer constructed from typical dynamic RAM (DRAM, VRAM, SDRAM, SGRAM, FBRAM, etc.) used in graphics processors. This is true for the following reasons.
Dynamic RAM is partitioned into pages. A dynamic RAM offers one or more banks. Each bank acts as a cache line in a direct-mapped cache for the pages. That is, each page in the RAM is associated with exactly one of the banks. The RAM offers very fast access to a page that is already loaded into its corresponding bank.
However, to access a page which is not already loaded into its corresponding bank, the bank must be written back to the page from which it was loaded ("precharged"), and the new page must be loaded into the bank ("row activated"). The precharge and row activate operations typically take three to eight times longer than accessing data already loaded into the bank. The combination of precharge and row activate operations is hereafter referred to as "page crossing overhead."
To alleviate this overhead, some modern DRAMs (e.g., SDRAM, RAMBUS Direct RAM) allow precharge and row activate operations for one bank to be overlapped with data read or write operations in another bank. If precharge and row activate commands are issued sufficiently far in advance (the page is "prefetched"), then the page crossing overhead can be substantially reduced, or even completely hidden.
In order to reduce page crossing overhead, it is desirable to:
(1) arrange page dimensions so that most objects are stored in as few pages as possible, and
(2) generate all the fragments for an object that reside in a given page before generating any fragments for a different page.
In order to satisfy (1), most graphics systems "tile" the rendering plane (screen or printer page) with DRAM pages that are as square as possible rather than linearly allocating screen pixels to pages. For example, rather than allocating a page that can hold 64 pixels as a strip that is 64 pixels wide by 1 pixel high, a graphics accelerator might allocate the page as a tile that is 8 pixels wide by 8 pixels high. On the average, this mapping of pixel locations into physical memory locations tends to group more fragments of an object onto a given page.
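The difference between the two allocations can be sketched with a pair of address-mapping functions. The function names, the 640-pixel screen width, and the assumption that the width is a multiple of both the page size and the tile width are illustrative choices, not taken from any particular accelerator.

```python
def linear_page(x, y, screen_width, pixels_per_page=64):
    """Page index when pixels are allocated to pages in scanline order."""
    return (y * screen_width + x) // pixels_per_page

def tiled_page(x, y, screen_width, tile_w=8, tile_h=8):
    """Page index when each page is a tile_w x tile_h tile of the screen."""
    tiles_per_row = screen_width // tile_w
    return (y // tile_h) * tiles_per_row + (x // tile_w)

# An 8 x 8 pixel object at the screen origin on a 640-pixel-wide screen.
pixels = [(x, y) for y in range(8) for x in range(8)]
linear_pages = {linear_page(x, y, 640) for x, y in pixels}
tiled_pages = {tiled_page(x, y, 640) for x, y in pixels}
```

Under these assumptions the object touches eight pages with the linear allocation (one per scanline) but only one page with the tiled allocation.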
FIGS. 3A-3D demonstrate this mapping. The thin lines 301 demarcate pixel boundaries, while the thick lines 302 demarcate page boundaries. The arrows 303 show the order in which fragments are generated, starting at the top-most scanline down through the bottom-most scanline. FIGS. 3A-3D show traversal orders for triangles 300 residing in one to four pages respectively.
One Page
In FIG. 3A, all pixels within the triangle lie on the same page, which substantially reduces page crossing overhead when compared to a linear assignment of pixels to pages. Unfortunately, when compared to a linear allocation, this technique can increase the page crossing overhead for some small triangles, and for nearly all large triangles, which must access two or more pages on each scanline in the widest parts of the triangle.
Two Pages
FIG. 3B shows such a situation in which fragment generation alternates between two pages of memory on the second, third, and fourth scanlines, requiring two page crossings on each such scanline. A one-bank DRAM would incur expensive page crossing overhead twice on these scanlines. A two-bank DRAM would be more forgiving, as most graphics accelerators "checkerboard" pages, so that pages that are horizontally or vertically adjacent lie in different banks. With such checkerboarding, the accelerator would access the two different pages in different banks.
Three Pages
For some objects, even a two-bank DRAM encounters problems. FIG. 3C shows a triangle that is stored in three pages. Two of the pages must use the same bank in a two-bank DRAM. For example, if the two banks are checkerboarded, the left-most and right-most pages reside in the same bank. Page crossing overhead occurs twice on each of the first three scanlines: once to fetch the left-most page into the bank, and once to fetch the right-most page into the bank.
Four Pages
FIG. 3D shows a triangle that is stored in four pages, two for each bank in a two-bank DRAM. The crossing from the top two pages to the bottom two pages may have insufficient work on the bottom scanline of each of the top pages to allow page crossing overhead to be completely hidden by prefetching. For example, if pages are checkerboarded, the top left and bottom right pages share bank A, and the top right and bottom left pages share bank B. The bottom right page cannot be fetched into bank A until all transactions in the top left page are completed. Even worse, the bottom left page cannot be fetched into bank B until all transactions in the top right page are completed. The page crossing overhead from the top right page to the bottom left page is fully exposed.
It would thus be desirable to be able to constrain the order of fragment generation so that all fragments of an object on each page are generated before any fragments on another page.
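The benefit of such an ordering can be made concrete by counting page changes between consecutive fragments. The sketch below is only an illustration: it assumes 8 by 8 pixel pages, and it obtains the page-respecting order by sorting fragments that were already generated, which a hardware fragment generator would not do. The names `page_of`, `count_page_crossings`, and `group_by_page` are hypothetical.

```python
def page_of(x, y, tile_w=8, tile_h=8):
    """Identify the page (tile) that holds pixel (x, y)."""
    return (x // tile_w, y // tile_h)

def count_page_crossings(fragments):
    """Count how often consecutive fragments fall on different pages."""
    crossings = 0
    current = None
    for x, y in fragments:
        page = page_of(x, y)
        if current is not None and page != current:
            crossings += 1
        current = page
    return crossings

def group_by_page(fragments):
    """Reorder fragments so each page is finished before the next starts."""
    return sorted(fragments, key=lambda f: page_of(*f))

# A 4 x 4 pixel object straddling two horizontally adjacent pages,
# generated in scanline order.
scanline_order = [(x, y) for y in range(4) for x in range(6, 10)]
```

For this object, scanline order changes page seven times, while the page-respecting order changes page once.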
Checkerboarding
In order to maximize the possibility of hiding page crossing overhead by prefetching early enough, many graphics accelerators not only allocate each page to a rectangular region of the rendering plane, but as mentioned above, further allocate the rectangular regions such that a given page in one bank is in a different bank from the pages above, below, left, or right of it.
FIG. 4 shows this "checkerboarded" arrangement of pages where again thin lines 401 demarcate pixel boundaries, while the thick lines 402 demarcate page boundaries. Further, the shaded pages 403 belong to one bank, while the unshaded pages 404 belong to the other bank.
To take advantage of multiple bank DRAM, it is desirable that the fragment generator be aware of and exploit the bank arrangements, so that after all fragments on one page have been generated, the next page for which fragments are generated is in a different bank if possible.
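One simple way to obtain a checkerboarded bank assignment is to derive the bank from the sum of the page's column and row indices. This is a sketch of one possible mapping, assumed here for illustration; the function name `bank_of` is hypothetical.

```python
def bank_of(page_col, page_row, banks=2):
    """Assign a page to a bank so that horizontally and vertically
    adjacent pages always land in different banks."""
    return (page_col + page_row) % banks

# Pages that are two apart in a row, or diagonal neighbors, share a bank
# in a two-bank arrangement.
same_row_conflict = bank_of(0, 0) == bank_of(2, 0)
diagonal_conflict = bank_of(0, 0) == bank_of(1, 1)
```

With two banks, this mapping reproduces the conflicts discussed for FIGS. 3C and 3D: the left-most and right-most of three pages in a row share a bank, as do diagonally opposite pages in a two by two group.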
Texture Cache Accesses
Furthermore, the efficiency of accesses to texture memory is directly influenced by the order in which fragments are generated. If the texture memory has a cache associated with it, then rendering large triangles may cause a sudden and large increase in texture cache capacity misses. This is because texture data fetched for a fragment on one stampline is ejected from the cache before the data can be reused for nearby fragments on an adjacent stampline.
Thus, it would be desirable to be able to constrain the order of fragment generation so that the capacity miss rate of the texture cache is reduced. That is, the rendering surface can be partitioned into rectangular tiles, where all positions within a tile should be visited before moving to another tile, and where the tile size is related to the texture cache size(s), the texture cache line size, and the hierarchical structure of the cache.
It is also desirable to maintain locality of reference in texture memory when moving from one tile to another. That is, when all positions in the object within one tile have been visited, it is desirable to move to a nearby tile rather than to a more distant tile.
Furthermore, while maintaining all the benefits of mapping tile dimensions to memory pages, it is desirable to simultaneously decrease the texture cache miss rate. Specifically, it would be desirable to visit all locations within a tile before visiting any positions in other tiles. Smaller tiles may be combined into a larger tile, a metatile overlaying smaller tiles. Thus, once all the locations in a tile are visited, the next tile visited should be within the metatile. When all of the tiles in a metatile have been visited, a different metatile is selected, and the process of visiting locations within a tile and then visiting other tiles within the metatile is repeated.
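The nested ordering constraint can be expressed as a sort key. The sketch below assumes square tiles of 4 positions and square metatiles of 8 positions, with each tile lying wholly inside one metatile; these sizes and the name `visit_order` are illustrative. Sorting a list of positions is used only to show the resulting order and is not how a traversal engine would produce it.

```python
def visit_order(positions, tile=4, metatile=8):
    """Order positions so that every tile is finished before another tile
    is started, and every metatile before another metatile."""
    def key(position):
        x, y = position
        return ((y // metatile, x // metatile),   # which metatile
                (y // tile, x // tile),           # which tile within it
                (y, x))                           # position within the tile
    return sorted(positions, key=key)

# A 16 x 8 block of positions: two metatiles, each holding four tiles.
positions = [(x, y) for y in range(8) for x in range(16)]
order = visit_order(positions)
```

In the resulting order the first 16 positions all lie in one tile, the first 64 in one metatile, and no tile or metatile is revisited once it has been left.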
Tiling Prior Art
The paper "The Design and Analysis of a Cache Architecture for Texture Mapping," by Ziyad S. Hakura and Anoop Gupta, in Proceedings of the 24th ISCA (1997), describes how various performance results improve when fragments are generated in tiles. However, the details of how to accomplish such tiling are not described. Since this paper describes software simulation, it is likely that the tiling fragment generation is based upon a scanline generator. The high degree of parallelism in half-plane generators is a boon for hardware implementations but is usually a source of inefficiency for software implementations.
Microsoft's Talisman, see "Talisman: Commodity Realtime 3D Graphics for the PC," by Jay Torborg and James Kajiya, in Proceedings of SIGGRAPH 96, and an Apple chip described in "Hardware Accelerated Rendering of Antialiasing Using a Modified A-Buffer Algorithm," by Stephanie Winner et al. in Proceedings of SIGGRAPH 97, must process "blocks" of fragments, because these implementations do not include enough memory to hold all fragment information needed to render 3D graphics on a full rendering plane.
However, those implementations bear little resemblance to the graphics processor described here. They require that all fragments from different objects that lie within a particular portion of memory be generated before any fragments for a neighboring portion. Therefore, those implementations require that the graphics engine save up all objects in a scene, sort these objects, replicate the objects when an object has fragments in two or more portions of memory, then present all the objects in each portion to the fragment generator as a group, and then present all the objects (some duplicated) in the next block, etc. The fragment generator does not automatically move from block to block within an object, but is instead presented with the same object multiple times at perhaps widely separated intervals in time. Each time it is presented with a different block from a given object, it is either provided with a new starting point within the object, or it is given a "new" object, which is the original object clipped to the current block's boundaries.
Sorting and replicating graphic objects consumes system resources, as does computing multiple starting points for an object or clipping an object to each block it overlaps. For some 3D application interfaces, such as OpenGL, which do not require one to present all objects in a frame before anything can be rendered, it is impossible to use these prior art techniques.
The present invention relates to a method and a computer system for visiting all stamp locations that are relevant to a two-dimensional convex polygonal object, such as might be encountered when rendering an object on a display device. The object is visited with a rectangular stamp, which contains one or more discrete sample points. A relevant location is one in which the object contains at least one of the stamp's sample points when the stamp is placed at that location. Stamp locations are discrete points that are separated vertically by the stamp's height, and horizontally by the stamp's width. The stamp may move to a nearby position, or to a previously saved position, as it traverses the object. The plane in which the object lies is partitioned into rectangular tiles, which are at least as wide and high as the stamp. The invention visits stamp locations in an order that respects tile boundaries; that is, it visits all locations within one tile before visiting any locations within another tile.
In terms of the method, the invention uses each pair of vertices, in the order presented, to construct a directed edge between the vertices. Each directed edge is represented by an affine function of the form E(x,y)=Ax+By+C, in which all points to the left of the edge have a negative value, all points on the edge have a zero value, and all points to the right of the edge have a positive value. Points are considered within the object if all edge functions are nonnegative for objects described by a series of clockwise vertices, or if all edge functions are negative for objects described by a series of counterclockwise vertices. Some edge functions are effectively infinitesimally displaced from their corresponding edge, so that edges that are shared between adjacent objects assign points directly on the edge to exactly one of the objects. The edge functions are evaluated at several points near the current position. Some nearby stamp positions are also checked to see if they are within the same tile or within a different tile. The sign bits of all edge functions are evaluated at several points, and the bits indicating if nearby stamp positions are in the same or a different tile are combined to determine if the next position of the stamp should be one of the nearby positions, if the next position should be fetched from a previously stored context, or if all locations within the object have been visited. These bits are also combined to determine which, if any, of the nearby locations should be stored into their corresponding contexts.
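Because each edge function is affine, its value at a neighboring stamp position can be obtained from the current value by a single addition. The sketch below derives A, B, and C from a pair of vertices and checks this property. It assumes y increasing upward so that positive values lie to the right of the edge, and it omits the infinitesimal displacement of shared edges and the tile-boundary bits described above. The function names are hypothetical.

```python
def edge_coefficients(p0, p1):
    """Return (A, B, C) such that E(x, y) = A*x + B*y + C is zero on the
    directed edge p0 -> p1 and positive to its right (y increasing upward)."""
    (x0, y0), (x1, y1) = p0, p1
    A = y1 - y0
    B = x0 - x1
    C = -(A * x0 + B * y0)
    return A, B, C

def evaluate(coefficients, x, y):
    """Evaluate the edge function directly at (x, y)."""
    A, B, C = coefficients
    return A * x + B * y + C

def step_horizontal(value, coefficients, stamp_width):
    """Edge value one stamp position to the right of the current one."""
    return value + coefficients[0] * stamp_width

def step_vertical(value, coefficients, stamp_height):
    """Edge value one stamp position further along the y axis."""
    return value + coefficients[1] * stamp_height

edge = edge_coefficients((0, 4), (4, 0))
value_here = evaluate(edge, 1, 1)
```

Only the sign of each stepped value is needed to decide whether a neighboring position is inside the object, which is why the sign bits at several nearby points suffice to choose the next stamp position.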
In one aspect of the invention, the first stamp position is near a vertex that lies on an edge of the unique minimal rectangular bounding box that contains the object and has two horizontal and two vertical edges. The invention uses up to six contexts, the current context as well as five saved contexts, to visit all locations within the object while respecting tile boundaries.
In another aspect of the invention, one of the five saved contexts shares physical storage space with two other saved contexts, and so while the invention conceptually uses a total of six contexts, it physically uses space for only five contexts.
In another aspect of the invention, a different polygon traversal process enables the invention to respect tile boundaries with only four contexts.
In another aspect of the invention, the traversal order from tile to tile occurs as much as possible in a serpentine manner. That is, when all locations in the object within one tile have been visited, the next tile visited is chosen to be close whenever possible.
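A serpentine order over a full rectangular grid of tiles can be sketched as follows. This shows only the general idea of alternating direction on successive rows so that consecutive tiles stay adjacent; an actual traversal would visit only tiles that overlap the object and would choose its direction from the edge function values. The function name `serpentine_tiles` is hypothetical.

```python
def serpentine_tiles(cols, rows):
    """Return tile coordinates (col, row) row by row, reversing the
    horizontal direction on every other row."""
    order = []
    for row in range(rows):
        if row % 2 == 0:
            columns = range(cols)                # left to right
        else:
            columns = range(cols - 1, -1, -1)    # right to left
        order.extend((col, row) for col in columns)
    return order

tiles = serpentine_tiles(3, 2)
```

Every consecutive pair of tiles in this order shares an edge, which helps preserve locality of reference when moving from one tile to the next.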
In another aspect, tiles are partitioned into two or more disjoint sets. Tiles are arranged such that for any given tile belonging to one of the sets, each adjacent tile above, below, left and right of the tile belongs to a different set from the given tile's set. When tiles are partitioned into two sets, this results in a familiar checkerboard pattern of tiles. When all locations in the object within one tile have been visited, the next tile visited is chosen to be within a different set whenever possible.
In another aspect of the invention, the plane in which the object lies is partitioned into a second grid of tiles ("metatiles"), and the visitation order respects both tile and metatile boundaries. Each tile may be completely contained within a metatile; alternatively, the tile and metatile grids may be offset such that each tile is contained in several metatiles. The invention visits each location in the object respecting both tile and metatile boundaries, by visiting all locations in one metatile before visiting any locations within another metatile, and within each metatile by further visiting all locations within one tile before visiting any locations in another tile.