The present invention relates to a method for rasterizing a graphic primitive and, in particular, to an accelerated method for rasterizing a graphic primitive in a graphics system in order to generate pixel data for the graphic primitive from graphic primitive description data.
For accelerating the process of image-rendering of three-dimensional images, it is known to use multi-processors or hardware pipelines in parallel. Each of these units acts upon a sub-set of the information contained in an entire image, as has been described by James D. Foley et al. in “Computer Graphics: Principles and Practice”, second edition, 1990, pages 887 to 902. This task can be divided up by either processing, in parallel, objects (polygons) in the image, or by processing certain sections of the image in parallel. Mere implementation of the division of objects leads to a subdivision of the object level description of a scene (vertex list), so that each of the processors is equally loaded. This division is carried out independently of the arrangement of the respective objects in the three-dimensional world or in a frame buffer.
The implementation of the division of the task by forming sections in an image is effected by subdividing a frame buffer into sub-sections which normally have the same size. With regard to dividing the frame buffer, there is the possibility of either associating the same with large, continuous pixel blocks or of effecting the association in an interleaved manner.
FIG. 21 shows the above-described possibilities of partitioning a frame buffer with regard to the case of a graphics system operating with four graphics processing engines. FIG. 21a shows the association of large continuous pixel blocks with the respective graphics processing engines. As can be seen, in this exemplary case, the frame buffer 10 is subdivided into four equally sized blocks which are associated with the engines. FIG. 21b shows the interleaved frame partitioning of the frame buffer 10, and it can be seen that the processing of the individual pixels 12, which are represented by the boxes in FIG. 21, is effected in an interleaved manner by the four graphics processing engines of the graphics system.
Interleaved partitioning is used very frequently, since it offers the advantage that the workload of the individual processors is automatically balanced. Except for the smallest polygons, all polygons are located in all partitions of the frame, so that almost every image renderer is supplied with the same number of pixels. Interleaved frame buffer partitioning is also referred to as “distributed frame buffer”.
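The load-balancing effect described above may be illustrated by the following minimal sketch. It assumes four engines, a 2×2 repeating interleave pattern, and a division into four quadrants for the continuous case; the exact patterns of FIG. 21 are not reproduced here, and all function names are hypothetical.

```python
def block_owner(x, y, width, height):
    """Engine index when the frame is split into four contiguous quadrants (assumed layout)."""
    return (2 if y >= height // 2 else 0) + (1 if x >= width // 2 else 0)


def interleaved_owner(x, y):
    """Engine index when ownership repeats in a 2x2 pixel pattern (assumed layout)."""
    return (y % 2) * 2 + (x % 2)


def load_per_engine(pixels, owner):
    """Count how many of the given pixels fall to each of the four engines."""
    counts = [0, 0, 0, 0]
    for x, y in pixels:
        counts[owner(x, y)] += 1
    return counts


# A small 4x4 patch of pixels in the top-left corner of a 16x16 frame.
patch = [(x, y) for y in range(4) for x in range(4)]
block_load = load_per_engine(patch, lambda x, y: block_owner(x, y, 16, 16))
interleaved_load = load_per_engine(patch, interleaved_owner)
```

For this patch, the continuous partitioning assigns all sixteen pixels to one engine, whereas the interleaved partitioning assigns four pixels to each engine.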
FIG. 22 shows a block diagram of a conventional graphics system having a pipeline for pixel processing. The graphics system, in its entirety, is denoted by reference numeral 14 and includes a scan converter 16 receiving, at its input, data which describe the graphic primitive, e.g., a polygon, to be processed. The scan converter 16 processes the received data and produces, at its output, interpolator commands which are entered into a parameter interpolator 18. The output of the parameter interpolator 18 is connected to a first input of a pixel pipeline 20. The output of the pixel pipeline 20 is connected to a memory subsystem 24 via a packing unit 22. Data from the memory subsystem 24 are supplied to the second input of the pixel pipeline 20 via a depacking unit 26.
FIG. 23 shows a block diagram of a conventional graphics system with a plurality of pipelines working in parallel. The graphics system, in its entirety, is denoted by reference numeral 28, and elements identical or similar to those of the system in FIG. 22 are provided with the same reference numerals. Unlike the graphics system illustrated in FIG. 22, the scan converter 16 is designed as a parallel scan converter and, similarly, the parameter interpolator 18 is designed as a parallel parameter interpolator. This parallel parameter interpolator 18 has a plurality of outputs for supplying data to a plurality of pixel pipelines 20₀ to 20ₙ, outputs of the pixel pipelines being connected with the packing unit 22. The depacking unit 26 is connected with the second inputs of the respective pixel pipelines 20₀ to 20ₙ.
Parallel image processing using interleaved frame partitioning constitutes a very suitable method for hardware implementation of image-rendering pipelines, as shown in FIG. 23. The memory subsystem 24 typically manages so-called memory words containing a plurality of pixels. A 128-bit word, for example, contains four color pixels (true color pixels), with each pixel including 32 bits. The memory subsystem 24 can either read or write such a word during a clock cycle. In a graphics system having a single pixel pipeline, such as is shown in FIG. 22, the depacking unit 26 must, for fragment calculation (e.g., texture fade-overs, reflecting additions, target fade-overs, dithering, raster operations, and the like), extract one pixel per clock and convert it into the internal color format. Packing unit 22 converts the results of the pixel pipeline calculation into the color format stored in the memory and unites several pixels to form one memory word.
Systems having several image-rendering pipelines, as are shown in FIG. 23, can process, in parallel, several pixels contained in one memory word. If the number of pixel pipelines is equal to the number of pixels per memory word, packing and depacking the same becomes trivial.
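The packing and depacking of a memory word may be sketched as follows for the 128-bit word with four 32-bit pixels mentioned above. The placement of pixel 0 in the least significant bits is an assumption made for illustration only, and the function names are hypothetical.

```python
PIXELS_PER_WORD = 4
BITS_PER_PIXEL = 32
PIXEL_MASK = (1 << BITS_PER_PIXEL) - 1


def pack_word(pixels):
    """Combine four 32-bit pixels into one 128-bit word (pixel 0 in the low bits, assumed)."""
    if len(pixels) != PIXELS_PER_WORD:
        raise ValueError("a memory word holds exactly four pixels")
    word = 0
    for index, pixel in enumerate(pixels):
        word |= (pixel & PIXEL_MASK) << (index * BITS_PER_PIXEL)
    return word


def unpack_word(word):
    """Split one 128-bit word back into its four 32-bit pixels."""
    return [(word >> (index * BITS_PER_PIXEL)) & PIXEL_MASK
            for index in range(PIXELS_PER_WORD)]
```

With four pixel pipelines, each pipeline can simply be wired to one fixed 32-bit slice of the word, which is why packing and depacking become trivial in that case.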
Graphics processing systems mostly use image-rendering engines whose primitives are polygons. In addition, these polygons are limited to certain types, such as triangles or quadrilateral elements. More complex polygons can then be defined using these graphic primitives.
The basic challenge in processing graphic primitives is that determining whether a point in a screen area is within or outside the graphic primitive to be rendered must be as simple as possible. For triangles, this can be achieved, for example, in that the three edges forming the graphic primitive are described by means of linear edge functions.
FIG. 24 shows an example of a linear edge function. In the Cartesian co-ordinate system in FIG. 24, an edge 30 of a graphic primitive is illustrated by way of example, and the starting point and the end point, respectively, of the edge are determined by the co-ordinates x0 and y0 and x1 and y1, respectively.
It can be determined by the edge function indicated in the right-hand section of FIG. 24 whether a point within the Cartesian co-ordinate system is located to the left or the right of the edge or on the edge. Point P is located on the edge 30 and, in this case, the value for the edge function is 0. Point Q is located to the right of edge 30 and, in this case, the result of the edge function is larger than 0, whereas for point R, which is located to the left of edge 30, the result of the edge function is smaller than 0. In other words, each of the linear edge functions yields a value of 0 for co-ordinates which are located exactly on the edge or on the line, a positive value for co-ordinates located to one side of the line or edge, and a negative value for co-ordinates located to the other side of the line or edge. The sign of the linear edge function subdivides the drawing surface into two half-planes.
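The behavior of such an edge function may be sketched as follows. The formula of FIG. 24 is not reproduced in the text, so the sketch assumes the common form E(x, y) = (x − x0)·(y1 − y0) − (y − y0)·(x1 − x0); which geometric side receives the positive sign depends on the orientation of the co-ordinate axes and is likewise an assumption.

```python
def edge_function(x0, y0, x1, y1, x, y):
    """Assumed linear edge function for the edge from (x0, y0) to (x1, y1).

    Returns 0 on the edge, and values of opposite sign on the two sides.
    """
    dx = x1 - x0
    dy = y1 - y0
    return (x - x0) * dy - (y - y0) * dx


# Edge from (0, 0) to (4, 4), with one point on the edge and one on each side.
on_edge = edge_function(0, 0, 4, 4, 2, 2)
one_side = edge_function(0, 0, 4, 4, 3, 1)
other_side = edge_function(0, 0, 4, 4, 1, 3)
```

In this example the point on the edge yields 0, and the two remaining points yield 8 and −8, respectively.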
Linear edge functions are further described in the following articles: J. Pineda, “A Parallel Algorithm for Polygon Rasterization”, SIGGRAPH Proceedings, Vol. 22, No. 4, 1988, pages 17 to 20; H. Fuchs et al., “Fast Spheres, Shadows, Textures, Transparencies, and Image Enhancements in Pixel-Planes”, SIGGRAPH Proceedings, Vol. 19, No. 3, 1985, pages 111 to 120; Dunnett, White, Lister, Grimsdale, University of Sussex, “The Image Chip for High Performance”, IEEE Computer Graphics and Applications, November 1992, pages 41 to 51.
By multiplying the edge functions by the value of −1, the sign for the half-planes can be inverted, and the edge function can further be normalized for indicating a distance of a point from the edge, as has been described by A. Schilling in “A New, Simple and Efficient Antialiasing with Subpixel Masks”, SIGGRAPH Proceedings, Vol. 25, No. 4, 1991, pages 133 to 141. This is useful, in particular, for pixel overlap calculations for performing edge antialiasing (antialiasing being a measure for reducing image distortions).
The linear edge functions are calculated incrementally from a given starting point, which is particularly desirable for hardware implementations, since this offers the possibility of merely using simple adders instead of costly multipliers. FIG. 25 shows an example of edge function increments, wherein the starting point is denoted by E, E+dex indicates the incrementation in the x direction, and E+dey indicates the incrementation in the y direction. The right-hand part of FIG. 25 describes the determination of the incremental values of dex and dey, respectively. If the edge function is, itself, normalized or inverted, it is required to also normalize and invert the delta values for the incremental steps, indicated in FIG. 25.
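The incremental evaluation may be sketched as follows. The increments of FIG. 25 are not reproduced in the text; the sketch assumes the edge function form E(x, y) = (x − x0)·(y1 − y0) − (y − y0)·(x1 − x0), for which a step of one pixel in x adds dex = y1 − y0 and a step of one pixel in y adds dey = −(x1 − x0). The function names are hypothetical.

```python
def edge_direct(x0, y0, x1, y1, x, y):
    """Direct evaluation, which needs two multiplications per point."""
    return (x - x0) * (y1 - y0) - (y - y0) * (x1 - x0)


def edge_setup(x0, y0, x1, y1, xs, ys):
    """Value E at the starting point (xs, ys) plus the per-pixel increments."""
    e = edge_direct(x0, y0, x1, y1, xs, ys)
    dex = y1 - y0        # added for each step of +1 in x
    dey = -(x1 - x0)     # added for each step of +1 in y
    return e, dex, dey


def walk_row(e, dex, count):
    """Edge function values for `count` successive pixels in x, using additions only."""
    values = []
    for _ in range(count):
        values.append(e)
        e += dex
    return values


e, dex, dey = edge_setup(0, 0, 4, 4, 0, 0)
row = walk_row(e, dex, 3)
```

After the set-up, only additions are required per pixel, which is the property that makes the scheme attractive for hardware.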
For a triangle, the three edge functions can be arranged such that all three edge functions supply positive values only for such co-ordinates which are located within the triangle. If at least one edge function yields a negative value, the co-ordinate in question, i.e., the pixel in question, is located outside the triangle. FIG. 26A shows the sign distribution for the three edges 30a, 30b, 30c of a triangle-shaped graphic primitive 32. The boxes 12 shown in FIG. 26 each represent a displayable pixel. As can be seen, at least one of the edge functions for the edges 30a to 30c yields a negative value whenever the co-ordinate is located outside the graphic primitive 32, and all three yield a result with a positive sign only when the co-ordinate is located within the same.
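The inside test for a triangle may be sketched as follows. The edge function form and the sign normalization by vertex order are assumptions made for illustration; points lying exactly on an edge are treated as outside here, which is a design choice of the sketch and not a statement about any particular hardware.

```python
def edge(a, b, p):
    """Assumed linear edge function of the edge a -> b, evaluated at point p."""
    return (p[0] - a[0]) * (b[1] - a[1]) - (p[1] - a[1]) * (b[0] - a[0])


def inside(tri, p):
    """True if all three edge functions are positive at p.

    The sign is normalized using the third vertex, so that the interior is the
    positive side of every edge regardless of the vertex order.
    """
    v0, v1, v2 = tri
    sign = 1 if edge(v0, v1, v2) > 0 else -1
    return all(sign * edge(a, b, p) > 0
               for a, b in ((v0, v1), (v1, v2), (v2, v0)))


triangle = [(0, 0), (8, 0), (0, 8)]
```

For this triangle, the point (2, 2) is reported as inside and the point (6, 6) as outside, irrespective of the order in which the vertices are listed.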
Typically, the scan conversion hardware obtains the edge function values of all three edges for a given starting point together with the delta values for the x and y directions, so as to enable incremental calculation of the successive co-ordinates. With each clock, the scan converter advances by one pixel in the horizontal direction or by one pixel in the vertical direction. FIG. 26B shows a potential traversing algorithm for passing through the triangle 32 already shown in FIG. 26A. The scan path is shown in FIG. 26B and, as can be seen, the triangle is passed through up to the last pixel 36 in the manner shown, starting from a starting pixel 34. From here, the algorithm jumps to a further graphic primitive to be processed. In traversing the graphic primitive, edge function values for earlier positions can be stored so as to enable a return to the same or to their neighbors. The aim is to consume as few clock cycles per triangle as possible or, in other words, to avoid the scanning of pixels outside the triangle, which will be referred to as invisible pixels in the further course of the description. For example, a simple method might consist in traversing all pixels which are contained within the bounding rectangle of the graphic primitive and in testing each of them for visibility. This would evidently mean that at least 50% of the pixels traversed would be invisible. In contrast to this, the algorithm shown in FIG. 26B is more refined in that it scans the triangle on a scan line-by-scan line basis, with a leading edge of the triangle being tracked. The leading edge of the triangle is that which exhibits the largest extension in a direction perpendicular to the scanning direction or to the scan line. With most triangle forms, traversing invisible pixels to a large extent is thereby avoided, and the percentage of scanned invisible pixels rises only for very narrow triangles.
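The cost of the simple method may be illustrated by the following sketch, which traverses every pixel of a region enclosing the primitive, here assumed to be its axis-aligned bounding rectangle, and counts the visible and invisible pixels. Pixels are sampled at their centers and vertices are assumed to have integer co-ordinates; both are assumptions of the sketch, as is the edge function form.

```python
def edge(a, b, p):
    """Assumed linear edge function of the edge a -> b, evaluated at point p."""
    return (p[0] - a[0]) * (b[1] - a[1]) - (p[1] - a[1]) * (b[0] - a[0])


def inside(tri, p):
    """True if p lies strictly inside the triangle, for either vertex order."""
    v0, v1, v2 = tri
    sign = 1 if edge(v0, v1, v2) > 0 else -1
    return all(sign * edge(a, b, p) > 0
               for a, b in ((v0, v1), (v1, v2), (v2, v0)))


def bounding_rectangle_scan(tri):
    """Visit every pixel of the bounding rectangle; return (visible, traversed)."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    visible = 0
    traversed = 0
    for y in range(min(ys), max(ys)):
        for x in range(min(xs), max(xs)):
            traversed += 1
            if inside(tri, (x + 0.5, y + 0.5)):   # sample at the pixel center
                visible += 1
    return visible, traversed


visible, traversed = bounding_rectangle_scan([(0, 0), (8, 0), (0, 8)])
```

For this right-angled triangle, 64 pixels are traversed and 28 of them are visible, so that more than half of the clock cycles would be spent on invisible pixels.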
The scan lines may be defined either horizontally or vertically or even with changing orientations, depending on the triangle to be examined. In practice, it is expedient to restrict scan conversion to horizontal scan lines, as this aligns the scan with the display-refreshing scan and, moreover, a memory access can typically be optimized only for one scan axis. If the scan lines are horizontally defined, the leading edge of the triangle is defined by the two vertices exhibiting the largest difference regarding their y co-ordinates. In order to assure symmetrical behavior after rasterization or scan conversion, it is desirable to change the vertical and horizontal scanning directions as a function of the inclination of the leading edge and as a function of the orientation of the triangle, respectively. FIG. 27 shows different scanning directions for different types of triangles. As can be seen, for the triangle of type A, the horizontal scanning direction is defined in the positive x direction, and the vertical scanning direction is defined in the positive y direction. For the triangle of type B, the horizontal scanning direction is defined in the negative x direction, and the vertical scanning direction is defined in the positive y direction. For the type C triangle, the horizontal scanning direction is defined in the positive x direction, and the vertical scanning direction is defined in the negative y direction, and for the type D triangle, the vertical scanning direction is defined in the negative y direction, and the horizontal scanning direction is defined in the negative x direction.
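For horizontal scan lines, the selection of the leading edge as stated above may be sketched as follows. The function name is hypothetical, and where two edges have the same vertical extent the sketch simply returns the first one found, which is an arbitrary choice.

```python
from itertools import combinations


def leading_edge(tri):
    """Return the pair of vertices with the largest difference in y."""
    return max(combinations(tri, 2),
               key=lambda pair: abs(pair[0][1] - pair[1][1]))


edge_found = leading_edge([(0, 0), (5, 2), (1, 9)])
```

In this example, the edge between the vertices (0, 0) and (1, 9) spans nine scan lines and is therefore selected.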
In the following, a more detailed description will be given of the memory subsystem mentioned with regard to FIGS. 22 and 23. Known graphics systems typically use dynamic random access memories (e.g., synchronous DRAMs) for frame buffer storage. Since the performance of the rasterizer is determined by the memory bandwidth, it is desirable to communicate with the memory in an efficient manner.
Large frame buffers (e.g., 1600×1280×32 bits = 8 MBytes) can be accommodated in only a small number of memory components. For assuring an adequate bandwidth, the memory is accessed via a broad path, and the same is limited only by the number of inputs/outputs (I/Os) present at the connection between the graphics chip and the memory (e.g., 128 data bits). Using modern technologies, such as double data rate transmission, frame buffers having bandwidths of more than 2 GByte/sec per graphics controller can be achieved. However, this bandwidth is not available for entirely random access.
A DRAM array consists of rows and columns, and access within one row (page) to varying columns will normally be very fast. Synchronous DRAMs can transfer data in each clock cycle, provided that they remain in the same row. Passing to a different row is equivalent to consuming several clock cycles for closing the old row and opening the new one.
These cycles cannot be utilized for actual data transmission, so that the overall bandwidth is reduced. In order to minimize this effect, modern DRAMs contain several (2 to 4) memory banks in which different rows may be open. An efficient image rendering system must take these properties into account in order to be able to yield optimum performance.
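The effect of row changes and of several banks on the usable bandwidth may be illustrated by the following simplified cost model. The penalty of six clock cycles per row change is an assumed figure chosen for illustration only, and real devices have more detailed timing rules than are modeled here.

```python
def access_cycles(accesses, row_switch_penalty=6):
    """Clock cycles needed for a sequence of (bank, row) accesses.

    Each access costs one cycle; opening a row that is not the currently
    open row of its bank costs an additional, assumed penalty.
    """
    open_rows = {}           # bank -> row currently open in that bank
    cycles = 0
    for bank, row in accesses:
        if open_rows.get(bank) != row:
            cycles += row_switch_penalty
            open_rows[bank] = row
        cycles += 1
    return cycles


same_row = access_cycles([(0, 0)] * 8)
alternating_rows_one_bank = access_cycles([(0, 0), (0, 1)] * 4)
alternating_rows_two_banks = access_cycles([(0, 0), (1, 1)] * 4)
```

Under these assumptions, eight accesses take 14 cycles within one row, 56 cycles when alternating between two rows of the same bank, and 20 cycles when the two rows lie in different banks.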
A known technique in memories is referred to as “memory tiling”, i.e., the subdivision of the memory into tiles or blocks. In this case, rectangular-shaped areas of a mapping screen are mapped to blocks (tiles) in the memory. Small triangles have a tendency to completely fall into one block, which means that these do not lead, during image rendering, to page faults in accessing the memory. The performance of the graphics system in processing triangles which intersect several blocks, i.e., which extend over several blocks, can be enhanced by mapping adjacent blocks onto different memory banks in the form of a chessboard. One example of a potential memory partitioning is shown in FIG. 28, in which each block has a size of 2 Kbytes.
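A possible mapping of screen co-ordinates onto blocks and banks may be sketched as follows. The partitioning of FIG. 28 is not reproduced in the text; the sketch assumes 32-bit pixels, blocks of 32×16 pixels (which gives the stated 2 Kbytes per block), and two banks arranged as a chessboard. All names are hypothetical.

```python
BLOCK_W, BLOCK_H = 32, 16                            # assumed block shape in pixels
BYTES_PER_PIXEL = 4                                  # assumed 32-bit pixels
BLOCK_BYTES = BLOCK_W * BLOCK_H * BYTES_PER_PIXEL    # 2048 bytes per block


def block_of(x, y):
    """Column and row index of the block containing pixel (x, y)."""
    return x // BLOCK_W, y // BLOCK_H


def bank_of(x, y):
    """Chessboard assignment: horizontally or vertically adjacent blocks differ in bank."""
    bx, by = block_of(x, y)
    return (bx + by) % 2


def offset_in_block(x, y):
    """Byte offset of pixel (x, y) within its block, row by row."""
    return ((y % BLOCK_H) * BLOCK_W + (x % BLOCK_W)) * BYTES_PER_PIXEL
```

With this arrangement, a triangle crossing from one block into a neighboring block finds the neighboring row in the other bank, which may already be open.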
From U.S. Pat. No. 5,367,632, a graphics system is known which has a plurality of graphics rendering elements arranged in the manner of pipelines, each pipeline being associated with a rasterizer having a corresponding memory. The individual memories are conventional memory elements which per se each form a frame buffer for the respective pipeline. The memories are not arranged in any specific organization.
U.S. Pat. No. 5,821,944 describes “memory tiling”, wherein a screen area, onto which a graphic primitive is to be mapped, is subdivided into a plurality of fields or blocks. Specification of the blocks is followed by a two-step scan. In the first step, it is established which of the blocks comprise a portion of the graphic primitive to be processed. Subsequently, the blocks which have just been determined are scanned in the second step. The individual blocks are selected so as to be associated with corresponding memory areas, the memory areas associated with the respective blocks being stored in a cache memory during the processing.
The graphics systems known from the prior art for processing three-dimensional images are disadvantageous, however, in that optimum utilization of the memory capacities is not ensured. For this reason, and on the grounds of the rasterization methods known from the prior art, the performance of these systems is limited.
It is the object of the present invention to provide a method for rasterizing a graphic primitive which exhibits increased performance compared with the methods known in the prior art.
In accordance with a first aspect, the present invention provides a method for rasterizing a graphic primitive in a graphics system for generating pixel data for the graphic primitive, starting from graphic primitive description data, with the graphics system having a memory divided up into a plurality of blocks, each of which is associated with a predetermined one of a plurality of areas on a mapping screen. In a first step, the pixels associated with the graphic primitive are scanned in one of the plurality of blocks into which the graphic primitive extends, and this step is repeated until all pixels associated with the graphic primitive have been scanned in each of the plurality of blocks into which the graphic primitive extends. Subsequently, the pixel data obtained are output for further processing.
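The effect of scanning block by block may be illustrated by the following sketch, which compares the number of block changes for a plain scan-line order with that for an order in which each block is completed before the next is entered. The sketch obtains the visible pixels by testing a bounding rectangle, which is done for brevity only and is not the traversal taught here; the block size, the block order, and all names are assumptions.

```python
def edge(a, b, p):
    """Assumed linear edge function of the edge a -> b, evaluated at point p."""
    return (p[0] - a[0]) * (b[1] - a[1]) - (p[1] - a[1]) * (b[0] - a[0])


def inside(tri, p):
    """True if p lies strictly inside the triangle, for either vertex order."""
    v0, v1, v2 = tri
    sign = 1 if edge(v0, v1, v2) > 0 else -1
    return all(sign * edge(a, b, p) > 0
               for a, b in ((v0, v1), (v1, v2), (v2, v0)))


def visible_pixels(tri):
    """Pixels whose centers lie inside the triangle, in scan-line order."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    return [(x, y)
            for y in range(min(ys), max(ys) + 1)
            for x in range(min(xs), max(xs) + 1)
            if inside(tri, (x + 0.5, y + 0.5))]


def block_of(pixel, block_w, block_h):
    """Index of the block containing the pixel."""
    return pixel[0] // block_w, pixel[1] // block_h


def scan_block_by_block(tri, block_w, block_h):
    """Output all pixels of one block before moving on to the next block."""
    blocks = {}
    for pixel in visible_pixels(tri):
        blocks.setdefault(block_of(pixel, block_w, block_h), []).append(pixel)
    order = []
    for key in sorted(blocks, key=lambda k: (k[1], k[0])):   # assumed block order
        order.extend(blocks[key])
    return order


def block_switches(order, block_w, block_h):
    """Number of times consecutive pixels in the order lie in different blocks."""
    ids = [block_of(p, block_w, block_h) for p in order]
    return sum(1 for a, b in zip(ids, ids[1:]) if a != b)


tri = [(0, 0), (8, 0), (0, 8)]
by_block = block_switches(scan_block_by_block(tri, 4, 4), 4, 4)
by_scan_line = block_switches(visible_pixels(tri), 4, 4)
```

For this triangle and blocks of 4×4 pixels, the block-wise order changes block twice, whereas the plain scan-line order changes block seven times.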
In accordance with a second aspect, a method for rasterizing a graphic primitive in a graphics system is provided for generating pixel data for the graphic primitive, starting from graphic primitive description data, the graphics system including a plurality of graphics processing pipelines. Initially, a plurality of adjacent pixels are simultaneously scanned, with the adjacent pixels forming a cluster, at least one of the plurality of adjacent pixels being associated with the graphic primitive, and with the number of the pixels being simultaneously scanned depending on the number of graphics processing pipelines in the graphics system. Subsequently, this step is repeated until all pixels associated with the graphic primitive have been scanned, and, finally, all the pixel data are output.
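The scanning of pixel clusters may be sketched as follows for four pipelines. The cluster shape of 2×2 pixels, the alignment of clusters to a fixed grid, and the use of a bounding rectangle to find candidate clusters are assumptions made for illustration; vertices are assumed to have integer co-ordinates.

```python
def edge(a, b, p):
    """Assumed linear edge function of the edge a -> b, evaluated at point p."""
    return (p[0] - a[0]) * (b[1] - a[1]) - (p[1] - a[1]) * (b[0] - a[0])


def inside(tri, p):
    """True if p lies strictly inside the triangle, for either vertex order."""
    v0, v1, v2 = tri
    sign = 1 if edge(v0, v1, v2) > 0 else -1
    return all(sign * edge(a, b, p) > 0
               for a, b in ((v0, v1), (v1, v2), (v2, v0)))


def scan_clusters(tri, cluster_w=2, cluster_h=2):
    """Return one (origin, mask) entry per cluster containing a visible pixel.

    The mask holds one visibility flag per pixel of the cluster, i.e. one
    flag per pipeline; a cluster is emitted if at least one flag is set.
    """
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    x_start = (min(xs) // cluster_w) * cluster_w   # align to the cluster grid
    y_start = (min(ys) // cluster_h) * cluster_h
    steps = []
    for cy in range(y_start, max(ys) + 1, cluster_h):
        for cx in range(x_start, max(xs) + 1, cluster_w):
            mask = [inside(tri, (cx + i + 0.5, cy + j + 0.5))
                    for j in range(cluster_h) for i in range(cluster_w)]
            if any(mask):
                steps.append(((cx, cy), mask))
    return steps


steps = scan_clusters([(0, 0), (8, 0), (0, 8)])
pixels_covered = sum(sum(mask) for _, mask in steps)
```

For this triangle, 10 cluster steps cover all 28 visible pixels, as compared with 28 steps when one pixel is scanned at a time.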
The present invention is based on the realization that the performance of graphics processing systems can be increased in that, on the one hand, the graphic primitives to be scanned are traversed in an “intelligent” manner and/or that, on the other hand, the performance of the system is increased by a further parallelization of data processing.
In accordance with the present invention, a method is taught which implements a “monolithic algorithm” in which all of the aspects explained above can be used together, individually or in any combination so as to increase the system's performance. This results in a “scalable architecture” of the graphics processing means to be used.
Several image-rendering pipelines are supported on one individual chip such that each of the same processes a different pixel of a memory word. This requires that the parallel scan converter functions in an operating mode referred to as locked scan. This means that the pixels processed in parallel always have a fixed geometric relationship with one another (pixel cluster). This facilitates hardware implementation with regard to the memory sub-system. Furthermore, this enables application of the method to chips with several image-rendering pipelines, independent of the chip layout.
A further advantage of the method is that it is possible to combine several individual chips (see above) in one system so as to increase the performance thereof with each chip added. In addition, different chips in the system may serve to fulfil different tasks and to process a different number of pixels, i.e., clusters of different sizes, in parallel. In this case, it is not necessary for the scan converters of the parallel image-rendering chips in the system to be interlocked, since each of same has its own frame buffer memory, and the supply of the polygon data can be decoupled using FIFOs.
A further advantage of the present invention consists in memory utilization. Memory utilization mainly depends on the efficiency of memory control and the memory decision circuit (arbitration circuit). However, even with an ideal memory interface unit, the randomness of the pixel accesses may ruin memory utilization, in particular when scanning small triangles, this effect being even further aggravated in parallel image rendering, where the triangles are subdivided into smaller sections. This problem is avoided in accordance with the present invention, since the same is based on the realization that the number of page faults per triangle can be minimized in the event that the scan converter has knowledge with regard to mapping the screen areas onto the memory address area (tiling). Further, the average number of memory banks which are simultaneously open may also be reduced which, again, reduces potential collisions in systems where several requests (texture reading operation, graphics rendering engine reading/writing operation, display screen reading operation) are effected with regard to a shared memory element (linked memory).
Another advantage of the present invention is that the efficiency of cache storage of texture data can be improved by the method in accordance with the invention. Typically, bi-linear or tri-linear filtering is used for texture mapping. Without caching the texture data, four (bi-linear filtering) or even eight (tri-linear filtering) unfiltered texels become necessary which would have to be provided by the memory subsystem per pixel. A texture cache memory can benefit from the fact that adjacent texels can be reused during the passing of a scan line. The extent of the reuse strongly depends on the magnification/reduction chosen and is significant if a suitable MIP-map level is selected. Within one scan line, only a very small texture cache memory is required in order to benefit from this advantage. In order to reuse the adjacent texels of a previous scan line, however, the texture cache memory must contain a complete scan line of the texels. In practice, a cache memory size will be selected which is capable of storing scan lines for triangles or graphic primitives of an average size, whereby the efficiency for larger triangles somewhat decreases. In this connection, a further advantage of the present invention is that a maximum length of a scan line can be guaranteed by the scan converter, so that the cache memory can be accurately dimensioned and is normally considerably smaller than that required for storing scan lines for average triangles.
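The reuse of texels along a scan line may be illustrated by the following sketch for bi-linear filtering. It assumes texel centers at half-integer texture co-ordinates, a magnification of 1:1, and an idealized cache which never evicts a texel; these are assumptions of the sketch, and the names are hypothetical.

```python
import math


def bilinear_footprint(u, v):
    """The four texels contributing to a bi-linearly filtered sample at (u, v)."""
    u0 = math.floor(u - 0.5)   # assumed convention: texel centers at k + 0.5
    v0 = math.floor(v - 0.5)
    return [(u0, v0), (u0 + 1, v0), (u0, v0 + 1), (u0 + 1, v0 + 1)]


def fetch_counts(samples):
    """Return (texel reads without a cache, distinct texels actually needed)."""
    total = 0
    unique = set()
    for u, v in samples:
        footprint = bilinear_footprint(u, v)
        total += len(footprint)
        unique.update(footprint)
    return total, len(unique)


# Eight adjacent pixels of one scan line, mapped 1:1 onto the texture.
scan_line = [(x + 0.75, 3.25) for x in range(8)]
uncached, cached = fetch_counts(scan_line)
```

For these eight pixels, 32 texel reads are required without a cache, whereas only 18 distinct texels are involved, since each pixel shares two texels with its neighbor.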