The present application relates to computer graphics rendering systems and methods, and particularly to handling of texture data used by rendering accelerators for 3D graphics.
One of the driving features in the performance of most single-user computers is computer graphics. This is particularly important in computer games and workstations, but is generally very important across the personal computer market.
For some years the most critical area of graphics development has been in three-dimensional ("3D") graphics. The peculiar demands of 3D graphics are driven by the need to present a realistic view, on a computer monitor, of a three-dimensional scene. The pattern written onto the two-dimensional screen must therefore be derived from the three-dimensional geometries in such a way that the user can easily "see" the three-dimensional scene (as if the screen were merely a window into a real three-dimensional scene). This requires extensive computation to obtain the correct image for display, taking account of surface textures, lighting, shadowing, and other characteristics.
The starting point (for the aspects of computer graphics considered in the present application) is a three-dimensional scene, with specified viewpoint and lighting (etc.). The elements of a 3D scene are normally defined by sets of polygons (typically triangles), each having attributes such as color, reflectivity, and spatial location. (For example, a walking human, at a given instant, might be translated into a few hundred triangles which map out the surface of the human's body.) Textures are "applied" onto the polygons, to provide detail in the scene. (For example, a flat carpeted floor will look far more realistic if a simple repeating texture pattern is applied onto it.) Designers use specialized modelling software tools, such as 3D Studio, to build textured polygonal models.
The 3D graphics pipeline consists of two major stages, or subsystems, referred to as geometry and rendering. The geometry stage is responsible for managing all polygon activities and for converting three-dimensional spatial data into a two-dimensional representation of the viewed scene, with properly-transformed polygons. The polygons in the three-dimensional scene, with their applied textures, must then be transformed to obtain their correct appearance from the viewpoint of the moment; this transformation requires calculation of lighting (and apparent brightness), foreshortening, obstruction, etc.
However, even after these transformations and extensive calculations have been done, there is still a large amount of data manipulation to be done: the correct values for EACH PIXEL of the transformed polygons must be derived from the two-dimensional representation. (This requires not only interpolation of pixel values within a polygon, but also correct application of properly oriented texture maps.) The rendering stage is responsible for these activities: it "renders" the two-dimensional data from the geometry stage to produce correct values for all pixels of each frame of the image sequence.
The most challenging 3D graphics applications are dynamic rather than static. In addition to changing objects in the scene, many applications also seek to convey an illusion of movement by changing the scene in response to the user's input. Whenever a change in the orientation or position of the camera is desired, every object in a scene must be recalculated relative to the new view. As can be imagined, a fast-paced game needing to maintain a high frame rate will require many calculations and many memory accesses.
FIG. 2 shows a high-level overview of the processes performed in the overall 3D graphics pipeline. However, this is a very general overview, which ignores the crucial issues of what hardware performs which operations.
While the geometry stages of the 3D pipeline are traditionally left to the host CPU with its powerful computational capabilities, the actual drawing of pixels to the 2D display, called rendering, is best performed by specialized hardware: the pixel engine, also called the 3D hardware accelerator. At the top of the 3D graphics pipeline, the bottleneck is how fast the calculations can be performed. At the rendering stage the bottleneck is memory access: how fast the pixel reads and writes to the frame buffer (display memory) and other special-purpose memory blocks can be performed. The renderer must be able to process thousands of polygons for each frame which, as mentioned above, must further be updated many times each second in order to sustain an illusion of motion.
Modern computer systems normally manipulate graphical objects as high-level entities. For example, a solid body may be described as a collection of triangles with specified vertices, or a straight line segment may be described by listing its two endpoints with three-dimensional or two-dimensional coordinates. Such high-level descriptions are a necessary basis for high-level geometric manipulations, and also have the advantage of providing a compact format which does not consume memory space unnecessarily.
Such higher-level representations are very convenient for performing the many required computations. For example, ray-tracing or other lighting calculations may be performed, and a projective transformation can be used to reduce a three-dimensional scene to its two-dimensional appearance from a given viewpoint. However, when an image containing graphical objects is to be displayed, a very low-level description is needed. For example, in a conventional CRT display, a "flying spot" is moved across the screen (one line at a time), and the beam from each of three electron guns is switched to a desired level of intensity as the flying spot passes each pixel location. Thus at some point the image model must be translated into a data set which can be used by a conventional display. This operation is known as "rendering."
The graphics-processing system typically interfaces to the display controller through a "frame store" or "frame buffer" of special two-port memory, which can be written to randomly by the graphics processing system, but also provides the synchronous data output needed by the video output driver. (Digital-to-analog conversion is also provided after the frame buffer.) This interface relieves the graphics-processing system of most of the burden of synchronization for video output. Nevertheless, the amounts of data which must be moved around are very sizable, and the computational and data-transfer burden of placing the correct data into the frame buffer can still be very large.
Even if the computational operations required are quite simple, they must be performed repeatedly on a large number of datapoints. For example, in a typical high-end configuration, a display of 1280×1024 elements may need to be refreshed at 85 Hz, with a color resolution of 24 bits per pixel. If blending is desired, additional bits (e.g. another 8 bits per pixel) will be required to store an "alpha" or transparency value for each pixel. This implies manipulation of more than 3 billion bits per second, without allowing for any of the actual computations being performed. Thus it may be seen that this is an environment with unique data manipulation requirements.
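The arithmetic behind that figure can be checked directly. The sketch below simply multiplies out the resolution, color depth, and refresh rate quoted above; the variable names are illustrative only.

```python
# Back-of-envelope refresh bandwidth for the display described above:
# 1280x1024 pixels, refreshed at 85 Hz, with 24-bit color plus an
# 8-bit alpha channel (32 bits per pixel total).
width, height = 1280, 1024
refresh_hz = 85
bits_per_pixel = 24 + 8  # color + alpha

bits_per_second = width * height * bits_per_pixel * refresh_hz
print(f"{bits_per_second / 1e9:.2f} Gbit/s")  # roughly 3.57 Gbit/s
```

This confirms the "more than 3 billion bits per second" figure, before any rendering computation is counted at all.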
If the display is unchanging, no demand is placed on the rendering operations. However, some common operations (such as zooming or rotation) will require every object in the image space to be re-rendered. Slow rendering will make the rotation or zoom appear jerky. This is highly undesirable. Thus efficient rendering is an essential step in translating an image representation into the correct pixel values. This is particularly true in animation applications, where newly rendered updates to a computer graphics display must be generated at regular intervals.
The rendering requirements of three-dimensional graphics are particularly heavy. One reason for this is that, even after the three-dimensional model has been translated to a two-dimensional model, some computational tasks may be bequeathed to the rendering process. (For example, color values will need to be interpolated across a triangle or other primitive.) These computational tasks tend to burden the rendering process. Another reason is that since three-dimensional graphics are much more lifelike, users are more likely to demand a fully rendered image. (By contrast, in the two-dimensional images created e.g. by a GUI or simple game, users will learn not to expect all areas of the scene to be active or filled with information.)
FIG. 2 is a very high-level view of other processes performed in a 3D graphics computer system. A three dimensional image which is defined in some fixed 3D coordinate system (a "world" coordinate system) is transformed into a viewing volume (determined by a view position and direction), and the parts of the image which fall outside the viewing volume are discarded. The visible portion of the image volume is then projected onto a viewing plane, in accordance with the familiar rules of perspective. This produces a two-dimensional image, which is now mapped into device coordinates. It is important to understand that all of these operations occur prior to the operations performed by the rendering subsystem of the present invention.
There are different ways to add complexity to a 3D scene. Creating more and more detailed models, consisting of a greater number of polygons, is one way to add visual interest to a scene. However, adding polygons necessitates paying the price of having to manipulate more geometry. 3D systems have what is known as a "polygon budget," an approximate number of polygons that can be manipulated without unacceptable performance degradation. In general, fewer polygons yield higher frame rates.
The visual appeal of computer graphics rendering is greatly enhanced by the use of "textures." A texture is a two-dimensional image which is mapped into the data to be rendered. Textures provide a very efficient way to generate the level of minor surface detail which makes synthetic images realistic, without requiring transfer of immense amounts of data. Texture patterns provide realistic detail at the sub-polygon level, so the higher-level tasks of polygon-processing are not overloaded. See Foley et al., Computer Graphics: Principles and Practice (2.ed. 1990, corr. 1995), especially at pages 741-744; Paul S. Heckbert, "Fundamentals of Texture Mapping and Image Warping," Thesis submitted to Dept. of EE and Computer Science, University of California, Berkeley, Jun. 17, 1994; Heckbert, "Survey of Texture Mapping," IEEE Computer Graphics and Applications, November 1986, pp. 56-67; all of which are hereby incorporated by reference. Game programmers have also found that texture mapping is generally a very efficient way to achieve very dynamic images without requiring a hugely increased memory bandwidth for data handling.
A typical graphics system reads data from a texture map, processes it, and writes color data to display memory. The processing may include mipmap filtering which requires access to several maps. The texture map need not be limited to colors, but can hold other information that can be applied to a surface to affect its appearance; this could include height perturbation to give the effect of roughness. The individual elements of a texture map are called "texels."
Awkward side-effects of texture mapping occur unless the renderer can apply texture maps with correct perspective. Perspective-corrected texture mapping involves an algorithm that translates "texels" (pixels from the bitmap texture image) into display pixels in accordance with the spatial orientation of the surface. Since the surfaces are transformed (by the host or geometry engine) to produce a 2D view, the textures will need to be similarly transformed by a linear transform (normally projective or "affine"). (In conventional terminology, the coordinates of the object surface, i.e. the primitive being rendered, are referred to as an (s,t) coordinate space, and the map of the stored texture is referred to as a (u,v) coordinate space.) The transformation in the resulting mapping means that a horizontal line in the (x,y) display space is very likely to correspond to a slanted line in the (u,v) space of the texture map, and hence many additional reads will occur, due to the texturing operation, as rendering walks along a horizontal line of pixels.
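A toy illustration of the screen-to-texture mapping may make this concrete. The sketch below treats only the affine case (a full projective mapping would add a per-pixel divide), and all coefficient values are hypothetical, chosen only to show that walking a horizontal span of pixels moves diagonally through (u,v) space.

```python
# Minimal affine mapping from screen (x, y) to texture (u, v):
#   u = a*x + b*y + c,   v = d*x + e*y + f
# Coefficients here are arbitrary illustrative values.
def screen_to_texture(x, y, a=0.7, b=0.4, c=0.0, d=-0.3, e=0.9, f=0.0):
    u = a * x + b * y + c
    v = d * x + e * y + f
    return u, v

# Walk one horizontal scan-line span at y = 10 (constant y):
path = [screen_to_texture(x, 10) for x in range(5)]
# Each step advances u by a=0.7 and v by d=-0.3, so the horizontal
# screen line traces a slanted line through the texture map.
```

Because successive texel fetches step in both u and v, they rarely fall in the same row of the stored texture, which is why texturing a single scan line can touch many separate regions of texture memory.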
Due to the extremely high data rates required at the end of the rendering pipeline, many features of computer architecture take on new complexities in the context of computer graphics (and especially in the area of texture management).
In defining computer architectures, one of the basic trade-offs is memory speed versus cost: faster memories cost more. SRAMs are much more expensive (per bit) than DRAMs, and DRAMs are much more expensive (per bit) than disk memory. The price of all of these has been steadily decreasing over time, but this relationship has held true for many years. Thus computer architectures usually include multiple levels of memory: the smallest and fastest memory is most closely coupled to the processor, and one or more layers successively larger, slower, and cheaper.
The fastest memory is that which is completely integrated with the processor. An essential part of microprocessor architecture is various read-write registers, which are intimately intertwined with the hardware logic circuits of the microprocessor. Some of these registers have dedicated functions, but others may be provided for "scratchpad" space usable by software. These registers are often overlooked in the memory hierarchy; but many of them can be directly accessed by software, and they may therefore be thought of as the innermost circle of the memory hierarchy. (A variant on this is a multi-chip module which includes additional memory in the same package with a microprocessor chip. An example of this is the DS5000 module from Dallas Semiconductor, which includes a dedicated local bus, with a battery-backed SRAM, in the same sealed package as a microcontroller.)
When the central processing unit (CPU) executes software, it will often have to read or write to an arbitrary (unpredictable) address. This address will correspond to some specific portion of some specific memory chip in the main memory. (In a virtual memory system, an arbitrary address may correspond to a physical location which is in main memory or mass storage (e.g. disk). In such systems, address translation performs fetches from mass storage if needed, transparently to the CPU. Virtual memory management, like cache management, is an important architectural design choice, and "memory management" logic often performs functions related to virtual memory management as well as to cache management. However, the needs and impact of virtual memory operation are largely irrelevant to the disclosed innovations, and will be largely ignored in the present application.) However, main memory typically has a minimum access time which is several times as long as the basic CPU clock cycle. This causes "wait states," which are undesirable. The net effective speed of a large DRAM memory can be increased by using bank organization and/or page mode accesses; but such features can still provide only a limited speed improvement, and net effective speed of a large DRAM memory (as seen by the processor) will still typically be much slower than that of the processor. (For example, a 500 MHz processor will have a clock period of about 2 nsec. However, low-priced DRAM memories typically have access times of 50 ns or more. Thus, when a 2 ns processor attempts to read 50 ns DRAM memory, the processor must wait for several of its cycles until the memory returns data. Such "wait states" degrade the net performance of the processor.) Thus, further speed improvement is still needed, and other techniques must be used to achieve this.
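The wait-state arithmetic above can be checked with a one-line calculation, using the 500 MHz / 50 ns figures quoted in the text (this is an order-of-magnitude sketch; it ignores bus overhead and page-mode effects):

```python
# A 500 MHz processor has a 2 ns clock period; a low-priced DRAM
# has a 50 ns access time.
cpu_cycle_ns = 2
dram_access_ns = 50

# Processor cycles spent waiting for one uncached DRAM read:
wait_cycles = dram_access_ns // cpu_cycle_ns  # 25 cycles stalled
```

So an uncached read costs on the order of 25 processor cycles, which is exactly the gap that caches and other techniques are meant to hide.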
The addresses actually used by almost any software program will be found to include a high concentration of accesses within a few neighborhoods of address space. Thus, it has long been recognized that computer performance, for a given price, can be improved by using a small amount of fast (expensive) memory to provide temporary storage for recently-accessed addresses. Whenever the same address is accessed again, it can be read from the fast memory, instead of the slower main memory. Such memory is called cache memory. One or more layers of cache memory may be used.
Usually cache memory includes one or more fast SRAM chips, which are closely coupled to the CPU by a high-speed bus. A variation of this, used in the Intel x86 processors, is an on-chip cache memory which is integrated on the same chip with a microprocessor. Such on-chip cache memory is often used in combination with a larger external cache. Thus, this is one of the first examples, in PC architectures, of a multi-level cache hierarchy. Multi-level cache architectures have been widely discussed in the last decade, and have been used in a number of high-speed computers.
The main memory usually consists of volatile semiconductor random access memory (typically DRAM). This will normally be organized with various architectural tricks to hasten average access time, but only a limited amount of improvement can be readily achieved by such methods. (A small amount of nonvolatile memory, e.g. ROM, EPROM, EEPROM, or flash EPROM, will also be used to store initialization routines. Some of these technologies have a cost per bit which is nearly as low as DRAM, but these technologies tend to have access times which are slower than DRAM. Moreover, since these are read-only or read-mostly memories, they are not suited for general-purpose random-access memory.)
Behind the main memory, there will be one or more layers of nonvolatile mass storage. Nearly any computer will have a magnetic disk drive, and may also have optical read-only disk drive (CDROM), magnetooptic memory, magnetic tape, etc.
Some further background discussion of cache management can be found in Przybylski, Cache and Memory Hierarchy Design (1990); Handy, The Cache Memory Book (1998); Hennessy and Patterson, Computer Architecture: a Quantitative Approach (2.ed. 1996); Hwang and Briggs, Computer Architecture and Parallel Processing (1984); and Loshin, Efficient Memory Programming (1998); all of which are hereby incorporated by reference.
The above general discussion shows why a cache memory may be desirable in principle. However, there are significant variations possible in the implementation of cache memory. Some of the details of cache operation will now be reviewed, to show where important design choices appear.
When the CPU needs to read data, it outputs the address and activates the control signals. In a cache system, the cache controller will check the most significant bits of this address against a table of cached data. If a match is found (i.e. a "cache hit" occurs), the controller must find where this data lies in the fast memory of the cache. The cache controller blocks or halts the read from main memory, and instead commands the cache memory to output the contents of the physical address at which the correct data is stored.
In a direct-mapped cache system, each line of data, if present, can only be in one place in the cache memory's address space. Thus, as soon as the cache controller detects a hit, it immediately knows what physical address to access in the cache memory SRAM. By contrast, in a fully associative cache memory, a block of data may be anywhere in the cache. The risk in a direct-mapped system is that some combinations of lines cannot simultaneously be present in cache. The penalty in a fully associative system is that the controller has to look through a table of all cache addresses to find the desired block of data. Thus, many systems use set-associative mapping (where a given block of data may be anywhere within a proper subset of the cache's physical address space).
A set-associative cache architecture will commonly be described as having a certain number of "ways," e.g. "4-way" or "2-way." As with a direct-mapped cache architecture, the most significant bits of the address define which line in cache can contain the cached data. However, with set-associative cache architectures, each line contains several units of data. In a 4-way set-associative cache, each line will contain four "ways," and each way consists of tag bits plus the corresponding data bits.
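The lookup procedure just described can be sketched in a few lines. This is an illustrative model only, not the design of any particular controller; the cache geometry (64 sets, 16-byte lines as in the 486 example, 4 ways) and the function names are assumptions for the sketch.

```python
# Toy 4-way set-associative lookup: the address's index bits select a
# set, and the tag is compared against all ways in that set.
NUM_SETS = 64     # hypothetical geometry
LINE_BYTES = 16   # 16-byte lines, as in the 486 example below
WAYS = 4

# cache[set_index] is a list of up to WAYS (tag, data) entries.
cache = [[] for _ in range(NUM_SETS)]

def split_address(addr):
    offset = addr % LINE_BYTES                  # byte within the line
    index = (addr // LINE_BYTES) % NUM_SETS     # which set to search
    tag = addr // (LINE_BYTES * NUM_SETS)       # stored for comparison
    return tag, index, offset

def lookup(addr):
    tag, index, _ = split_address(addr)
    for way_tag, data in cache[index]:
        if way_tag == tag:
            return data      # cache hit: tag matched in one of the ways
    return None              # cache miss

def fill(addr, data):
    tag, index, _ = split_address(addr)
    ways = cache[index]
    if len(ways) >= WAYS:
        ways.pop(0)          # evict the first-in entry (FIFO, for brevity)
    ways.append((tag, data))
```

Note that two addresses within the same 16-byte line share a tag and index, so a single fill satisfies lookups anywhere in that line.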
If no match is found (i.e. a "cache miss" occurs), the controller allows an access to main memory to continue (or begin). When the data is returned from main memory (which will typically require at least several CPU clock cycles), the CPU receives it immediately, and the cache controller loads it into the cache memory. The cache location used for new data may be randomly chosen, or may be chosen by computation of which data is least-recently used.
Personal computer systems, unlike larger computer systems, have historically used a single-processor architecture. In such architectures, a single microprocessor runs the application software. (However, many other microprocessors, microcontrollers, or comparably complex pieces of programmable logic, have been employed in support tasks, particularly for I/O management.) By contrast, supercomputers, mainframes, and many minicomputers use multiprocessing systems. In such systems many CPUs are active at the same time to execute the primary application software, and the allocation of tasks is typically at least partly invisible to the application software.
Thus, personal computer designers have not needed to pay much attention to the data synchronization issues which can be so critical in larger systems. However, direct-memory-access is typically provided in personal computer systems, and presents some of the same issues as a true multiprocessing system.
One feature which rapidly became standard, in the early development of personal computer architectures, is direct memory access. If peripheral devices are allowed to access memory directly, then the CPU can perform other tasks while a long transfer of data is occurring. However, the possibility that data may be accessed independently of the CPU means that problems of data coherency may arise.
The simple approach to such problems of data coherency has been to use pure write-through caching operation. This avoids coherency problems, but means that write operations derive no benefit whatsoever from the presence of a cache.
The unit of data handled by the cache is referred to as a "line" of data. (For example, in the 486's 8 KB on-chip cache, each cache line is 16 bytes long.)
Cache line size can impact system performance. If the line size is too large, then the number of blocks that can fit in the cache is reduced. In addition, as the line length is increased the latency for the external memory system to fill a cache line increases, reducing overall performance.
Due to the complexity and criticality of caching and other memory management issues, a wide variety of custom VLSI integrated circuits for memory management have been offered by various chip vendors. One of particular interest is the Intel 82495XP cache controller chip. This chip (which was originally developed for use with Intel's 860 RISC processor) permits block-wise programming of cache modes, so that cache modes can be assigned to different blocks of memory.
In many areas of computer graphics a succession of slowly changing pictures are displayed rapidly one after the other, to give the impression of smooth movement, in much the same way as for cartoon animation. In general the higher the speed of the animation, the smoother (and better) the result.
When an application is generating animation images, it is normally necessary not only to draw each picture into the frame buffer, but also to first clear down the frame buffer, and to clear down auxiliary buffers such as depth (Z) buffers, stencil buffers, alpha buffers and others. A good treatment of the general principles may be found in Computer Graphics: Principles and Practice, James D. Foley et al., Reading MA: Addison-Wesley. A specific description of the various auxiliary buffers may be found in The OpenGL Graphics System: A Specification (Version 1.0), Mark Segal and Kurt Akeley, SGI.
In most applications the value written, when clearing any given buffer, is the same at every pixel location, though different values may be used in different auxiliary buffers. Thus the frame buffer is often cleared to the value which corresponds to black, while the depth (Z) buffer is typically cleared to a value corresponding to infinity.
The time taken to clear down the buffers is often a significant portion of the total time taken to draw a frame, so it is important to minimize it.
A recurrent problem with texture mapping is the amount of data each texture map contains. If it is of high quality and detail it may require a substantial amount of storage space. The size of texture maps may be increased if mipmap filtering is supported. Simply moving textures from one physical storage location to another may be a time consuming operation. In a normal graphics system the time taken to transfer a texture from disk or system memory to the graphics system may be significantly more than the time taken to apply the texture. Network applications, in which the application and graphics system are on separate machines linked by a low bandwidth connection, aggravate this problem. Improvements can be made by caching the texture locally in the graphics system, but the time taken to transfer it just once may be prohibitive.
Caching would be particularly desirable for texture management in 3D graphics. The desirability of some form of texture caching is easily demonstrated by a simple calculation. If the target performance is to do trilinear filtering in a single cycle, then 8 texels per output fragment are required. If each texel is in true color (i.e. 32 bits per pixel), then the texture read bandwidth is 32 bytes per cycle, or (assuming a 100 MHz bus) 3.2 GB/s. With clever cache design this can be reduced to 1.25 texels read per pixel (assuming the texture maps are very much larger than will fit into the cache), i.e. 500 MB/s. (Note that the trivial case where the texture maps fit into cache and are already loaded is an easy one to solve, but isn't useful in real-world scenarios.) Caching texture maps is not a new idea of itself, but previous implementations leave room for improvement.
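The bandwidth figures above multiply out as follows; this is just the text's own arithmetic restated, with the 100 MHz clock taken as an assumption.

```python
# Trilinear filtering at one fragment per cycle:
texels_per_fragment = 8      # 2 mipmap levels x 4 texels each
bytes_per_texel = 4          # 32-bit true color
clock_hz = 100e6             # assumed 100 MHz bus

# Uncached: every texel is fetched from memory every cycle.
uncached_bandwidth = texels_per_fragment * bytes_per_texel * clock_hz
# -> 3.2e9 bytes/s, i.e. 3.2 GB/s

# With an effective cache, ~1.25 texels actually read per pixel.
cached_bandwidth = 1.25 * bytes_per_texel * clock_hz
# -> 5.0e8 bytes/s, i.e. 500 MB/s
```

The roughly 6.4x reduction is what makes single-cycle trilinear filtering plausible on a realistic memory system.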
Cache updating during a graphics operation is normally performed with an updating rule, such as first-in-first-out (FIFO), which disfavors the oldest data in the cache. However, the present inventors have realized that this is not optimal when the end of a scan line is reached. Instead, the updating rules are changed, when the end of a scan line is reached, to favor retention of the oldest data in the cache. This helps to increase the likelihood that some old data in the cache may still be useful in rendering the next scan line. An important consequence of this is that the fraction of cache misses (and hence performance) will drop off gradually, rather than precipitously, for line lengths which exceed the cache size.
Notable (and separately innovative) features of the texture caching architecture described in the present application include at least the following: expedited loading of texel data (preloading, not just prefetching); an improved definition of keys (rather than addresses) for cache lookup; and an innovative cache replacement policy.
The cache replacement policy comes into effect when a cache lookup has failed and new data needs to be loaded. Initially the new data can be loaded anywhere in the cache, but very soon all the free locations in the cache will have been used up, and something must be deleted from the cache to make room. If the cache is direct mapped, there may be no choice; if the cache is set associative, there is only a very limited choice. The disclosed cache is fully associative, so any patch of texture data can be written into any cache line.
The computer literature describes many cache replacement policies, including e.g. Least Recently Used, First In First Out (Oldest), Least Frequently Used and Random. In a CPU, the memory accesses can be considered random (when viewed across the full range of software which can be run), but localized so the best policy is generally least-recently-used (LRU). Also the size of the cache is usually large and is set associative or direct mapped.
In 3D graphics the memory access patterns are much more predictable. Consider rasterizing a triangle. As we move along a scan line we will trace out a monotonic curve through the texture map (it will only fold back on itself if some form of wrapping is used). The next scan line will trace out a new monotonic curve, but this will be close to and approximately parallel to the first curve. This should demonstrate that if the cache is big enough then the FIFO (or oldest) replacement policy works best (in fact the LRU policy is equivalent in this case). Needless to say the oldest policy is very cheap to implement. The other point hidden here is that, in general, we are not trying to cache the whole texture map because the size of cache we can afford is too small. What we are trying to capitalise on is the coherency in the texture map from one scan line to the next.
This works well provided the cache is big enough to hold a scan line's worth of data. Consider the (simple) case where the number of texels required for the scan line is less than or equal to what will fit into the cache. On the next scan line the texels needed are present in the cache, so we have the maximum hit rate we can expect. If the number of texels for a scan line increases by one, so that they no longer fit into the cache, then with the Oldest replacement policy the last pixel on a scan line will replace the oldest texel data, which corresponds to the texel data for the first pixel on the next scan line. When we get to the first pixel on the next scan line we get a cache miss, and it will replace the texel data for the next pixel on the scan line. From then on we get a continuous stream of misses and have lost all scan-line coherency in the texture map. This is an abrupt transition from having very good cache use to very poor cache use. In performance terms, this change is like falling off a cliff. A more gradual degradation in performance is preferable.
This is achieved, in the presently preferred embodiment, by modifying the replacement policy to replace the oldest data (called KeepOldest) except when this results in replacing the texels at the start of the scan line we are working on. Basically we fill up the cache as we are working along a scan line, but at the point when we would wrap around in the cache, rather than going back to the start again, we loop around in the last few entries in the cache. This can reduce our performance on these "overflow" pixels (the maximum we can cope with without stalling is determined by the size of the loop region), but does preserve the cache coherency for many pixels at the start of the next scan line. The number of entries to loop over is programmable, but set to a minimum of 8. With this scheme the performance will degrade slowly once the cache has overflowed.
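The effect can be illustrated with a toy simulation. The sketch below is not the patent's hardware design: it assumes one texel per pixel, a fully associative cache keyed by texel number, and exact texel reuse between adjacent scan lines, and it contrasts plain oldest-first (FIFO) replacement with a KeepOldest-style policy that, once the cache overflows, recycles only the last few entries instead of wrapping back to the start.

```python
# Count cache misses while rasterizing num_lines identical scan lines
# of line_texels texels each, with a cache of cache_size entries.
# loop_region=0 gives plain FIFO; loop_region=N gives the
# KeepOldest-style policy, evicting only among the N newest entries.
def misses(line_texels, cache_size, loop_region=0, num_lines=2):
    cache, order, total = set(), [], 0
    for _ in range(num_lines):
        for texel in range(line_texels):
            if texel in cache:
                continue                 # cache hit
            total += 1                   # cache miss: must load and evict
            if len(cache) >= cache_size:
                if loop_region:
                    # KeepOldest: recycle within the newest few entries,
                    # preserving the start-of-line texels for reuse.
                    victim = order.pop(len(order) - loop_region)
                else:
                    victim = order.pop(0)  # plain FIFO: evict the oldest
                cache.remove(victim)
            cache.add(texel)
            order.append(texel)
    return total

# A 33-texel scan line against a 32-entry cache, over two scan lines:
fifo = misses(33, 32)                  # FIFO loses all reuse on line 2
keep = misses(33, 32, loop_region=8)   # only the overflow region misses
```

With FIFO, the single overflowing texel evicts exactly the data the next scan line needs first, so every texel of the second line misses (66 misses in all); with the loop-region policy, the first 24 texels of the second line still hit, and only the overflow pixels miss. This is the gradual, rather than cliff-edge, degradation described above.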