This invention relates generally to computer graphics systems. More particularly, the invention relates to methods and apparatus for enhancing frame buffer memory performance.
Frame buffer memories and the bandwidth problem. A frame buffer memory is typically used in a computer graphics system to store all of the color information necessary to control the appearance of each pixel on a display device. Color information is usually stored in terms of RGBA components (a red intensity component, a green intensity component, a blue intensity component, and an “alpha” transparency value). In addition, the frame buffer memory is often used to store non-color information that is accessed during the rendering and modification of images. For example, “Z” or “depth” values may be stored in the frame buffer memory to represent the distance of pixels from the viewpoint, and stencil values may be stored in the frame buffer memory to restrict drawing to certain areas of the screen. In operation, upstream graphics hardware issues a stream of read and write commands with accompanying addresses directed to the frame buffer memory. In turn, a frame buffer memory controller receives the command stream and responds to each command by operating the memory devices that make up the frame buffer memory itself. Depending on the rendering modes enabled at any given time, a single frame buffer memory access command issued by upstream hardware may result in numerous accesses to the frame buffer memory by the frame buffer memory controller. For further background regarding frame buffer memories and their uses, see James D. Foley et al., Computer Graphics: Principles and Practice chapter 18 (2d ed., Addison-Wesley 1990) and Mason Woo et al., OpenGL Programming Guide chapter 10 (2d ed., Addison-Wesley 1997).
Over time, the resolution capabilities of display devices have increased, and consequently so has the amount of information (both color and non-color) that must be stored in the frame buffer memory. In addition, refresh cycles of display devices have become shorter. The result has been that access rates for modern frame buffer memories have become extremely high. Due to cost, the vast majority of frame buffer memories are constructed using dynamic random access memories (“DRAMs”) instead of static random access memories (“SRAMs”) or specially-ported video random access memories (“VRAMs”). Unfortunately, DRAMs present certain performance problems related to, for example, the need to activate and deactivate pages, and the need to refresh storage locations regularly. Although DRAM memory device clock frequencies have increased over time, their latency characteristics have not improved so dramatically. Thus, numerous techniques have been proposed to increase DRAM frame buffer memory bandwidth.
Memory devices: banks, bursts, SDR and DDR. One technique that has been employed to increase DRAM frame buffer memory bandwidth has been to divide the memory devices internally into independently-operating banks, each bank having its own set of row (page) and column addresses. The use of independent banks improves memory bandwidth because, to the extent bank accesses can be interleaved with proper memory mapping, a row in one bank can be activated or precharged while a row in a different bank is being accessed. When this is possible, the wait time required for row activation and precharge may be concealed so that it does not negatively impact memory bandwidth.
Another technique has been to employ memory devices that support burst cycles. In a burst memory cycle, multiple words of data (each corresponding to a different but sequential address) are transferred into or out of the memory even though only a single address was specified at the beginning of the burst. The memory device itself increments or decrements the addresses appropriately during the burst based on the initially specified address. Burst operation increases memory bandwidth because it creates “free” command cycles during the burst that otherwise would have been occupied by the specification of sequential addresses. The free command cycles so created may be used, for example, to precharge and activate rows in other banks in preparation for future memory accesses.
In a single-data-rate (“SDR”) memory device, data may be transferred only once per clock cycle. A double-data-rate (“DDR”) memory device, on the other hand, is capable of transferring data on both phases of the clock. Both SDR and DDR devices are capable of burst-mode memory accesses. For SDR devices, the minimum burst length that can create a free command cycle is two consecutive words (column addresses). The absolute minimum burst length for SDR devices is one word (column address). An example of an SDR device is the NEC uPD4564323 synchronous DRAM, which is capable of storing 64 Mbits organized as 524,288 words × 32 bits × 4 banks. For DDR devices, the minimum burst length that can create a free command cycle is four consecutive words (column addresses). The absolute minimum burst length for DDR devices is two consecutive words (column addresses). An example of a DDR device is the SAMSUNG KM416H430T hyper synchronous DRAM, which is capable of storing 64 Mbits organized as 1,048,576 words × 16 bits × 4 banks.
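The burst-length figures above follow from simple arithmetic, which the sketch below illustrates. It assumes a simplified model in which each burst consumes exactly one command cycle for its address and the remaining clock cycles of the data transfer are free; the function and constant names are hypothetical and real devices impose further timing constraints.

```python
SDR_WORDS_PER_CLOCK = 1  # one transfer per clock cycle
DDR_WORDS_PER_CLOCK = 2  # one transfer on each phase of the clock


def free_command_cycles(burst_length, words_per_clock):
    """Return the free command cycles created by one burst (simplified model)."""
    # Clock cycles during which the data bus is busy (ceiling division).
    data_clocks = -(-burst_length // words_per_clock)
    # One of those cycles carries the command and address for the burst itself.
    return max(data_clocks - 1, 0)


for name, rate in (("SDR", SDR_WORDS_PER_CLOCK), ("DDR", DDR_WORDS_PER_CLOCK)):
    for burst in (1, 2, 4):
        print(name, "burst", burst, "free cycles", free_command_cycles(burst, rate))
```

Under this model an SDR burst of two words and a DDR burst of four words are the shortest bursts that leave a command cycle free, matching the minimums stated above.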
The problem of column coherency in a graphics command stream. In order to capitalize on the burst-mode capabilities of frame buffer memory devices, prior art graphics systems depended on the natural occurrence of sequential column addresses in the various streams of read and write commands issued by upstream hardware. For example, with coherent triangle rendering and appropriate mapping of x,y screen space to RAM address space, many pairs of sequential column addresses could be made to occur naturally in the stream of pixel commands requested by a rasterizer. Indeed, such a solution worked adequately in times when DDR memory devices were not available.
Now, however, DDR memory devices are often used to construct the frame buffer memory. For prior art systems to capitalize on the burst-mode capabilities of a DDR device, a substantial number of quadruplets of sequential column addresses would have to occur naturally in the command stream; but the natural production of a substantial number of quadruplets of sequential column addresses is difficult if not impossible to achieve with mere memory mapping. This is especially true now that graphics applications are capable of drawing smaller triangles (having fewer pixels per triangle) than did the applications of the past.
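The difficulty can be illustrated by counting how many usable bursts occur naturally in a stream of column addresses. The following is a minimal sketch, assuming incrementing addresses only, no burst alignment requirement, and a made-up address stream; `count_natural_bursts` is a hypothetical helper, not part of any described system.

```python
def count_natural_bursts(columns, burst_len):
    """Count non-overlapping runs of burst_len sequential column addresses."""
    count = 0
    i = 0
    while i + burst_len <= len(columns):
        window = columns[i:i + burst_len]
        if all(window[k] == window[0] + k for k in range(burst_len)):
            count += 1
            i += burst_len  # consume the whole run as one burst
        else:
            i += 1          # no run starts here; try the next command
    return count


# Hypothetical column addresses as they might arrive from a rasterizer.
stream = [8, 9, 20, 21, 22, 23, 40]
pairs = count_natural_bursts(stream, 2)        # bursts usable by an SDR device
quadruplets = count_natural_bursts(stream, 4)  # bursts usable by a DDR device
print(pairs, quadruplets)
```

For this stream the sketch finds three sequential pairs but only one sequential quadruplet, which is the gap the paragraph above describes.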
The problem of page coherency in a graphics command stream. Changing from one row to another row in the same bank of a memory device (also known as a same-bank page change) requires wait time for closing the previous page and activating the new page. Prior art graphics systems employed two techniques in attempting to avoid this performance penalty. First, the mapping of x,y screen space to RAM address space was constructed so as to make same-bank page changes occur as infrequently as possible. Second, memory access commands were sorted into FIFO buffers according to bank. Specifically, two FIFOs per memory device bank were employed so that access commands directed to the same bank of a memory device could be further sorted according to page. Of course, if only two FIFOs per bank are employed in this manner, then grouping is only possible for up to two different pages within a single bank. If a memory access command appeared in the command stream directed to a third page within the bank, then one of the FIFOs would have to be flushed. Adding more FIFOs per bank in such a system might provide added efficiency because it would allow page-wise grouping for more than two of the bank's pages at one time. On the other hand, such a solution would be expensive because of the number of FIFOs required to implement it, particularly in the case of the newer 4-bank memory devices. Moreover, the solution would be wasteful because the FIFOs so provided would rarely all be full at the same time.
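The two-FIFOs-per-bank scheme can be sketched as follows. This is an illustrative model only: the class name, the `push` interface, and the choice to flush the FIFO whose page was opened first are assumptions, since the text does not say which FIFO the prior art systems flushed.

```python
from collections import OrderedDict, deque


class BankPageSorter:
    """Sorts access commands into a fixed number of per-page FIFOs per bank."""

    def __init__(self, num_banks, fifos_per_bank=2):
        self.fifos_per_bank = fifos_per_bank
        # One ordered page -> FIFO mapping per bank, oldest page first.
        self.banks = [OrderedDict() for _ in range(num_banks)]

    def push(self, bank, page, command):
        """Queue a command; return any commands flushed to make room."""
        fifos = self.banks[bank]
        flushed = []
        if page not in fifos and len(fifos) == self.fifos_per_bank:
            # A new page arrived while every FIFO of this bank is in use:
            # flush the FIFO whose page was opened first (assumed policy).
            _, oldest = fifos.popitem(last=False)
            flushed = list(oldest)
        fifos.setdefault(page, deque()).append(command)
        return flushed


sorter = BankPageSorter(num_banks=4)
sorter.push(0, 5, "a")
sorter.push(0, 7, "b")
sorter.push(0, 5, "c")
print(sorter.push(0, 9, "d"))  # a third page in bank 0 forces a flush
```

Raising `fifos_per_bank` in this model delays the forced flush, at the hardware cost the paragraph above points out.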
Batching and the problem of pixel collisions. Changing from read mode to write mode presents another kind of memory performance penalty because it requires memory dead cycles. In part for this reason, prior art graphics systems have attempted to group as many read operations together as possible before transitioning to write operations, rather than to freely interleave writes with reads when it is not necessary to do so. Such a grouping of memory access commands together is known as “batching.” As alluded to above, in certain rendering modes one frame buffer memory access command issued by upstream hardware may result in numerous frame buffer accesses by the frame buffer controller. For example, in image read-modify-write mode with z test enabled, one frame buffer memory write command may result in four frame buffer accesses: a z buffer read, a z buffer write, an image buffer read, and an image buffer write. Thus, prior art systems have also attempted to batch as many z reads together as possible, as many z writes together as possible, as many image reads together as possible, and as many image writes together as possible.
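The benefit of batching can be seen by counting read/write direction changes for the four-access case just described. The sketch below is a simplified model with hypothetical names and pixel addresses; it counts only changes between read and write, not changes from one buffer to another.

```python
# The four accesses generated by one write command in image read-modify-write
# mode with z test enabled, in the order given in the text.
PHASES = (("z", "read"), ("z", "write"), ("image", "read"), ("image", "write"))


def interleaved(addresses):
    """All four accesses for one pixel, then all four for the next pixel."""
    return [(buf, op, a) for a in addresses for buf, op in PHASES]


def batched(addresses):
    """All z reads, then all z writes, then all image reads, then image writes."""
    return [(buf, op, a) for buf, op in PHASES for a in addresses]


def direction_changes(accesses):
    """Count read-to-write and write-to-read transitions in an access list."""
    return sum(1 for prev, cur in zip(accesses, accesses[1:]) if prev[1] != cur[1])


pixels = [100, 101, 102]  # hypothetical pixel addresses
print(direction_changes(interleaved(pixels)), direction_changes(batched(pixels)))
```

For three pixels the interleaved ordering changes direction eleven times while the batched ordering changes direction three times, and the batched count stays at three however many pixels the batch holds.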
Such prior art batching systems yielded memory bandwidth efficiencies to the extent that they decreased the frequency of read-to-write transitions and changes from one buffer to another. However, they suffered from at least the following limitation: accesses to the same pixel location had to be placed in separate batches; otherwise the result would be a “pixel collision.” This meant that, depending on the vagaries of the command stream, a developing batch might have to be cut short simply because a second access to the same pixel location occurred within relatively few commands of the first access to that pixel location. The result was a decreased average batch size. This problem is even greater in modern graphics systems because modern applications utilize greater depth complexity. Thus, pixel collisions occur more frequently than in the past.
A need therefore exists for a technique that increases average batch size relative to that of prior art systems even when occasional pixel collisions are presented by the graphics command stream.
In one aspect, the invention includes a method and apparatus for conserving frame buffer memory bandwidth by performing a depth comparison between colliding pixels at batch building time. If the incoming pixel fails the depth comparison, then it may be “tossed” and excluded from any batches currently under construction. Importantly, the batch building process may then continue without the need for a batch flush responsive to the occurrence of the pixel collision. If the incoming pixel passes the depth comparison, then it may yet be possible to avoid flushing: The current rendering mode of the pipeline is determined. If the current rendering mode does not require read-modify-write operations, then the incoming pixel may be merged with the buffered pixel with which it collides. Merger of the two pixels may be accomplished by overwriting the buffered RGBA pixel components with those of the incoming pixel, but only those components corresponding to asserted bits in the incoming pixel's BEN. The buffered BEN may be replaced with the logical OR of the stored BEN and the incoming pixel's BEN.
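The decision sequence described above may be sketched in software as follows. Several details are assumptions made for illustration and are not stated in the text: BEN is modeled as a four-bit mask with one enable bit per RGBA component, the depth comparison is "nearer wins" (smaller z passes), a flush is assumed when read-modify-write is required, and the buffered depth is assumed to be replaced by the incoming depth on a merge. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pixel:
    z: int      # depth value
    rgba: list  # four color components: [R, G, B, A]
    ben: int    # assumed 4-bit enable mask, bit i enables rgba[i]


def resolve_collision(buffered, incoming, needs_read_modify_write):
    """Return 'tossed', 'flush', or 'merged' for a colliding incoming pixel."""
    # Depth comparison at batch building time (assumed: smaller z passes).
    if not incoming.z < buffered.z:
        return "tossed"  # the batch continues without the incoming pixel
    if needs_read_modify_write:
        return "flush"   # assumed: merging is not possible in this mode
    # Overwrite only the components enabled in the incoming pixel's BEN.
    for i in range(4):
        if incoming.ben & (1 << i):
            buffered.rgba[i] = incoming.rgba[i]
    # Buffered BEN becomes the logical OR of the stored and incoming BENs.
    buffered.ben |= incoming.ben
    buffered.z = incoming.z  # assumption: keep the depth of the passing pixel
    return "merged"


stored = Pixel(z=10, rgba=[1, 2, 3, 4], ben=0b0011)
print(resolve_collision(stored, Pixel(z=5, rgba=[9, 9, 9, 9], ben=0b0110), False))
print(stored)
```

In this sketch only the "flush" outcome ends the batch; a tossed or merged pixel leaves the batch under construction intact.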
In another aspect, the batch building buffer is designed so that the locations at which received pixel commands are stored are determined at least in part responsive to the addresses specified by the received pixel commands. Each location of the buffer containing a stored representation of a pixel has a unique valid bit associated with it. The detection of pixel collisions may be accomplished by using the address of the incoming pixel command to map it to a target buffer location, and then testing the valid bit associated with the target buffer location. If the target valid bit is asserted already, then a pixel collision is indicated.
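A minimal software model of such an address-mapped buffer is sketched below. The mapping used here (address modulo buffer size) is an assumption chosen for simplicity, and it presumes that all addresses in one batch fall within a single aligned window of `size` addresses so that the low-order bits identify a pixel uniquely. The class and method names are hypothetical.

```python
class BatchBuffer:
    """Batch building buffer whose slots are selected by pixel address."""

    def __init__(self, size):
        self.size = size
        self.valid = [False] * size  # one valid bit per buffer location
        self.slots = [None] * size

    def insert(self, address, pixel):
        """Store a pixel; return False if its target location is occupied."""
        index = address % self.size  # assumed address-to-location mapping
        if self.valid[index]:
            return False             # valid bit already asserted: collision
        self.valid[index] = True
        self.slots[index] = pixel
        return True

    def flush(self):
        """Emit the batch in slot order and clear every valid bit."""
        batch = [p for v, p in zip(self.valid, self.slots) if v]
        self.valid = [False] * self.size
        self.slots = [None] * self.size
        return batch


buf = BatchBuffer(size=8)
buf.insert(3, "p0")
buf.insert(5, "p1")
print(buf.insert(3, "p2"))  # same address as p0, so a collision is indicated
```

Because the collision check is a single valid-bit test at the mapped location, no search through the buffered commands is needed in this model.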
Whereas prior art batch building systems cut batches short and flushed upon the occurrence of any pixel collision, the method and apparatus of the invention are often able to continue building a batch even after the occurrence of a pixel collision. The result is that the average batch size is increased, and the total number of batches is decreased. Thus, frame buffer memory bandwidth is conserved.