This invention relates generally to computer memory systems. More particularly, the invention relates to methods and apparatus for enhancing memory access performance. The invention has particularly beneficial application with regard to frame buffer memories in computer graphics systems.
Frame buffer memories and the bandwidth problem. A frame buffer memory is typically used in a computer graphics system to store all of the color information necessary to control the appearance of each pixel on a display device. Color information is usually stored in terms of RGBA components (a red intensity component, a green intensity component, a blue intensity component, and an "alpha" transparency value). In addition, the frame buffer memory is often used to store non-color information that is accessed during the rendering and modification of images. For example, "Z" or "depth" values may be stored in the frame buffer memory to represent the distance of pixels from the viewpoint, and stencil values may be stored in the frame buffer memory to restrict drawing to certain areas of the screen. In operation, upstream graphics hardware issues a stream of read and write commands with accompanying addresses directed to the frame buffer memory. In turn, a frame buffer memory controller receives the command stream and responds to each command by operating the memory devices that make up the frame buffer memory itself. Depending on the rendering modes enabled at any given time, a single frame buffer memory access command issued by upstream hardware may result in numerous accesses to the frame buffer memory by the frame buffer memory controller. For further background regarding frame buffer memories and their uses, see James D. Foley et al., Computer Graphics: Principles and Practice chapter 18 (2d ed., Addison-Wesley 1990) and Mason Woo et al., OpenGL Programming Guide chapter 10 (2d ed., Addison-Wesley 1997).
Over time, the resolution capabilities of display devices have increased, and consequently so has the amount of information (both color and non-color) that must be stored in the frame buffer memory. In addition, refresh cycles of display devices have become shorter. The result has been that access rates for modern frame buffer memories have become extremely high. Due to cost, the vast majority of frame buffer memories are constructed using dynamic random access memories ("DRAMs") instead of static random access memories ("SRAMs") or specially-ported video random access memories ("VRAMs"). Unfortunately, DRAMs present certain performance problems related to, for example, the need to activate and deactivate pages, and the need to refresh storage locations regularly. Although DRAM memory device clock frequencies have increased over time, their latency characteristics have not improved so dramatically. Thus, numerous techniques have been proposed to increase DRAM frame buffer memory bandwidth.
Memory devices: banks, bursts, SDR and DDR. One technique that has been employed to increase DRAM frame buffer memory bandwidth has been to divide the memory devices internally into independently-operating banks, each bank having its own set of row (page) and column addresses. The use of independent banks improves memory bandwidth because, to the extent bank accesses can be interleaved with proper memory mapping, a row in one bank can be activated or precharged while a row in a different bank is being accessed. When this is possible, the wait time required for row activation and precharge may be concealed so that it does not negatively impact memory bandwidth.
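The benefit of bank interleaving described above can be illustrated with a deliberately crude timing model. The sketch below is not taken from any device data sheet: the function name `stall_cycles`, the penalty of 6 cycles, and the rule that a row change is completely hidden whenever the preceding access went to a different bank are all assumptions made for illustration.

```python
def stall_cycles(accesses, row_change_penalty=6):
    """Count wait cycles for a sequence of (bank, row) accesses.

    Assumed model: opening a new row costs `row_change_penalty` cycles,
    but the cost is fully hidden when the previous access used a
    different bank, because the row can be prepared in the background.
    """
    open_row = {}          # bank -> row currently open in that bank
    previous_bank = None
    stalls = 0
    for bank, row in accesses:
        if open_row.get(bank) != row:
            # A new row is needed in this bank. With no earlier access to a
            # different bank to overlap with, the wait is exposed.
            if previous_bank is None or previous_bank == bank:
                stalls += row_change_penalty
            open_row[bank] = row
        previous_bank = bank
    return stalls


same_bank = [(0, 1), (0, 2), (0, 3), (0, 4)]
interleaved = [(0, 1), (1, 1), (0, 2), (1, 2)]
```

Under these assumptions the same-bank sequence exposes every row change, while the interleaved sequence exposes only the first one.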
Another technique has been to employ memory devices that support burst cycles. In a burst memory cycle, multiple words of data (each corresponding to a different but sequential address) are transferred into or out of the memory even though only a single address was specified at the beginning of the burst. The memory device itself increments or decrements the addresses appropriately during the burst based on the initially specified address. Burst operation increases memory bandwidth because it creates "free" command cycles during the burst that otherwise would have been occupied by the specification of sequential addresses. The free command cycles so created may be used, for example, to precharge and activate rows in other banks in preparation for future memory accesses.
In a single-data-rate ("SDR") memory device, data may be transferred only once per clock cycle. A double-data-rate ("DDR") memory device, on the other hand, is capable of transferring data on both phases of the clock. Both SDR and DDR devices are capable of burst-mode memory accesses. For SDR devices, the minimum burst length that can create a free command cycle is two consecutive words (column addresses). The absolute minimum burst length for SDR devices is one word (column address). An example of an SDR device is the NEC uPD4564323 synchronous DRAM, which is capable of storing 64 Mbits organized as 524,288 words × 32 bits × 4 banks. For double-data-rate devices, the minimum burst length that can create a free command cycle is four consecutive words (column addresses). The absolute minimum burst length for DDR devices is two consecutive words (column addresses). An example of a DDR device is the SAMSUNG KM416H430T hyper synchronous DRAM, which is capable of storing 64 Mbits organized as 1,048,576 words × 16 bits × 4 banks.
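The burst-length figures above follow from simple arithmetic, which may be sketched as follows. The model assumed here is that a burst occupies the data bus for (burst length ÷ words per clock) clock cycles and that exactly one of those cycles is needed to issue the address. The function names are illustrative only.

```python
SDR_WORDS_PER_CLOCK = 1   # one transfer per clock cycle
DDR_WORDS_PER_CLOCK = 2   # one transfer on each phase of the clock


def free_command_cycles(burst_length, words_per_clock):
    """Command cycles left unused during one burst (assumed model)."""
    if burst_length % words_per_clock != 0:
        raise ValueError("burst length must be a multiple of words per clock")
    data_clocks = burst_length // words_per_clock
    return data_clocks - 1   # one cycle is spent issuing the address


def capacity_bits(words_per_bank, bits_per_word, banks):
    """Total device capacity in bits."""
    return words_per_bank * bits_per_word * banks


MBIT = 1 << 20
```

Under this model a burst of two words frees one command cycle on an SDR device but none on a DDR device, which needs a burst of four, and both example organizations work out to 64 Mbits.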
The problem of column coherency in a graphics command stream. In order to capitalize on the burst-mode capabilities of frame buffer memory devices, prior art graphics systems depended on the natural occurrence of sequential column addresses in the various streams of read and write commands issued by upstream hardware. For example, with coherent triangle rendering and appropriate mapping of x,y screen space to RAM address space, many pairs of sequential column addresses could be made to occur naturally in the stream of pixel commands requested by a rasterizer. Indeed, such a solution worked adequately in times when DDR memory devices were not available.
Now, however, DDR memory devices are often used to construct the frame buffer memory. For prior art systems to capitalize on the burst-mode capabilities of a DDR device, a substantial number of quadruplets of sequential column addresses would have to occur naturally in the command stream; but the natural production of a substantial number of quadruplets of sequential column addresses is difficult if not impossible to achieve with mere memory mapping. This is especially true now that graphics applications are capable of drawing smaller triangles (having fewer pixels per triangle) than did the applications of the past.
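The difficulty can be made concrete by counting how many bursts occur naturally in a command stream. The following sketch counts non-overlapping runs of sequential column addresses of a given length. The function name and the sample stream are hypothetical, and the sketch ignores row/bank addresses and any burst-alignment requirements a real device might impose.

```python
def count_natural_bursts(columns, burst_length):
    """Count non-overlapping runs of `burst_length` ascending, consecutive
    column addresses, scanning the stream in order."""
    count = 0
    i = 0
    while i + burst_length <= len(columns):
        window = columns[i:i + burst_length]
        is_sequential = all(
            window[k + 1] == window[k] + 1 for k in range(burst_length - 1)
        )
        if is_sequential:
            count += 1
            i += burst_length   # consume the whole run
        else:
            i += 1
    return count


# Hypothetical stream: mostly pairs, as small triangles might produce.
stream = [8, 9, 20, 21, 40, 41, 42, 43, 60, 61]
pairs = count_natural_bursts(stream, 2)
quadruplets = count_natural_bursts(stream, 4)
```

For this sample stream, pairs are plentiful while quadruplets are scarce, which is the situation the paragraph above describes for DDR devices.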
The problem of page coherency in a graphics command stream. Changing from one row to another row in the same bank of a memory device (also known as a same-bank page change) requires wait time for closing the previous page and activating the new page. Prior art graphics systems employed two techniques in attempting to avoid this performance penalty. First, the mapping of x,y screen space to RAM address space was constructed so as to make same-bank page changes occur as infrequently as possible. Second, memory access commands were sorted into FIFO buffers according to bank. Specifically, two FIFOs per memory device bank were employed so that access commands directed to the same bank of a memory device could be further sorted according to page. Of course, if only two FIFOs per bank are employed in this manner, then grouping is only possible for up to two different pages within a single bank. If a memory access command appeared in the command stream directed to a third page within the bank, then one of the FIFOs would have to be flushed. Adding more FIFOs per bank in such a system might provide added efficiency because it would allow page-wise grouping for more than two of the bank's pages at one time. On the other hand, such a solution would be expensive because of the number of FIFOs required to implement it, particularly in the case of the newer 4-bank memory devices. Moreover, the solution would be wasteful because the FIFOs so provided would rarely all be full at the same time.
A need therefore exists for a technique for sorting memory access commands from a graphics command stream by row and bank without a proliferation of FIFOs.
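The limitation of the fixed-FIFO arrangement may be sketched as follows. The class name `PerBankFifoSorter` is hypothetical, and the choice of which FIFO to flush when a third page arrives (here, the one allocated earliest) is an assumption, since the text above does not specify it.

```python
from collections import deque


class PerBankFifoSorter:
    """Toy model of sorting commands into a fixed number of FIFOs per bank."""

    def __init__(self, fifos_per_bank=2):
        self.fifos_per_bank = fifos_per_bank
        self.banks = {}     # bank -> {row: deque of column addresses}
        self.flushed = []   # forced flushes, as (bank, row, [columns])

    def push(self, bank, row, column):
        pages = self.banks.setdefault(bank, {})
        if row not in pages and len(pages) == self.fifos_per_bank:
            # A new page arrived and every FIFO for this bank is taken.
            # Assumption: evict the FIFO that was allocated first
            # (dicts preserve insertion order).
            victim = next(iter(pages))
            self.flushed.append((bank, victim, list(pages.pop(victim))))
        pages.setdefault(row, deque()).append(column)


sorter = PerBankFifoSorter(fifos_per_bank=2)
for command in [(0, 1, 0), (0, 2, 0), (0, 1, 1), (0, 3, 0)]:
    sorter.push(*command)
```

In this run the access to a third row of bank 0 forces out the two-entry FIFO for row 1 regardless of whether the FIFOs belonging to other banks are empty, which illustrates why fixed per-bank FIFOs use storage inefficiently.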
Batching and the problem of pixel collisions. Changing from read mode to write mode presents another kind of memory performance penalty because it requires memory dead cycles. In part for this reason, prior art graphics systems have attempted to group as many read operations together as possible before transitioning to write operations, rather than freely interleaving writes with reads when it is not necessary to do so. Such a grouping of memory access commands together is known as "batching." As alluded to above, in certain rendering modes one frame buffer memory access command issued by upstream hardware may result in numerous frame buffer accesses by the frame buffer controller. For example, in image read-modify-write mode with z test enabled, one frame buffer memory write command may result in four frame buffer accesses: a z buffer read, a z buffer write, an image buffer read, and an image buffer write. Thus, prior art systems have also attempted to batch as many z reads together as possible, as many z writes together as possible, as many image reads together as possible, and as many image writes together as possible.
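The effect of batching on the number of transitions can be sketched with a simple count. In the sketch below, each write command expands into the four accesses named above, and a "transition" is any change of access kind between consecutive accesses, which lumps together read-to-write turnarounds and buffer changes. The function names are illustrative.

```python
ACCESS_KINDS = ("z_read", "z_write", "image_read", "image_write")


def unbatched(pixels):
    """All four accesses for one pixel before moving to the next pixel."""
    return [(kind, pixel) for pixel in pixels for kind in ACCESS_KINDS]


def batched(pixels):
    """All accesses of one kind for every pixel before the next kind."""
    return [(kind, pixel) for kind in ACCESS_KINDS for pixel in pixels]


def count_transitions(accesses):
    """Number of places where the access kind changes."""
    return sum(1 for a, b in zip(accesses, accesses[1:]) if a[0] != b[0])


pixels = ["p0", "p1", "p2", "p3"]
```

For four pixels the unbatched order changes access kind at every step, whereas the batched order changes kind only three times, once between each pair of consecutive batches.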
Such prior art batching systems yielded memory bandwidth efficiencies to the extent that they decreased the frequency of read-to-write transitions and changes from one buffer to another. However, they suffered from at least the following limitation: accesses to the same pixel location had to be placed in separate batches; otherwise the result would be a "pixel collision." This meant that, depending on the vagaries of the command stream, a developing batch might have to be cut short simply because a second access to the same pixel location occurred within relatively few commands of the first access to that pixel location. The result was a decreased average batch size. This problem is even greater in modern graphics systems because modern applications utilize greater depth complexity. Thus, pixel collisions occur more frequently than in the past.
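The way a pixel collision cuts a developing batch short may be sketched as follows. The function name and the maximum batch size of 8 are assumptions made for illustration.

```python
def split_into_batches(pixel_addresses, max_batch=8):
    """Split a stream of pixel addresses into batches such that no batch
    contains two accesses to the same pixel location."""
    batches = []
    current = []
    seen = set()
    for address in pixel_addresses:
        if address in seen or len(current) == max_batch:
            # Pixel collision (or full batch): close the developing batch.
            batches.append(current)
            current, seen = [], set()
        current.append(address)
        seen.add(address)
    if current:
        batches.append(current)
    return batches


without_collision = split_into_batches([1, 2, 3, 4, 5, 6])
with_collision = split_into_batches([1, 2, 3, 2, 4, 5])
```

With the same number of commands, the stream that revisits pixel 2 yields two batches of three rather than one batch of six.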
In one aspect, a specially-designed buffer facilitates reordering of incoming memory access commands so that the memory access commands may be associated automatically according to their row/bank addresses. When the buffer is flushed, groups of commands are selected for flushing whose row/bank addresses are associated, thereby creating page coherency in the flushed pixel commands that was not present in the incoming command stream. The page coherency so created has the effect of increasing batch size.
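The page coherency created by reordering can be illustrated with a minimal software sketch. The sketch only groups commands by their row/bank address and counts page changes before and after. It is not a description of the buffer hardware, and the function names and sample stream are hypothetical.

```python
def group_by_page(commands):
    """Associate (bank, row, column) commands by their (bank, row) address."""
    groups = {}
    for bank, row, column in commands:
        groups.setdefault((bank, row), []).append(column)
    return groups


def flush_all(groups):
    """Emit the groups one after another as a flat command stream."""
    return [
        (bank, row, column)
        for (bank, row), columns in groups.items()
        for column in columns
    ]


def page_changes(commands):
    """Number of places where consecutive commands address different pages."""
    pages = [(bank, row) for bank, row, _ in commands]
    return sum(1 for a, b in zip(pages, pages[1:]) if a != b)


incoming = [(0, 5, 0), (1, 7, 0), (0, 5, 1), (1, 7, 1), (0, 5, 2)]
flushed = flush_all(group_by_page(incoming))
```

The incoming stream alternates between two pages on every command, while the flushed stream changes page only once.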
Implemented in a computer graphics system, the buffer may include a bus for receiving pixel commands from a pipeline, the pixel commands accompanied by pixel data, a pixel row/bank address and a pixel column address; a row/bank address storage array for storing the pixel row/bank address in a first row/bank address entry; a column address storage array for storing at least some of the MSBs of the pixel column address in a first line; a line-in-use bit for associating the first row/bank address entry with the first line of the column address storage array; and a multi-line pixel data storage array having a first line of pixel entry locations associated with the first line of the column address storage array. Importantly, the storage capacity in the buffer may be dynamically allocated among groups as needed on-the-fly. Thus, numerous small row/bank groups may be stored at one time, or a few large row/bank groups, or any combination in between. Thus, efficient use is made of the storage capacity of the buffer.
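One possible reading of this arrangement is modeled below in software. The array sizes, the number of column LSBs used to index within a line, the class and method names, and the interpretation of the line-in-use bits as one bit per line for each row/bank entry are all assumptions. The sketch also omits flushing and simply overwrites an entry if the same pixel location is stored twice.

```python
class ReorderBuffer:
    """Toy model of the storage arrays; lines are allocated to groups on demand."""

    def __init__(self, num_lines=4, num_entries=4, lsb_bits=2):
        self.lsb_bits = lsb_bits
        self.row_bank = [None] * num_entries          # row/bank address array
        self.line_in_use = [[False] * num_lines for _ in range(num_entries)]
        self.column_msbs = [None] * num_lines         # column address array
        self.pixel_data = [[None] * (1 << lsb_bits) for _ in range(num_lines)]

    def _free_line(self):
        """Return a line not claimed by any row/bank entry, or None."""
        for line in range(len(self.column_msbs)):
            if not any(bits[line] for bits in self.line_in_use):
                return line
        return None

    def lines_of(self, row_bank):
        """Lines currently associated with a row/bank group."""
        if row_bank not in self.row_bank:
            return []
        entry = self.row_bank.index(row_bank)
        return [i for i, used in enumerate(self.line_in_use[entry]) if used]

    def insert(self, row_bank, column, data):
        """Store one pixel command; return False if the buffer must be flushed."""
        msbs = column >> self.lsb_bits
        lsbs = column & ((1 << self.lsb_bits) - 1)
        if row_bank in self.row_bank:
            entry = self.row_bank.index(row_bank)
        elif None in self.row_bank:
            entry = self.row_bank.index(None)
        else:
            return False                              # no row/bank entry free
        line = next(
            (i for i in self.lines_of(row_bank) if self.column_msbs[i] == msbs),
            None,
        )
        if line is None:
            line = self._free_line()
            if line is None:
                return False                          # no storage line free
            self.row_bank[entry] = row_bank
            self.line_in_use[entry][line] = True
            self.column_msbs[line] = msbs
        self.pixel_data[line][lsbs] = data
        return True


buf = ReorderBuffer()
results = [
    buf.insert((0, 5), 0, "a"),
    buf.insert((0, 5), 1, "b"),   # same column MSBs: shares a line with "a"
    buf.insert((1, 7), 0, "c"),
    buf.insert((0, 5), 4, "d"),   # new column MSBs: the group takes a second line
    buf.insert((2, 1), 0, "e"),
    buf.insert((3, 3), 0, "f"),   # all four lines are taken
]
```

Because lines are claimed only when a group needs them, the same four lines could equally hold four single-line groups or one four-line group, which is the dynamic allocation property described above.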
In another aspect, batches of pixel commands may be flushed from the buffer according to a special sequence designed to minimize same-bank page changes in the frame buffer memory devices. Specifically, a group may be selected for flushing if its binary bank address is not equal to the binary bank address of the last-flushed group AND is not equal to the bit inverse of the binary bank address of the last-flushed group. Such a selection is especially beneficial for frame buffer memory mappings in which the z information for a given pixel is located in a bank whose binary address is equal to the bit inverse of the binary address of the bank containing the image information for that pixel.
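The selection rule stated above may be sketched as follows for a device with four banks (two bank address bits). The function name and the `bank_bits` parameter are illustrative.

```python
def eligible_banks(candidate_banks, last_flushed_bank, bank_bits=2):
    """Banks whose address is neither the last-flushed bank address nor
    its bit inverse."""
    mask = (1 << bank_bits) - 1
    inverse = ~last_flushed_bank & mask   # bit inverse within bank_bits
    return [
        bank
        for bank in candidate_banks
        if bank != last_flushed_bank and bank != inverse
    ]
```

For example, if the last-flushed group was in bank 01, then banks 01 and 10 are excluded and groups in banks 00 and 11 remain eligible. Under the memory mapping described above, bank 10 would hold the z information for pixels whose image information is in bank 01, so excluding it avoids returning to a bank that the previous flush is likely to have touched.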
In another aspect, good candidate groups for flushing from the buffer may be chosen according to special criteria based on the binary bank address for the group, the size of the group, and the age of the group. In addition, groups may be partially flushed. If so, a subsequent flush operation may resume flushing a partially-flushed group when to do so would be more beneficial than flushing a different group chosen solely based on its bank address.
In yet another aspect, the first and last pixel commands flushed in any batch are accompanied by flags indicating that they are the first and last pixel commands in the batch, respectively. The flags are used by downstream hardware to facilitate the process of activating and deactivating pages in frame buffer memory devices.
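The flagging of batch boundaries can be sketched in a few lines. The dictionary representation and the function name are assumptions chosen for readability. In hardware the flags would presumably be carried as sideband bits alongside each command.

```python
def tag_batch(commands):
    """Attach first/last flags to each pixel command in a flushed batch."""
    last_index = len(commands) - 1
    return [
        {"command": command, "first": i == 0, "last": i == last_index}
        for i, command in enumerate(commands)
    ]


tagged = tag_batch(["cmd_a", "cmd_b", "cmd_c"])
```

A batch containing a single command carries both flags on that one command.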