The present invention relates generally to processing memory requests in a computer system and in particular to methods and systems for efficiently retrieving data from memory for a graphics processing unit having multiple clients.
In current graphics processing systems, the number and processing speed of memory clients have increased enough to make memory access latency a barrier to achieving high performance. In some instances, various memory clients share a common memory, and each memory client issues requests for data stored in the common memory based on individual memory access requirements. Requests from these memory clients are typically serialized through a common interface. As a result, requests are sometimes queued up and processed on a first-in-first-out (FIFO) basis. This can result in slow, inefficient processing of memory requests.
Since many computers are configured to use Dynamic Random Access Memory (DRAM) or synchronous DRAM (SDRAM), memory requests are also configured to retrieve data from these types of memories. DRAMs use a simple memory cell geometry that permits implementation of large memory arrays at minimum cost and power consumption on a single semiconductor chip. In a DRAM, all of the cells in a given group of memory locations, or a so-called “row,” are activated at the same time. Multiple read or write operations can thus be performed with various cells within the row, but only while it is active. If a new access is to be made to a different row, a precharge operation must be completed to close the presently active row, and then an activate operation must be performed to open the different row. SDRAM, on the other hand, uses a master clock signal to synchronously perform read/write accesses and refresh cycles. SDRAM arrays can also be split into two or more independent memory banks, and two or more rows can therefore be active simultaneously, with one open row per independent bank.
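The row behavior described above can be illustrated with a minimal sketch. The names `Bank` and `commands_for_access` and the command strings are hypothetical and chosen only for illustration; the sketch models a single bank with at most one open row and is not taken from any particular device.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bank:
    # None means the bank is precharged, i.e. no row is currently active.
    open_row: Optional[int] = None


def commands_for_access(bank: Bank, row: int) -> List[str]:
    """Return the commands needed to access `row`, updating the bank state."""
    cmds: List[str] = []
    if bank.open_row != row:
        if bank.open_row is not None:
            # A different row is active and must be closed first.
            cmds.append("PRECHARGE")
        cmds.append("ACTIVATE")
        bank.open_row = row
    # Once the target row is active, the read or write can proceed.
    cmds.append("READ/WRITE")
    return cmds
```

Under these assumptions, repeated accesses to the active row need only the read/write command, while an access to a different row needs all three commands.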
DRAM has much slower access times than SDRAM. The DRAM access time is slow because the switching speed within a conventional DRAM memory cell is not as fast as the switching speeds now common in central processing units (CPUs). As a result, when using high speed processors with conventional DRAMs, the processor must frequently wait for memory accesses to be completed. For example, delays equal to the precharge time and activate time are experienced whenever a different row must be accessed on a subsequent transaction. However, the precharge operation is only necessary if the row address changes; if the row address does not change on the subsequent access, a precharge operation that has already been executed was unnecessary, and the device was needlessly placed in an idle state.
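The cost of row changes can be made concrete with a small cycle count. The timing constants below are hypothetical placeholders rather than values from any data sheet, and the function name `total_cycles` is likewise an assumption; the sketch counts cycles for a sequence of row accesses to a single bank in which each row stays open until a different row is requested.

```python
from typing import Iterable, Optional

# Hypothetical timings in clock cycles; real devices specify their own values.
T_PRECHARGE = 3
T_ACTIVATE = 3
T_READ_WRITE = 1


def total_cycles(rows: Iterable[int]) -> int:
    """Count cycles for accesses to one bank, paying row penalties on changes."""
    open_row: Optional[int] = None
    cycles = 0
    for row in rows:
        if row != open_row:
            if open_row is not None:
                cycles += T_PRECHARGE  # close the currently active row
            cycles += T_ACTIVATE       # open the requested row
            open_row = row
        cycles += T_READ_WRITE
    return cycles
```

With these assumed timings, the sequence of rows 1, 1, 1, 2, 2 costs 14 cycles, while the alternating sequence 1, 2, 1, 2 costs 25 cycles despite containing fewer accesses.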
SDRAM, on the other hand, may be accessed by multiple components such as a central processing unit (CPU), display refresh module, graphics unit, etc. Different components are given varying levels of priority based on the effect of latency on the component. For example, a display refresh module may be given a higher priority in accessing the SDRAM since any latency may result in easily-noticed, detrimental visual effects. If a computer system is designed to support interleaved accesses to multiple rows, SDRAMs make it possible to complete these accesses without intervening precharge and activate operations, provided that the rows to be accessed are all in separate SDRAM banks.
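Priority by latency sensitivity can be sketched as a simple selection among pending requests. The client names, the priority values, and the function `pick_next` are hypothetical; the sketch assumes a fixed priority order in which a lower number wins and ties are resolved in arrival order.

```python
from typing import List, Optional, Tuple

# Hypothetical priority table; a lower number means a more latency-sensitive client.
PRIORITY = {"display_refresh": 0, "cpu": 1, "graphics": 2}

Request = Tuple[str, int]  # (client name, address)


def pick_next(pending: List[Request]) -> Optional[Request]:
    """Return the highest-priority pending request, or None if there is none."""
    if not pending:
        return None
    # min() returns the first minimal element, so ties keep arrival order.
    return min(pending, key=lambda req: PRIORITY[req[0]])
```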
Regardless of whether DRAM or SDRAM is used, a command queue is used to pipeline requests from the clients requesting memory (e.g., a graphics display, texturing, rendering, etc.) to the memory controller and the memory. FIG. 1 illustrates a prior art pipeline for a computer system including N clients (client 1 105A, client 2 105B, . . . , client N 105N), a memory controller 110, an arbiter 115, a command queue 120, a look ahead structure 125, and a memory 130. In the prior art, the clients 105A through 105N determine when more data is needed and send individual requests to the memory controller 110 requesting that the memory controller 110 retrieve the specific data from the memory 130. The individual requests include the address, width and size of each array of data being requested. The memory controller 110 then uses the arbiter 115 to prioritize the requests and queues up those requests using command queue 120. Once the memory controller 110 has queued up the individual memory requests, the look ahead structure 125 prefetches the requested data from the memory 130. The retrieved data is sent back to the clients 105A, . . . , 105N where it is stored in a respective client buffer until it is needed by the client 105A, . . . , 105N. The client 105A, . . . , 105N then processes the retrieved data.
Since memory controller 110 uses only one arbiter, the command queue 120 uses three pointers to process the memory requests. The pointers include one pointer for precharging, one pointer for activating, and one pointer for reading/writing. Since there is only one arbitration point, there is less flexibility in managing DRAM bank state than with three arbiters (precharge, activate, read/write). Moreover, if the client is isochronous, the command queue 120 can cause a bottleneck and increase read access time for the isochronous client. Many queued requests in the command queue take time to execute in the DRAM, thus adding to the isochronous client access time.
Memory systems lacking command queues can couple the arbiters closely to the DRAM bank state. This allows better decision making when precharging and activating banks. Banks are not scheduled for precharge and activate until the bank is ready to accept the command. Delaying the arbitration decision allows later-arriving clients to participate, resulting in a better arbitration decision.
Another problem can occur when multiple RMWs (read-modify-writes) occupy the command queue. Graphics chips utilizing frame buffer data compression in order to increase effective memory bandwidth can incur high RMW delay penalties when a compression-unaware client writes over part of an existing compressed data tile in memory. The memory system must perform an RMW cycle comprising a read, a decompression, and a write back to the frame buffer. An RMW operation lasts tens of cycles, and multiple RMW requests queued in the command queue may substantially delay a subsequent isochronous read request.
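The delay that queued RMW operations impose on a later read can be estimated with simple arithmetic. The per-command cycle costs below are hypothetical (the text states only that an RMW lasts tens of cycles), and the function name `cycles_ahead` is an assumption; the sketch sums the cost of every command ahead of a given entry in a FIFO queue.

```python
from typing import Dict, Sequence

# Hypothetical per-command costs in DRAM cycles.
COST: Dict[str, int] = {"READ": 4, "WRITE": 4, "RMW": 30}


def cycles_ahead(queue: Sequence[str], index: int) -> int:
    """Return the cycles spent on commands queued before position `index`."""
    return sum(COST[cmd] for cmd in queue[:index])
```

Under these assumed costs, a read queued behind three RMW operations waits 90 cycles, compared with 12 cycles behind three plain reads.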
For example, one problem with the prior art is that the serial nature of the FIFO command queue 120 can make it difficult for arbiter 115 to make selections that avoid bank conflicts and therefore avoid wasting clock cycles. Moreover, some commands can require long access times while other commands may have variable access times. It may be difficult for the arbiter to know the number of DRAM cycles in the command queue due to compressed reads. As a consequence, in some applications it is difficult for arbiter 115 to make arbitration decisions that efficiently utilize memory 130, resulting in lost clock cycles and reduced performance. Another problem with the prior art is latency introduced by command queue 120. Ideally, enough delay is introduced between the precharge, activate, and read/write commands to facilitate overlapping bank operations. However, too much delay adds latency to memory requests, which requires more latency buffering in the clients, thus increasing chip area. Latency problems become more severe when several requests are in the command queue. These latencies can reduce performance by as much as ⅓.
A system without a command queue works well when there are many available clients requesting different DRAM banks. This allows the arbiter to interleave groups of client requests to the different DRAM banks and hide DRAM page management. When only a single client is active, however, all of the traffic to different DRAM banks that is needed to hide DRAM page management must come from that one client.
Therefore, what is needed is a system and method for the client that allows the arbiter to look ahead in the client request stream in order to prepare DRAM banks by precharging and activating. With this system and method, the DRAM page management can be hidden behind read/write transfers, resulting in higher DRAM efficiency and lower read latency to the client. It is this look ahead mechanism that is the scope of this invention.
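The general idea of looking ahead in a client request stream can be sketched as follows. This is a hypothetical illustration of the stated need only, not the mechanism of the invention; the function name `lookahead_prepare`, the `window` parameter, and the tuple-based command format are all assumptions. The sketch treats the first request in the stream as the transfer in progress and prepares other banks for the requests that follow it.

```python
from typing import Dict, List, Optional, Tuple

Request = Tuple[int, int]  # (bank, row)


def lookahead_prepare(
    open_rows: Dict[int, Optional[int]],
    stream: List[Request],
    window: int,
) -> List[tuple]:
    """Return precharge/activate commands for upcoming requests in the window.

    stream[0] is assumed to be the transfer in progress; its bank is left alone.
    open_rows maps each bank to its open row (None if precharged) and is updated.
    """
    cmds: List[tuple] = []
    if not stream:
        return cmds
    # Banks already claimed by an earlier request must not be disturbed.
    claimed = {stream[0][0]}
    for bank, row in stream[1:1 + window]:
        if bank in claimed:
            continue
        claimed.add(bank)
        if open_rows.get(bank) != row:
            if open_rows.get(bank) is not None:
                cmds.append(("PRECHARGE", bank))
            cmds.append(("ACTIVATE", bank, row))
            open_rows[bank] = row
    return cmds
```

In this sketch the preparation commands target only banks other than the one being transferred, which is what would allow page management to overlap with the read/write transfer.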