Graphics systems typically use a frame buffer to store graphics data. One issue that arises in graphics processing is efficiently handling read-modify-write (RMW) requests.
Some of the problems associated with conventional RMW memory architectures may be understood by reference to FIG. 1. FIG. 1 illustrates a prior art graphics system 100. A graphics processing unit (GPU) 105 includes two or more different clients 110-A and 110-B. A memory controller 120 includes an arbiter 125 and a decompression module 130. A frame buffer 135 (e.g., DRAM memory) is configured to store graphics data as either compressed tiles 140 or uncompressed tiles 145. Each tile may correspond to an integer number of atomic units of memory storage, an atomic unit being the smallest unit of memory storage. An individual 128 B tile may, for example, consist of eight atomic units of 16 B each. Compression may, for example, be performed to reduce the amount of data that must be transferred over a bandwidth-limited memory bus 150. The compressed data may, for example, be encoded into a single 16 B unit that represents the entire tile. Compression bits may be stored on-chip to indicate whether each tile is compressed or uncompressed.
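The example sizes above can be checked with a short calculation. The following is a minimal sketch that uses only the figures given in the text (a 128 B tile, 16 B atomic units, and a one-unit compressed encoding); the function names are hypothetical and are not part of the described system.

```python
ATOM_BYTES = 16    # atomic unit of storage, from the example in the text
TILE_BYTES = 128   # tile size, from the example in the text


def atoms_per_tile(tile_bytes=TILE_BYTES, atom_bytes=ATOM_BYTES):
    """Return how many atomic units make up one tile."""
    if tile_bytes % atom_bytes != 0:
        raise ValueError("a tile must be an integer number of atomic units")
    return tile_bytes // atom_bytes


def bytes_on_bus(is_compressed, compressed_atoms=1):
    """Bytes moved over the memory bus to transfer one tile.

    compressed_atoms=1 mirrors the example of a tile encoded into
    a single 16 B unit; other encodings are an assumption.
    """
    atoms = compressed_atoms if is_compressed else atoms_per_tile()
    return atoms * ATOM_BYTES


if __name__ == "__main__":
    print(atoms_per_tile())        # units per uncompressed tile
    print(bytes_on_bus(False))     # uncompressed transfer size in bytes
    print(bytes_on_bus(True))      # compressed transfer size in bytes
```

Under these example figures a compressed tile moves one eighth of the data of an uncompressed tile across the memory bus 150.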
However, an individual client 110-B may be a “naïve” client that is not capable of independently performing compression/decompression. When a naïve client performs a read and the data is stored compressed in memory, the memory controller 120 decompresses the read data for the naïve client and returns it uncompressed. In the context of a RMW, when a naïve client makes a write request that may require a RMW, the memory controller determines whether the existing data in memory is compressed and, if so, reads the compressed data, decompresses it, and writes out the entire tile to memory in an uncompressed format before allowing the client to perform its write. In many applications a naïve client 110-B performs only a partial write of tile data. That is, the naïve client modifies only a small portion of the data in a compressed tile 140. If the naïve client overwrote the entire tile, there would be no need to perform a RMW operation even if the stored data were previously compressed.
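The write handling described above can be sketched as a small software model. This is an illustration only, not the prior art hardware: the class name, the method names, the access counters, and the placeholder codec (one 16 B unit repeated across the tile) are all assumptions introduced for the example.

```python
ATOM_BYTES = 16    # atomic unit of storage, from the text
TILE_BYTES = 128   # example tile size, from the text


class FrameBufferModel:
    """Toy model of a frame buffer with on-chip compression bits (hypothetical)."""

    def __init__(self):
        self.dram = {}         # tile index -> bytes as stored in DRAM
        self.compressed = {}   # tile index -> compression bit (kept on-chip)
        self.reads = 0         # DRAM reads issued by the controller
        self.writes = 0        # DRAM writes issued by the controller

    def load_compressed(self, tile, atom):
        """Set up a tile stored as one compressed unit (not counted as traffic)."""
        if len(atom) != ATOM_BYTES:
            raise ValueError("compressed form must be one atomic unit")
        self.dram[tile] = bytes(atom)
        self.compressed[tile] = True

    @staticmethod
    def _decompress(atom):
        # Placeholder codec (assumption): the unit repeats across the tile.
        return atom * (TILE_BYTES // ATOM_BYTES)

    def naive_write(self, tile, offset, data):
        """Write on behalf of a client that cannot compress or decompress.

        Returns True if a read-modify-write was required.
        """
        if offset < 0 or offset + len(data) > TILE_BYTES:
            raise ValueError("write falls outside the tile")
        full_overwrite = offset == 0 and len(data) == TILE_BYTES
        did_rmw = False
        if self.compressed.get(tile, False) and not full_overwrite:
            packed = self.dram[tile]                  # read the compressed data
            self.reads += 1
            self.dram[tile] = self._decompress(packed)  # write the tile back uncompressed
            self.writes += 1
            did_rmw = True
        if full_overwrite:
            current = bytearray(data)                 # old contents are irrelevant
        else:
            current = bytearray(self.dram.get(tile, bytes(TILE_BYTES)))
            current[offset:offset + len(data)] = data
        self.dram[tile] = bytes(current)              # the client's own write
        self.writes += 1
        self.compressed[tile] = False                 # tile is now uncompressed
        return did_rmw
```

In this model a partial write to a compressed tile costs one DRAM read and two DRAM writes, whereas a full-tile overwrite or a write to an uncompressed tile costs a single DRAM write.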
Note that a RMW performed on behalf of a naïve client typically takes a significant number of clock cycles to complete due to DRAM write-to-read and read-to-write turnaround time. In other words, a RMW write for a naïve client takes a long time to complete compared to a simple write operation. A RMW operation for a naïve client thus results in accesses from other clients being blocked until the RMW is completed. As a result, RMWs increase the latency of reads by other clients. One technique in the prior art to address blocking issues was to limit, as far as possible, the number of RMW operations in flight. Another technique in the prior art to address RMW blocking issues was to include sufficient buffer capacity in individual clients to account for the increased read latency caused by RMWs. For example, for isochronous clients, additional buffering can be included to account for the latency associated with blocking created by the RMWs of other clients. However, providing additional buffering to account for RMW latency increases costs.
In light of the above-described problems, the apparatus, system, and method of the present invention were developed.