1. Field of the Invention
Embodiments of the present invention relate generally to graphics systems and more specifically to a method and system for using a graphics processing unit (“GPU”) frame buffer in a multi-GPU system as cache memory.
2. Description of the Related Art
Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
With the increasing demand for realism and interactivity in graphics applications, some multi-GPU systems, such as the NVIDIA Quadro Plex visual computing systems, have been developed. These multi-GPU systems typically use a bus structure to connect the multiple GPUs. Each GPU is coupled with its own local frame buffer. However, the capacity of the local frame buffer usually becomes insufficient when a graphics-intensive application runs. For instance, the texture maps needed in the rendering process of such a graphics-intensive application often exceed this memory capacity. One conventional approach is to use system memory to store any data that does not fit in the local frame buffer.
To illustrate, FIG. 1A is a simplified block diagram of a conventional system 100, which includes a central processing unit (“CPU”) 108, BIOS 110, system memory 102, and a chipset 112 that is directly coupled to a graphics subsystem 114. The system memory 102 further contains a graphic driver 104 and a memory block 106. The chipset 112 provides system interfaces to the CPU 108, the system memory 102, the graphics subsystem 114, and other peripheral devices not shown in the figure. The graphics subsystem 114 includes two graphics adapters 120 and 130. Each graphics adapter has a single GPU. A primary GPU 126 and a secondary GPU 132 are coupled to their own local frame buffers 128 and 134, respectively. The primary GPU 126 and the secondary GPU 132 are also coupled to the chipset 112 via communication links such as Peripheral Component Interface (“PCI”) Express.
When the local frame buffers 128 and 134 are full, if additional texture information needs to be stored, the conventional approach accesses the memory block 106 in the system memory 102 to store such texture information. Because the texture data is transported to or from the memory block 106 on the system bus of the system 100, one drawback of this approach is that it pollutes the system bus. Specifically, if much of the system bus bandwidth is occupied with the texture data, then an undesirable latency is introduced into the delivery of other types of data, such as audio data. This latency forces the application needing this data, such as an audio playback application, to slow down and thus negatively impacts its performance.
Another drawback of the conventional approach of using the memory block 106 to store texture data is the inefficiency of handling multiple texture requests contending to access the memory block 106. To illustrate, in conjunction with FIG. 1A, FIG. 1B shows a push (also commonly referred to as “blit”) operation performed by the secondary GPU 132 and a pull and blend operation performed by the primary GPU 126. Typically, before the primary GPU 126 can scan out its local frame buffer 128 to a display device 138 in block 158, the secondary GPU 132 transfers the output of block 152 into the memory block 106 in a push operation in block 154. The primary GPU 126 then needs to pull the data from the memory block 106 and blend the data with the content of its local frame buffer 128 in block 156. Here, because both the primary GPU 126 and the secondary GPU 132 access the same memory block 106, the primary GPU 126 needs to wait until the secondary GPU 132 completes its push operation before it can proceed with its pull and blend operation. In other words, the push operation and the pull and blend operation are forced to be synchronized and can only occur in sequence.
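The forced serialization described above can be sketched as follows. This is an illustrative model only, not an implementation from the disclosure: two threads stand in for the secondary and primary GPUs, a shared list stands in for the memory block 106, and an event models the completion of the push operation, showing why the pull and blend operation cannot begin until the push finishes.

```python
import threading

# Stand-in for the memory block 106 in system memory (illustrative).
memory_block = []
push_done = threading.Event()

def secondary_gpu_push():
    # Secondary GPU 132 pushes (blits) its rendered output to system memory.
    rendered = [1, 2, 3]          # stand-in for rendered pixel data
    memory_block.extend(rendered)
    push_done.set()               # signal that the push has completed

def primary_gpu_pull_and_blend(frame_buffer):
    # Primary GPU 126 must wait for the push to complete before it can
    # pull the data and blend it with its local frame buffer contents.
    push_done.wait()
    for i, texel in enumerate(memory_block):
        frame_buffer[i] += texel  # simplistic additive "blend"

frame_buffer = [10, 10, 10]       # stand-in for local frame buffer 128
t_push = threading.Thread(target=secondary_gpu_push)
t_blend = threading.Thread(target=primary_gpu_pull_and_blend,
                           args=(frame_buffer,))
t_blend.start()                   # even if started first, it blocks on the event
t_push.start()
t_push.join()
t_blend.join()
print(frame_buffer)               # [11, 12, 13]: blend occurs only after push
```

Even though both threads are started concurrently, the event forces the two operations to occur in sequence, mirroring the contention on the shared memory block described above.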
As the foregoing illustrates, what is needed in the art is a method and system for using GPU frame buffers as caches that reduces system memory accesses and addresses at least the shortcomings of the prior art approaches set forth above.