1. Field of the Invention
Embodiments of the present invention relate generally to the field of computing devices and more specifically to a technique for efficiently implementing a multi-chip Graphics Processing Unit.
2. Description of the Related Art
Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
To meet the ever increasing demands for realism, quality, and real-time interactivity of displayed images and videos in graphics applications, one solution is to use multiple graphics processing units (“GPUs”) in a graphics system. FIG. 1 illustrates a prior art graphics system 100 including two GPUs communicating over a PCIE bus. As shown, the graphics system 100 includes GPUs 102 and 106, frame buffers 104 and 108, and a PCIE bridge 110. The memory bus 112 couples the frame buffer 104 to the GPU 102, and the memory bus 118 couples the frame buffer 108 to the GPU 106. Each of the memory busses 112,118 provides an approximate bandwidth of 100 GB/s. PCIE busses 114,116 couple the PCIE bridge 110 to the GPUs 102,106, respectively.
Using the scalable link interface (“SLI”) technology developed by Nvidia Corporation of Santa Clara, Calif., the two GPUs in graphics system 100 typically take on processing tasks in a master and slave arrangement to produce a single output. The interfaces between the GPUs and the video memory modules, such as memory busses 112 and 118, operate at up to 100 Gigabytes (“GB”) per second. By contrast, each PCIE bus 114, 116 typically supports 16 lanes, and each lane supports a data rate of 500 Megabytes (“MB”) per second, so each PCIE bus operates at up to 8 GB per second, less than one tenth of the memory bus bandwidth.
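The bandwidth figures above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the constant and function names are hypothetical and are not part of graphics system 100, and the numbers are the approximate peak rates stated above, with 1 GB taken as 1000 MB.

```python
# Illustrative arithmetic only. All names are hypothetical; the figures are
# the approximate peak rates stated in the text.
PCIE_LANES = 16               # lanes per PCIE bus 114, 116
PCIE_LANE_RATE_MB_S = 500     # MB per second per lane
MEMORY_BUS_RATE_GB_S = 100    # GB per second for memory busses 112, 118


def pcie_bus_rate_gb_s(lanes: int = PCIE_LANES,
                       lane_rate_mb_s: int = PCIE_LANE_RATE_MB_S) -> float:
    """Peak PCIE bus rate in GB per second (assuming 1 GB = 1000 MB)."""
    return lanes * lane_rate_mb_s / 1000


def bandwidth_gap(memory_rate_gb_s: float = MEMORY_BUS_RATE_GB_S) -> float:
    """How many times faster the memory bus is than one PCIE bus."""
    return memory_rate_gb_s / pcie_bus_rate_gb_s()


print(pcie_bus_rate_gb_s())  # 8.0
print(bandwidth_gap())       # 12.5
```

Under these assumptions, data exchanged between the GPUs moves roughly 12.5 times more slowly than data read from a local frame buffer.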
Certain operations supported by the SLI technology further highlight the performance issues caused by this bandwidth gap: split-frame rendering (“SFR”), alternate-frame rendering (“AFR”), and SLI anti-aliasing. Suppose GPU 102 is the master GPU in graphics system 100, and GPU 106 is the slave GPU. In the SFR mode, GPU 102 is responsible for the top section of the screen, and GPU 106 is responsible for the bottom section. Although each GPU in an SLI configuration generally maintains its own copy of all local resources without having to share with the other GPU, any change to a texture render target needs to be broadcast to all other GPUs. The performance penalty related to this broadcast is exacerbated by the discrepancy between the data rates of the memory busses 112, 118 (approximately 100 GB/s) and the PCIE busses 114, 116 (approximately 8 GB/s).
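The SFR division of work and the cost of a render-target broadcast can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the SLI implementation: the function names are hypothetical, the even split of scanlines is an assumption (the text says only that one GPU takes the top section and the other the bottom), and the transfer time is an ideal figure that ignores protocol overhead.

```python
# Hypothetical sketch of SFR bookkeeping. Names, the even split, and the
# example render-target size are illustrative assumptions.

def split_frame_rows(height: int, num_gpus: int = 2) -> list:
    """Divide scanlines 0..height-1 into contiguous bands, top to bottom.

    Band 0 goes to the first GPU (the top section), band 1 to the next, etc.
    """
    base, extra = divmod(height, num_gpus)
    bands = []
    start = 0
    for gpu in range(num_gpus):
        rows = base + (1 if gpu < extra else 0)  # spread any remainder
        bands.append(range(start, start + rows))
        start += rows
    return bands


def transfer_time_ms(num_bytes: int, rate_gb_s: float) -> float:
    """Ideal time in milliseconds to move num_bytes at rate_gb_s (1 GB = 1e9 bytes)."""
    return num_bytes / (rate_gb_s * 1e9) * 1e3


# Example: a hypothetical 1920x1080 render target at 4 bytes per pixel.
top, bottom = split_frame_rows(1080)
target_bytes = 1920 * 1080 * 4
```

Because the ideal transfer time is inversely proportional to the link rate, broadcasting a changed render target over the PCIE path takes proportionally longer than reading the same data from the local frame buffer.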
In the AFR mode, GPU 102 works on the odd frames, and GPU 106 independently works on the even frames. Because of this alternating mechanism, the GPU responsible for a frame that is not self-contained pushes the necessary data to the other GPU so that the other GPU can operate on the next frame. A self-contained frame is a frame that shares little to no data among the multiple GPUs in the AFR mode. For example, if frame 1 incorporates certain texture render targets that are needed in the subsequent frame (that is, frame 1 is not self-contained), then GPU 102 pushes the texture data to GPU 106 so that GPU 106 can operate on frame 2. As discussed above, transferring such texture data, especially via the relatively inefficient data transfer path, introduces overhead and reduces the throughput of graphics system 100.
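The alternating assignment and the resulting data pushes can be sketched in Python as follows. This is a minimal illustration, assuming a simple per-frame flag that records whether the frame is self-contained; the Frame class and both function names are hypothetical and do not correspond to any SLI interface.

```python
# Hypothetical sketch of AFR scheduling. All names are illustrative.
from dataclasses import dataclass

GPU_102 = 102  # works on odd frames
GPU_106 = 106  # works on even frames


@dataclass
class Frame:
    number: int           # 1-based frame number
    self_contained: bool  # False if the next frame needs this frame's data


def gpu_for_frame(frame_number: int) -> int:
    """Odd frames go to GPU 102, even frames to GPU 106."""
    return GPU_102 if frame_number % 2 == 1 else GPU_106


def plan_pushes(frames):
    """Return (frame_number, source_gpu, destination_gpu) for every frame
    whose data must be pushed to the GPU rendering the next frame."""
    pushes = []
    for frame in frames:
        if not frame.self_contained:
            source = gpu_for_frame(frame.number)
            destination = gpu_for_frame(frame.number + 1)
            pushes.append((frame.number, source, destination))
    return pushes


frames = [Frame(1, False), Frame(2, True), Frame(3, False)]
print(plan_pushes(frames))  # [(1, 102, 106), (3, 102, 106)]
```

Each entry in the returned list corresponds to one transfer over the inter-GPU path, so a sequence with many frames that are not self-contained incurs the transfer overhead on nearly every frame.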
In the SLI anti-aliasing mode, each GPU renders the same frame, except that the two frames are offset by half of a pixel. Additional image quality improvements may be realized by further duplicating, offsetting and overlaying additional full screen images. The drawback with SLI anti-aliasing, however, is that little improvement in image quality is realized by using more than two GPUs, especially in light of the cost associated with each additional GPU. Also, with this technique, overall rendering speed is not necessarily increased.
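The effect of overlaying two renderings offset by half of a pixel can be illustrated with a toy example. The Python sketch below is not the SLI implementation: the scene function, the diagonal direction of the offset, and the equal-weight blend are all assumptions made for illustration, and every name is hypothetical.

```python
# Hypothetical sketch of anti-aliasing by overlaying two offset renderings.
# The toy scene, the diagonal offset, and the equal-weight blend are assumptions.

def scene(x: float, y: float) -> float:
    """Toy scene: full intensity to the left of a vertical edge at x = 2.75."""
    return 1.0 if x < 2.75 else 0.0


def render(width: int, height: int, offset: float):
    """One sample per pixel, taken at the pixel centre shifted by offset."""
    return [[scene(col + 0.5 + offset, row + 0.5 + offset)
             for col in range(width)]
            for row in range(height)]


def blend(image_a, image_b):
    """Overlay two images of equal size by averaging corresponding pixels."""
    return [[(a + b) / 2 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(image_a, image_b)]


first = render(4, 1, offset=0.0)   # stands in for the first GPU's frame
second = render(4, 1, offset=0.5)  # same frame, offset by half of a pixel
print(blend(first, second))        # [[1.0, 1.0, 0.5, 0.0]]
```

The pixel crossed by the edge takes an intermediate value in the blended image, which is the smoothing effect. Note that both renderings still process the entire frame, consistent with the observation that overall rendering speed is not necessarily increased.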
Furthermore, unless the load balancing between the two GPUs is done flawlessly, the GPUs are unlikely to complete their given tasks at the same time in every case. Also, the GPUs need to be synchronized so that the proper frame is displayed at the appropriate time. As has been shown, the overhead of performing load balancing, synchronization, and data exchanges via inefficient data transfer paths between the GPUs reduces the overall throughput of graphics system 100 to less than twice that of a single-GPU system.
Yet another drawback of these three SLI techniques is that they rely on the relatively slow PCIE bus for data transmissions between the two GPUs. This structure causes problems in situations where relatively large amounts of data need to be transmitted quickly between the two GPUs, such as when data rendered by one GPU is used as texture data by the other GPU. Because this data cannot be transmitted quickly enough over the PCIE bus, typical SLI systems are designed with frame buffers large enough to store the images being rendered by both GPUs so that these types of transmissions can be avoided. Using larger frame buffers can increase the cost of a two-GPU system by upwards of fifty percent.
As the foregoing illustrates, what is needed in the art is a multi-GPU graphics system that offers a better price-performance tradeoff than prior art graphics systems and is scalable beyond two GPUs to realize further performance advantages.