1. Field of the Invention
Embodiments of the present invention relate generally to video processing and more specifically to utilizing memory interface bandwidth to connect multiple graphics processing units.
2. Description of the Related Art
Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
To meet the ever-increasing demands for realism, quality, and real-time interactivity of displayed images and videos in graphics applications, one solution is to use multiple graphics processing units (“GPUs”) in a graphics system. FIG. 1 is a simplified block diagram of graphics system 100, which includes two graphics adapters, 102 and 112, each of which contains a GPU. Graphics adapter 102 includes GPU 104 and video memory 106, which further includes at least frame buffer 108 and texture memory 110. Frame buffer 108 contains data to be transmitted to the display device of graphics system 100, a process otherwise referred to as “scanning out.” Texture memory 110 contains not only the texture map information but also texture render targets, each of which stores intermediate values that contribute to the final image in the frame buffer. Similarly, graphics adapter 112 includes GPU 114 and video memory 116, which further includes frame buffer 118 and texture memory 120. These two graphics adapters are connected to chipset 122, which supports Peripheral Component Interconnect (“PCI”) Express (“PCIe”), to access the system resources of graphics system 100.
Using the scalable link interface (“SLI”) technology developed by Nvidia Corporation of Santa Clara, Calif., the two graphics adapters in graphics system 100 typically take on processing tasks in a master and slave arrangement to produce a single output. In one implementation using the SLI technology, GPU 104 is further coupled to GPU 114 via video bridge 124. In a typical setup, video bridge 124 operates at up to 10 Gigabytes (“GB”) per second. The interfaces between the GPUs and the video memory modules, such as memory interfaces 126 and 128, operate at up to 100 GB per second. Graphics adapters 102 and 112 are PCIe-compliant devices that typically support 16 lanes, otherwise referred to as x16 graphics adapters. Each lane supports a data rate of 500 Megabytes (“MB”) per second. Each of PCIe links 130 and 132 typically supports 8 lanes and thus operates at up to 4 GB per second. Because video bridge 124 and PCIe links 130 and 132 are significantly slower than memory interfaces 126 and 128, they become bottlenecks in graphics system 100 and reduce the overall throughput of the system.
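The data rates above can be checked with simple arithmetic. The following is a minimal sketch in Python; the constant and function names are illustrative assumptions rather than part of any SLI or PCIe interface, decimal units (1 GB = 1000 MB) are assumed, and the rates are the approximate peak values quoted in this section.

```python
# Approximate peak rates quoted in the text, in MB per second.
# Decimal units (1 GB = 1000 MB) are an assumption made for simplicity.
PCIE_LANE_MB_S = 500              # one PCIe lane
MEMORY_INTERFACE_MB_S = 100_000   # memory interfaces 126 and 128
VIDEO_BRIDGE_MB_S = 10_000        # video bridge 124


def pcie_link_mb_s(lanes: int) -> int:
    """Peak rate of a PCIe link with the given number of lanes."""
    return lanes * PCIE_LANE_MB_S


def slowdown_vs_memory(path_mb_s: int) -> float:
    """How many times slower a transfer path is than the memory interface."""
    return MEMORY_INTERFACE_MB_S / path_mb_s


if __name__ == "__main__":
    link = pcie_link_mb_s(8)  # PCIe links 130 and 132 (8 lanes each)
    print(f"8-lane PCIe link: {link} MB/s")
    print(f"video bridge slowdown: {slowdown_vs_memory(VIDEO_BRIDGE_MB_S):.0f}x")
    print(f"PCIe link slowdown: {slowdown_vs_memory(link):.0f}x")
```

Under these figures, the 8-lane PCIe link peaks at 4000 MB per second, the video bridge is about 10 times slower than the memory interface, and the PCIe link is about 25 times slower.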
Certain operations supported by the SLI technology further highlight these performance issues, namely split-frame rendering (“SFR”), alternate-frame rendering (“AFR”), and SLI anti-aliasing. Suppose GPU 104 is the master GPU in graphics system 100, and GPU 114 is the slave GPU. In the SFR mode, GPU 104 is responsible for the top section of the screen, and GPU 114 is responsible for the bottom section. Although each GPU in an SLI configuration generally maintains its own copy of all local resources without having to share with the other GPU, any change to a texture render target needs to be broadcast to all other GPUs. The performance penalty related to this broadcast is further exacerbated by the discrepancy between the data rates of memory interfaces 126 and 128 (i.e., approximately 100 GB/s) and whichever data transfer path (i.e., one path via video bridge 124 at approximately 10 GB/s or another path via chipset 122 at approximately 4 GB/s) the GPUs select for the broadcast.
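To illustrate the scale of this penalty, the following minimal Python sketch estimates the ideal time to move one texture render target over each path. The render target size (1024 by 1024 pixels at 4 bytes per pixel), the function name, and the convention 1 GB = 10^9 bytes are assumptions chosen for illustration; latency and protocol overhead are ignored.

```python
# Approximate peak rates quoted in the text, in GB per second.
# 1 GB = 10**9 bytes is an assumption made for simplicity.
PATHS_GB_S = {
    "memory interface 126/128": 100,
    "video bridge 124": 10,
    "PCIe path via chipset 122": 4,
}


def transfer_time_us(size_bytes: int, rate_gb_s: float) -> float:
    """Ideal time in microseconds to move size_bytes at rate_gb_s.

    bytes / (GB/s * 1e9 bytes/GB) gives seconds; multiplying by 1e6
    gives microseconds, which simplifies to bytes / (rate * 1e3).
    """
    return size_bytes / (rate_gb_s * 1000)


if __name__ == "__main__":
    # Hypothetical render target: 1024 x 1024 pixels, 4 bytes per pixel.
    render_target_bytes = 1024 * 1024 * 4
    for name, rate in PATHS_GB_S.items():
        t = transfer_time_us(render_target_bytes, rate)
        print(f"{name}: {t:.1f} us")
```

Because the transfer time scales inversely with the data rate, the same broadcast takes roughly 10 times longer over the video bridge and roughly 25 times longer over the PCIe path than a local memory access of the same size.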
In the AFR mode, GPU 104 works on the odd frames, and GPU 114 independently works on the even frames. Because of this alternating mechanism, the GPU responsible for a frame that is not self-contained pushes the necessary data to the other GPU so that the other GPU can operate on the next frame. A self-contained frame is a frame that shares little to no data among the multiple GPUs in the AFR mode. For example, if frame 1 incorporates certain texture render targets that are needed in the subsequent frame (in other words, frame 1 is not self-contained), then GPU 104 pushes the texture data to GPU 114 so that GPU 114 can operate on frame 2. As in the SFR mode, the transfer of such texture data, especially via the relatively inefficient data transfer path, introduces overhead and reduces the throughput of graphics system 100.
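The alternating assignment and the resulting pushes can be sketched as follows. This is a minimal Python illustration under the assumptions of this section (GPU 104 takes odd frames, GPU 114 takes even frames); the function names and the `self_contained` mapping are hypothetical and do not correspond to any driver interface.

```python
from typing import Dict, List, Tuple

MASTER_GPU = 104  # works on odd frames
SLAVE_GPU = 114   # works on even frames


def gpu_for_frame(frame_number: int) -> int:
    """Return the GPU that renders the given frame in the AFR mode."""
    return MASTER_GPU if frame_number % 2 == 1 else SLAVE_GPU


def plan_pushes(self_contained: Dict[int, bool]) -> List[Tuple[int, int, int]]:
    """List the pushes needed, as (frame, source GPU, destination GPU).

    A frame that is not self-contained forces its GPU to push data to
    the GPU that renders the next frame.
    """
    pushes = []
    for frame in sorted(self_contained):
        if not self_contained[frame]:
            pushes.append((frame, gpu_for_frame(frame), gpu_for_frame(frame + 1)))
    return pushes


if __name__ == "__main__":
    # Hypothetical sequence: frames 1 and 4 are not self-contained.
    frames = {1: False, 2: True, 3: True, 4: False}
    for frame, src, dst in plan_pushes(frames):
        print(f"frame {frame}: GPU {src} pushes to GPU {dst}")
```

Every entry returned by `plan_pushes` corresponds to one transfer over the slower inter-GPU path, so a sequence with many frames that are not self-contained incurs proportionally more overhead.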
In the SLI anti-aliasing mode, each GPU renders the same frame, except that the two renderings are offset by half of a pixel. Before GPU 104 scans out data from its frame buffer 108, GPU 114 pushes the results of its anti-aliasing operation from its frame buffer 118 to a temporary buffer in a direct memory access (“DMA”) copy operation. GPU 104 then pulls the data from the temporary buffer and combines the data with the content of its frame buffer 108. These two operations are commonly referred to as a “pull and blend” operation. In one implementation, the temporary buffer resides in video memory 106 and is denoted as temporary buffer 134 in FIG. 1. The push operation of GPU 114 to transfer data to temporary buffer 134 and the subsequent pull and blend operation involving the transferred data in temporary buffer 134 all involve the inefficient data transfer path as discussed above.
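The push followed by the pull and blend can be sketched as follows. This is a minimal Python illustration in which buffers are modeled as flat lists of pixel intensities; the function names are hypothetical, and the equal-weight average used for blending is an assumption, since this section does not specify how the two results are combined.

```python
from typing import List


def push(src_frame_buffer: List[float]) -> List[float]:
    """Model the DMA copy from frame buffer 118 into temporary buffer 134."""
    return list(src_frame_buffer)


def pull_and_blend(frame_buffer: List[float], temp_buffer: List[float]) -> List[float]:
    """Pull data from the temporary buffer and combine it with frame buffer 108.

    An equal-weight average is assumed here for illustration only.
    """
    if len(frame_buffer) != len(temp_buffer):
        raise ValueError("buffers must have the same number of pixels")
    return [(a + b) / 2 for a, b in zip(frame_buffer, temp_buffer)]


if __name__ == "__main__":
    frame_buffer_108 = [0.0, 1.0, 0.5]   # rendered by GPU 104
    frame_buffer_118 = [1.0, 1.0, 0.0]   # rendered by GPU 114, half-pixel offset
    temporary_buffer_134 = push(frame_buffer_118)
    print(pull_and_blend(frame_buffer_108, temporary_buffer_134))
```

In graphics system 100, both the copy modeled by `push` and the read modeled by `pull_and_blend` cross the slower inter-GPU path rather than a local memory interface.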
Furthermore, because graphics system 100 includes two identical graphics adapters, graphics system 100 is at least twice as costly as a system with a single graphics adapter. However, a graphics application that runs on graphics system 100 recognizes only one graphics adapter and thus only one set of resources. For example, suppose each of video memories 106 and 116 has a size of 256 MB. The graphics application still only recognizes and utilizes 256 MB of video memory as opposed to 512 MB. The content of one video memory, especially the texture render targets, is duplicated across all the graphics adapters so that the GPUs can carry out the texturing operations locally. In short, the additional resources in a multi-GPU system like graphics system 100 are not fully utilized.
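The underutilization can be expressed numerically. The following minimal Python sketch assumes full duplication of video memory content across adapters, as described above; the function names are illustrative only.

```python
def installed_memory_mb(per_adapter_mb: int, adapters: int) -> int:
    """Total video memory physically present in the system."""
    return per_adapter_mb * adapters


def visible_memory_mb(per_adapter_mb: int) -> int:
    """Video memory the application recognizes.

    With content duplicated on every adapter, the application sees
    only one adapter's worth of memory.
    """
    return per_adapter_mb


def memory_utilization(per_adapter_mb: int, adapters: int) -> float:
    """Fraction of installed video memory that holds unique content."""
    return visible_memory_mb(per_adapter_mb) / installed_memory_mb(per_adapter_mb, adapters)


if __name__ == "__main__":
    # Video memories 106 and 116, 256 MB each.
    print(installed_memory_mb(256, 2))   # 512
    print(visible_memory_mb(256))        # 256
    print(memory_utilization(256, 2))    # 0.5
```

Under this model, effective utilization falls as 1/N for N adapters, so each added adapter contributes cost without contributing usable video memory.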
In addition, unless the load balancing between the two GPUs is done flawlessly, the GPUs are unlikely to always complete their assigned tasks at the same time. Also, the GPUs need to be synchronized so that the proper frame is displayed at the appropriate time. As has been shown, the inadequate utilization of the additional resources as discussed above and the overhead of performing load balancing, synchronization, and data exchanges via inefficient data transfer paths between the graphics adapters all reduce the overall throughput of graphics system 100 to less than twice that of a single graphics adapter.
As the foregoing illustrates, what is needed is a way to further increase the overall throughput of a multi-GPU system while reducing its cost.