Current computer applications are more graphically intense and involve a higher degree of graphics processing power than their predecessors. Applications such as games typically involve complex and highly detailed graphics renderings that involve a substantial amount of ongoing computations. To match the demands made by consumers for increased graphics capabilities in computing applications, such as games, computer configurations have also changed.
As computers, particularly personal computers, have been programmed to handle ever-increasing demanding entertainment and multimedia applications, such as high definition video and the latest 3-D games, increasing demands have been placed on system bandwidth. To meet these changing requirements, methods have arisen to deliver the bandwidth needed for current bandwidth hungry applications, as well as providing additional headroom, or bandwidth, for future generations of applications.
This increase in bandwidth has been realized in recent years in the bus system of the computer's motherboard. A bus is comprised of conductors that are hardwired onto a printed circuit board that comprises the computer's motherboard. A bus may be typically split into two channels, one that transfers data and one that manages where the data has to be transferred. This internal bus system is designed to transmit data from any device connected to the computer to the processor and memory. where the data has to be transferred. This internal bus system is designed to transmit data from any device connected to the computer to the processor and memory.
One bus system is the PCI bus, which was designed to connect I/O (input/output) devices with the computer. PCI bus accomplished this connection by creating a link for such devices to a south bridge chip with a 32-bit bus running at 33 MHz.
The PCI bus was designed to operate at 33 MHz and therefore able to transfer 133 MB/s, which is recognized as the total bandwidth. While this bandwidth was sufficient for early applications that utilized the PCI bus, applications that have been released more recently have suffered in performance due to this relatively narrow bandwidth.
More recently, a new interface known as AGP, Advanced Graphics Port, was introduced for 3-D graphics applications. Graphics cards coupled to computers via an AGP 8X link realized bandwidths approximately at 2.1 GB/s, which was a substantial increase over the PCI bus described above.
Even more recently, a new type of bus has emerged with an even higher bandwidth over both PCI and AGP standards. A new standard, which is known as PCI Express, is typically known to operate at 2.5 GB/s, or 250 MB/s per lane in each direction, thereby providing a total bandwidth of 10 GB/s in a 20-lane configuration. PCI Express (which may be abbreviated herein as “PCIe”) architecture is a serial interconnect technology that is configured to maintain the pace with processor and memory advances. As stated above, bandwidths may be realized in the 2.5 GHz range using only 0.8 volts.
At least one advantage with PCI Express architecture is the flexible aspect of this technology, which enables scaling of speeds. When combining the links to form multiple lanes, PCIe links can support x1, x2, x4, x8, x12, x16, and x32 lane widths. Nevertheless, in many desktop applications, motherboards may be populated with a number of x1 lanes and/or one or even two x16 lanes for PCIe compatible graphics cards.
FIG. 1 is a nonlimiting exemplary diagram 10 of at least a portion of a computing system, as one of ordinary skill in the art would know. In this partial diagram of a computing system 10, a central processing unit, or CPU 12, may be coupled by a communication bus system, such as the PCIe bus described above. In this case, a north bridge chip 14 and south bridge chip 16 may be interconnected by various types of high-speed paths 18 and 20 with the CPU and each other in a communication bus bridge configuration.
As a nonlimiting example, one or more peripheral devices 22a-22d may be coupled to north bridge chip 14 via an individual pair of point-to-point data lanes, which may be configured as x1 communication paths 24a-24d, as described above. Likewise, a south bridge chip 16, as known in the art, may be coupled by one or more PCIe lanes 26a and 26b to peripheral devices 28a and 28b, respectively.
A graphics processing device 30 (which may hereinafter be referred to as GPU 30) may be coupled to the north bridge chip 14 via a PCIe 1×16 link 32, which essentially may be characterized as 16×1 PCIe links, as described above. Under this configuration, the 1×16 PCIe link 32 may be configured with a bandwidth of approximately 4 GB/s.
Even with the advent of PCIe communication paths and other high bandwidth links, graphics applications have still reached limits at times due to the processing capabilities of the processors on devices such as GPU 30 in FIG. 1. For that reason, computer manufacturers and graphics manufacturers have sought solutions that add a second graphics processing unit to the hardware configuration to further assist in the rendering of complicated graphics in applications such as 3-D games and high definition video, etc. However, in applications involving multiple GPUs, methods of inter-GPU communication have posed numerous problems for hardware designers.
FIG. 2 is an alternate embodiment computer 34 of the computer 10 of FIG. 1. In this nonlimiting example of FIG. 2, graphics processing operations are handled by both GPU 30 and GPU 36, which are coupled via PCIe links 33 and 38, respectively. As a nonlimiting example, each of PCIe links 33 and 38 may be configured as x8 links. However, in this nonlimiting example, GPUs 30 and 36 should be configured so as to communicate with each other so as not to duplicate efforts and to also handle all graphics processing operations in a timely manner.
Thus, in one nonlimiting application, GPU 30 and GPU 36 should be configured to operate in harmony with each other. In at least one nonlimiting example, as shown in FIG. 2, computer 34 may be configured such that GPUs 30 and 36 communicate with each other via system memory 42, which itself may be coupled to north bridge chip 14 via links 44 and 47, which may be x1 links, as similarly described above. In this configuration, GPU 30 may communicate with GPU 36 via link 33 to north bridge chip 14, which may forward communications to system memory via link 44. Communications may thereafter be routed back through north bridge chip 14 via communication path 47 and on to GPU 36 via x8 PCIe link 38. In this configuration, each of GPU 30 and 36 may share x8 PCIe bandwidth via links 33 and 38, thereby consuming some of the bandwidth that may otherwise be used for graphics rendering. Also, inter-GPU traffic may suffer long latency times in this nonlimiting example due to the routing through north bridge chip 14 and the system memory 42. Furthermore, this configuration may suffer from extra system memory traffic.
FIG. 3 is yet another nonlimiting approach for a computer 40 to support multiple GPUs 30 and 36, as described above. In this nonlimiting example, north bridge chip 14 may be configured to support GPU 30 and GPU 36 via an 8-lane PCIe link 33 and another 8-lane PCIe link 38 coupled to GPUs 30 and 36, respectively. In this nonlimiting example, north bridge chip 14 may be configured to support port-to-port communications between GPUs 30 and 36. To realize this configuration, north bridge chip 14 may be configured with an additional number of gates, thereby decreasing the performance of north bridge chip 14. Plus, inter-GPU traffic may suffer from medium to substantial latencies for communications that travel between GPU 30 and 36, respectively. Thus, this configuration for computer 40 is also not desirable and optimal.
Thus, there is a heretofore-unaddressed need to overcome the deficiencies and shortcomings described above.