As computers and other digital systems become more complex and more capable, methods and hardware to enhance the transfer of data between system components or elements continually evolve. Data to be transferred include signals representing data, commands, or any other signals. Speed and efficiency of data transfer are particularly critical in systems that run very data-intensive applications, such as graphics applications. In typical systems, graphics processing capability is provided as a part of the central processing unit (CPU) capability, or is provided by a separate special-purpose processor such as a graphics processing unit (GPU) that communicates with the CPU and assists in processing graphics data for applications such as video games. One or more GPUs may be included in a system. In conventional multi-GPU systems, traffic between GPUs crosses a bridged host interface (for example, a Peripheral Component Interconnect Express (PCIe®) interface) and must share that interface's bandwidth with host traffic. FIG. 1 is a block diagram of a prior art system 100 that includes a root processor 108 (such as a CPU), a bridge 106, and two endpoints EP0 104a and EP1 104b (such as GPU0 and GPU1). Each endpoint 104 is coupled to a respective memory component 102a or 102b.
In order to take advantage of the existence of more than one GPU in a system, the multiple GPUs share processing tasks, which requires transfer of data between them. A challenge in achieving good performance scaling in a multi-GPU system is that a potentially large amount of data must go back and forth between the GPUs. In an example situation, the application calls for the drawing of a background or some other texture that takes a long time to draw. It would be desirable to copy the background or other texture to another GPU so that it can be used in the next frame or in another part of the same frame. In system 100, the only path for data transfer between the GPUs 104 is through respective paths 105a and 105b to the bridge 106. The path into the root 108 must carry the traffic from each GPU 104 to the root 108, as well as the traffic between the two GPUs 104. This can create a bottleneck because the same physical wires are shared by all of this traffic.
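The bottleneck described above can be illustrated with a short numerical sketch. The following Python fragment is illustrative only and is not part of system 100: the function name `share_link`, the proportional-sharing policy, and all bandwidth figures are assumptions chosen for the example. It models a single shared path that carries both GPU-to-root traffic and GPU-to-GPU traffic, and shows how each flow is throttled when their combined demand exceeds the capacity of the path.

```python
def share_link(capacity_gbps, demands_gbps):
    """Divide one shared link among competing flows.

    Assumption: when total demand exceeds capacity, every flow is scaled
    down by the same factor (proportional sharing). Real bridges and host
    interfaces may arbitrate differently.
    """
    total = sum(demands_gbps.values())
    if total <= capacity_gbps:
        # The link is not saturated, so every flow gets what it asked for.
        return dict(demands_gbps)
    scale = capacity_gbps / total
    return {name: demand * scale for name, demand in demands_gbps.items()}


# Hypothetical figures: an 8 GB/s shared path into the root, two host flows,
# and one GPU-to-GPU texture copy that must use the same path.
demands = {"gpu0_to_root": 2.0, "gpu1_to_root": 2.0, "gpu0_to_gpu1": 6.0}
granted = share_link(8.0, demands)

for name, bandwidth in granted.items():
    print(f"{name}: requested {demands[name]:.1f} GB/s, granted {bandwidth:.2f} GB/s")
```

Under these assumed figures the combined demand exceeds the capacity of the shared path, so both the host traffic and the inter-GPU copy receive less bandwidth than they request. A path dedicated to inter-GPU traffic would remove the copy from this contention.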