Field of the Invention
Embodiments of the present invention relate generally to systems with multiple processing entities and, more particularly, to managing copy operations in complex processor topologies.
Description of the Related Art
A conventional copy engine is a hardware unit that copies data from one location to another location. A graphics processing unit (GPU) may include several such copy engines, some configured to “push” local data to external locations and others configured to “pull” data from external locations into local memory.
For example, a GPU could include a first copy engine configured to copy data from the frame buffer within the GPU to system memory associated with a central processing unit (CPU). The GPU could also include a second copy engine configured to copy data from the system memory of the CPU into the frame buffer of the GPU. In another configuration, the GPU could include just one copy engine configured to “push” data to the CPU, and the CPU could include one copy engine configured to “push” data to the GPU. In other configurations, the GPU could include one or more copy engines, each configured to both “push” and “pull” data. In such configurations, if a copy engine were instructed to perform both types of copy operations, communication link bandwidth may not be efficiently utilized. Generally, a device driver executing on the GPU manages the operation of the copy engine(s) associated with the GPU.
In simple processor topologies such as those described above, the GPU and CPU are coupled together via a communication link, such as a Peripheral Component Interconnect Express (PCIe) link. Each copy engine is assigned a dedicated channel of the communication link and configured to perform copy operations across that channel. With two copy engines assigned to two different channels and configured to “push” and “pull” data, respectively, the GPU can implement a bidirectional communication link with a CPU. The bandwidth with which copy operations may be performed across that link depends on the native speed of the communication link channels. In order to increase copy bandwidth across the communication link, the number of copy engines may be increased, and an additional communication link channel may be assigned to each additional copy engine.
For example, in the exemplary topology described above, the GPU could include two copy engines configured to “push” data to the CPU across two communication link channels, and two copy engines configured to “pull” data from the CPU across two additional communication link channels, thereby doubling the copy bandwidth compared to the previously described configuration. The device driver executing on the GPU would need to manage the copy operations performed by all four copy engines, and potentially load balance copy operations across the associated channels.
In a more complex processor topology, a CPU may be coupled to multiple GPUs that, in turn, may be coupled to one another, or to a single GPU that includes multiple processing entities that, in turn, may be coupled to one another. For example, the CPU could be coupled to two GPUs via PCIe links, while each GPU could be coupled to the other GPU via a chip-to-chip communication link, such as an NVLink High Speed Interconnect. Each GPU in this configuration could include four copy engines: a first copy engine to “push” data to the CPU, a second copy engine to “pull” data from the CPU, a third copy engine to “push” data to the other GPU, and a fourth copy engine to “pull” data from the other GPU.
Other configurations of copy engines are possible in the exemplary topology described above. However, as a general matter, to support bidirectional copying between any two processors, at least two copy engines are needed. Further, to increase copy bandwidth between processors or processing entities, additional copy engines are needed and additional communication link channels must be assigned to those additional copy engines. The corresponding device driver must manage the additional copy engines and load balance copy operations across all relevant channels.
One drawback of the approach described above is that highly complex processor topologies are becoming increasingly common, but sufficient copy engines cannot be included within each processor to support high-bandwidth copy operations between neighboring processors. For example, multiple CPUs could be coupled to vast arrays of interconnected GPUs. Using the above approach, each processor would need a different copy engine for each channel across which copy operations are to be performed, potentially requiring an inordinate number of copy engines. Additionally, copy engines are hardware units, and processors generally cannot include more than a handful of such units without increasing the size of the processors beyond acceptable limits. Consequently, the complexity of processor topologies can be substantially limited by the inability to include sufficient numbers of copy engines in processors.
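The scaling problem described above can be illustrated with a minimal sketch, assuming one copy engine per channel as in the conventional approach. The function name, the neighbor labels, and the channel counts in the larger topology are hypothetical and chosen only for illustration; only the two-GPU example corresponds to a configuration described above.

```python
def engines_required(channels_by_neighbor):
    """Return the copy-engine count when every channel needs its own engine."""
    return sum(channels_by_neighbor.values())


# Two-GPU example from above: one "push" and one "pull" channel to the CPU,
# and one "push" and one "pull" channel to the other GPU.
two_gpu_example = {"cpu": 2, "peer_gpu": 2}

# Hypothetical larger topology (an assumption, not taken from the text):
# a CPU link plus eight neighboring GPUs with four channels each.
dense_example = {"cpu": 2, **{f"gpu{i}": 4 for i in range(8)}}

print(engines_required(two_gpu_example))  # 4
print(engines_required(dense_example))    # 34
```

Under these assumptions, the per-processor engine count grows linearly with the total number of channels, which is why a handful of hardware units quickly becomes insufficient.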
Another drawback of the above approach is that, because the device driver on each GPU must manage all copy engines in that GPU, the driver must also load balance copy operations across those copy engines. That load balancing must occur in a manner that depends on the unique processor topology. For example, if a GPU is coupled to one neighboring GPU via four channels of a communication link and coupled to another neighboring GPU via six channels of a communication link, the driver must account for these link width differences when load balancing copy operations that involve those neighboring processors. Since GPUs may be coupled together according to a wide variety of different topologies with widely varying link widths, the driver must be preprogrammed to account for all such potential topologies, many of which can be highly complex. If the driver is not preprogrammed to handle a specific topology, then copying functionality may be limited or unavailable for that topology. Consequently, the driver must be exceedingly complex. As is well known, highly complex software inevitably creates maintainability, portability, and scalability issues.
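The topology dependence described above can be sketched as follows. This is a simplified illustration, not an actual driver implementation: the table `LINK_WIDTHS`, the neighbor labels, and both function names are hypothetical, and the sketch assumes a copy is simply divided evenly across the channels of the relevant link.

```python
# Hypothetical preprogrammed topology table: neighbor -> number of link channels.
LINK_WIDTHS = {"gpu_a": 4, "gpu_b": 6}


def split_copy(total_bytes, num_channels):
    """Divide a copy evenly across channels; the first channels absorb any remainder."""
    base, extra = divmod(total_bytes, num_channels)
    return [base + (1 if i < extra else 0) for i in range(num_channels)]


def plan_copy(neighbor, total_bytes):
    """Return per-channel byte counts for a copy to the given neighbor."""
    if neighbor not in LINK_WIDTHS:
        # A topology the driver was not preprogrammed for cannot be balanced.
        raise KeyError(f"no link width known for {neighbor!r}")
    return split_copy(total_bytes, LINK_WIDTHS[neighbor])


print(plan_copy("gpu_a", 1200))  # [300, 300, 300, 300]
print(plan_copy("gpu_b", 1200))  # [200, 200, 200, 200, 200, 200]
```

The same 1200-byte copy is divided differently depending on the neighbor, and a neighbor missing from the table cannot be served at all, which mirrors the limitation described for topologies the driver does not anticipate.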
As the foregoing illustrates, what is needed in the art is a more effective approach for managing copy operations in complex processor topologies.