1. Field of the Invention
Embodiments of the present invention generally relate to multiple graphics processing unit (GPU) systems and more specifically to hierarchical memory addressing.
2. Description of Related Art
Commercial graphics processing unit (GPU) computation systems commonly configure a cluster of multiple GPU devices to operate in concert, for example, to solve a single problem. In such systems, each GPU device typically executes instructions to solve a portion of the problem and communicates intermediate results with other GPU devices as execution progresses. A local memory may be coupled to each GPU device for local program and data storage. Each local memory is conventionally accessed via an independent, local address space associated with the corresponding GPU. Each GPU may comprise multiple processing cores, and each core commonly implements a cache for efficient access to data that is relevant to an ongoing computation. Each local memory and each cache associated with a given GPU is conventionally configured to be exclusively accessed by the GPU. Each GPU may be configured to access a common system memory for communicating with a host central processing unit (CPU). The CPU may transmit data to the GPU via the system memory and receive data from the GPU via the system memory.
In a conventional cluster of multiple GPU devices, one GPU transmits data, such as intermediate results, to another GPU using a technique involving at least two copy operations and a temporary buffer in system memory. While technically feasible, this technique makes inefficient use of system resources such as bandwidth and memory. Furthermore, each transmitting GPU must execute programming instructions to bundle and transmit outbound data, while each receiving GPU must execute programming instructions to receive and unbundle the data. The overall process makes inefficient use of GPU resources, further reducing overall system efficiency. Additionally, each operation for transmitting a unit of data from one GPU to another GPU typically requires explicit programming instructions to be written by a developer, making the application development process inefficient with respect to developer time and attention.
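By way of illustration only, the conventional transfer described above may be modeled as in the following sketch. The sketch is a simplified host-side simulation, not an implementation of any particular GPU interface. The names Gpu, staged_transfer, and staging are hypothetical and are introduced solely for this example, and each memory is assumed to be representable as a byte array.

```python
# Hypothetical model of the conventional staged transfer.
# All names here are illustrative assumptions, not a real GPU API.

class Gpu:
    """A GPU with a private local memory, modeled as a bytearray."""

    def __init__(self, name, local_size):
        self.name = name
        self.local_memory = bytearray(local_size)


def staged_transfer(src, src_offset, dst, dst_offset, length, system_memory):
    """Move `length` bytes from src to dst through a temporary buffer.

    Returns the number of copy operations performed.
    """
    if length > len(system_memory):
        raise ValueError("temporary buffer in system memory is too small")
    if src_offset < 0 or src_offset + length > len(src.local_memory):
        raise ValueError("source range exceeds sender's local memory")
    if dst_offset < 0 or dst_offset + length > len(dst.local_memory):
        raise ValueError("destination range exceeds receiver's local memory")

    copies = 0
    # Copy 1: sending GPU's local memory -> temporary buffer in system memory.
    system_memory[:length] = src.local_memory[src_offset:src_offset + length]
    copies += 1
    # Copy 2: temporary buffer in system memory -> receiving GPU's local memory.
    dst.local_memory[dst_offset:dst_offset + length] = system_memory[:length]
    copies += 1
    return copies


gpu0 = Gpu("gpu0", 16)
gpu1 = Gpu("gpu1", 16)
staging = bytearray(8)  # temporary buffer in system memory

gpu0.local_memory[0:4] = b"\x01\x02\x03\x04"  # intermediate results on gpu0
copies = staged_transfer(gpu0, 0, gpu1, 8, 4, staging)
```

In this model, every transfer of a unit of data crosses the system memory twice and occupies staging space for its duration, and the call to staged_transfer stands in for the explicit instructions a developer must write for each such transfer.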
As the foregoing illustrates, what is needed in the art is a technique that facilitates more efficient communication between GPU devices.