1. Field of the Invention
The present invention generally relates to memory allocation and access and, more specifically, to sharing of memory by cooperating asymmetric coprocessors.
2. Description of the Related Art
Typical parallel processing subsystems include at least one parallel processing unit (PPU) that may be configured to beneficially provide a high volume of computational throughput that is impractical to achieve with a single processing unit, such as a conventional central processing unit (CPU). The PPU acts as a coprocessor to a CPU and may be configured to incorporate a plurality of processing cores, each capable of executing one or more instances of a parallel program on a plurality of processing engines, which collectively provide the high computational throughput. The PPU may be further configured to include one or more local memory subsystems, such as an external dynamic random access memory (DRAM) subsystem that is locally attached to the PPU and is locally controlled by the PPU.
In a typical scenario, a user application may employ a CPU and the PPU to each perform a portion of the computations required by the user application. The PPU commonly performs high volume computations, while the CPU performs more complex operations as well as overall housekeeping for the user application. The user application conventionally includes one or more CPU threads executing on a CPU, and a plurality of threads executing on one or more processing cores within the PPU that perform computations for the CPU threads. During the normal course of execution, the CPU threads generate data to be processed by the PPU and the PPU generates data to be stored, viewed by a user, or processed further by the CPU threads. For example, the CPU threads may receive image data from a video camera and transmit the image data to the PPU for processing. The PPU may then perform simple but computationally intensive processing on the image data to generate a data set to be transmitted back to the CPU for more complex analysis. The data set may include candidate regions for edge detection, candidate regions for motion detection, or any other image features useful to the user application. In certain applications, the image data is processed into the data set in real time, and the latency associated with transferring the data set to the CPU is critical to overall system performance.
Conventional PPU systems typically provide a mechanism for mapping pages of the local PPU memory into a virtual address space associated with the CPU threads, enabling efficient access to local PPU memory by CPU threads. Mapping the local PPU memory into the application space of a CPU thread, for example, allows the CPU threads to compose and efficiently transmit image data to the PPU for processing. In applications where data primarily flows from the CPU to the PPU, this mapping provides an acceptable performance optimization. However, in applications where PPU threads are composing and transmitting data to system memory associated with the CPU, this mapping is frequently inefficient because data bound for the CPU thread needs to be copied at least once while in transit. For example, when the PPU processes image data, the resulting data set must be generated in PPU memory because the PPU memory is the only virtual address space known to the PPU threads. Once complete, the data set may be copied to system memory, and this extra copy adds processing latency.
The PPU memory is allocated to the PPU, while all other memory in the system is referred to as host memory. This abstraction model allows for code portability across multiple memory configurations. For example, when NVIDIA's CUDA™ (Compute Unified Device Architecture) programming model is used with a graphics card that includes a discrete graphics processing unit (GPU) configured as a coprocessor, the DRAM resident on the graphics card is the PPU memory and the system memory is the host memory. When an integrated GPU that does not include device memory is configured as a coprocessor, one portion of the system memory is designated as PPU memory for exclusive use by the PPU and is effectively removed from the system memory.
In a typical design paradigm, a CUDA™ application can be divided into three primary steps. During the initialization step, the program first allocates buffers in both host and PPU memory and then copies inputs from host memory to PPU memory. Once the inputs have been successfully copied, the compute step processes the input data using highly parallel algorithms on the GPU. After the compute step has completed, the teardown step copies the results from PPU memory to host memory and presents the results to the application. Depending on the application, this copy-compute-copy cycle may repeat multiple times over the course of a single run. Optimizing this cycle is a major point of application tuning for most CUDA™ programs because, given memory and interconnect bandwidths and latencies, copying data to and from the compute device can frequently dominate the application runtime.
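The cost structure of the copy-compute-copy cycle described above can be illustrated with a small back-of-the-envelope model. The following Python sketch is illustrative only: the names `CycleCost` and `copy_compute_copy`, the simple "latency plus bytes divided by bandwidth" transfer model, and all numeric values are assumptions introduced here, not measurements or part of any CUDA™ interface.

```python
from dataclasses import dataclass


@dataclass
class CycleCost:
    """Time, in seconds, spent in each step of one copy-compute-copy cycle."""
    copy_in_s: float    # initialization: host memory -> PPU memory
    compute_s: float    # compute step on the PPU
    copy_out_s: float   # teardown: PPU memory -> host memory

    @property
    def total_s(self) -> float:
        return self.copy_in_s + self.compute_s + self.copy_out_s

    @property
    def copy_fraction(self) -> float:
        """Fraction of the cycle spent moving data rather than computing."""
        return (self.copy_in_s + self.copy_out_s) / self.total_s


def copy_compute_copy(input_bytes: float, output_bytes: float,
                      link_bytes_per_s: float, link_latency_s: float,
                      compute_s: float) -> CycleCost:
    """Model one cycle; each transfer costs a fixed latency plus bytes / bandwidth.

    This linear transfer model is a simplifying assumption.
    """
    copy_in = link_latency_s + input_bytes / link_bytes_per_s
    copy_out = link_latency_s + output_bytes / link_bytes_per_s
    return CycleCost(copy_in, compute_s, copy_out)


if __name__ == "__main__":
    # Hypothetical figures chosen only to exercise the model.
    cost = copy_compute_copy(
        input_bytes=64 * 2**20,    # 64 MiB of input
        output_bytes=64 * 2**20,   # 64 MiB of results
        link_bytes_per_s=4e9,      # assumed interconnect bandwidth
        link_latency_s=10e-6,      # assumed per-transfer latency
        compute_s=5e-3,            # assumed kernel execution time
    )
    print(f"copy fraction of one cycle: {cost.copy_fraction:.2f}")
```

Under this model, whenever the compute step is short relative to the two transfers, the copy fraction approaches one, which is the situation in which the copies dominate the application runtime.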
Accordingly, what is needed in the art is a system and method for reducing copying of data between memory allocated to the CPU and the PPU.