1. Field of the Invention
The present invention generally relates to general purpose computing and, more specifically, to techniques for assigning priorities to memory copies.
2. Description of the Related Art
A typical parallel processing subsystem, which may include one or more graphics processing units (GPUs), is capable of very high performance using a relatively large number of small, parallel execution threads on dedicated programmable hardware processing units. The specialized design of such parallel processing subsystems usually allows these subsystems to efficiently perform certain tasks, such as rendering 3-D scenes or computing the product of two matrices, using a high volume of concurrent computational and memory operations.
To fully realize the processing capabilities of advanced parallel processing subsystems, subsystem functionality may be exposed to application developers through one or more application programming interfaces (APIs) of calls and libraries. Among other things, doing so enables application developers to tailor a software application executing on a central processing unit (CPU) to optimize the way parallel processing subsystems function. In one approach to developing a software application, the software application developer may divide work included in the software application into streams of work components (e.g., computational and memory operations). Each stream may be executed concurrently on the parallel processing subsystem. Notably, work components included in different streams may run concurrently and may be interleaved. In contrast, within each stream, a sequence of work components executes in issue-order on the parallel processing subsystem.
Different types of parallel processing subsystem resources operate on different types of work components. For example, compute engines execute computational work components, and copy engines execute memory copies. Parallel processing subsystems are typically configured to receive work components via hardware channels, with each hardware channel dedicated to an appropriate type of work component. Acting as a liaison between the API and the host scheduler, an API driver aliases the work components submitted in each stream onto one or more available hardware channels. A host scheduler included in the parallel processing subsystem receives the work components conveyed through the hardware channels and, subsequently, schedules the work components to execute on appropriate resources. In particular, the API driver distributes memory copies included in various streams to copy hardware (HW) channels which are configured to convey memory copies to the host scheduler. Upon receiving memory copies via the copy HW channels, the host scheduler distributes the memory copies between one or more copy engines.
In one approach to scheduling memory copies, the host scheduler allocates a discrete time slice to each copy HW channel and executes the memory copies within each copy HW channel in issue-order. In other words, when executing memory copies included in a particular copy HW channel, the host scheduler selects the memory copy that was issued least recently. For example, suppose that the parallel processing subsystem were to include two copy HW channels and one copy engine. Further, suppose that the host scheduler were to initially direct the copy engine to begin executing the memory copies included in the first copy HW channel in issue-order. Finally, suppose that the copy engine were not able to complete all of the memory copies included in the first copy HW channel before the time slice of the first copy HW channel expired. When the time slice expired, the host scheduler would wait for any currently executing memory copy to complete and then begin executing memory copies included in the second copy HW channel (in issue-order). The host scheduler would continue to switch between the two copy HW channels in a similar manner.
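The time-sliced, issue-order behavior described above can be illustrated with a small simulation. This is a minimal sketch and not the scheduler of any actual parallel processing subsystem: the function name `simulate_time_sliced_copies`, the representation of a copy as a `(name, duration)` pair, and the abstract time units are all assumptions made for illustration.

```python
from collections import deque


def simulate_time_sliced_copies(channels, time_slice):
    """Simulate one copy engine serving several copy HW channels.

    channels:   list of channels, each a list of (name, duration) pairs
                in issue-order (hypothetical representation).
    time_slice: time units granted to a channel before the scheduler
                switches to the next channel.

    Returns a list of (channel_index, name, start, end) tuples in the
    order the copies execute.
    """
    queues = [deque(channel) for channel in channels]
    timeline = []
    now = 0
    current = 0
    while any(queues):
        queue = queues[current]
        slice_end = now + time_slice
        # Within a channel, always pick the least recently issued copy.
        # A copy that has started is never preempted: if the slice expires
        # mid-copy, the copy finishes before the scheduler switches channels.
        while queue and now < slice_end:
            name, duration = queue.popleft()
            timeline.append((current, name, now, now + duration))
            now += duration
        # Move to the next channel; an empty channel consumes no time.
        current = (current + 1) % len(queues)
    return timeline


# Two copy HW channels and one copy engine, as in the example above.
first_channel = [("A0", 3), ("A1", 3), ("A2", 3)]
second_channel = [("B0", 2), ("B1", 2)]
timeline = simulate_time_sliced_copies([first_channel, second_channel], time_slice=4)
for entry in timeline:
    print(entry)
```

With these assumed durations, copy A1 starts before the first time slice expires and runs to completion at time 6, after which the second channel's copies execute; copy A2 does not start until time 10, even though it was issued before either copy in the second channel was serviced.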
One drawback to the above approach to scheduling memory copies is that executing memory copies in strict issue-order subject to the time-slice constraints of the copy HW channels may cause software applications to execute with unacceptable latency and throughput. More specifically, many software applications include multiple computational operations that can execute in parallel. These computational operations often have dependencies on memory copies which would, optimally, execute simultaneously. However, since the bandwidth to copy from system memory to GPU memory and from GPU memory to system memory is limited, one or more memory copies may experience undesirable delays. In particular, latency-sensitive computational operations may get blocked waiting for related memory copies to execute. For example, suppose that a software application were to be performing video decoding and encoding using a pipelined workflow. Further, suppose that the parallel processing subsystem were to include a single copy engine. Finally, suppose that the copy engine were to be occupied performing memory copies associated with the first few stages of processing a fifth frame. The memory copies associated with the processing of a fourth frame by the last stage in the pipeline could be delayed. Consequently, the overall latency of the fourth frame could increase, causing jitter in frame rates.
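The delay in the video pipeline example can be made concrete with a short calculation. This is an illustrative sketch only: the copy names, the durations, and the helper `completion_times` are hypothetical, and the sketch assumes all four copies sit in a single copy HW channel served by a single copy engine in strict issue-order.

```python
def completion_times(copies):
    """Return {name: finish_time} for copies run back to back in issue-order.

    copies: list of (name, duration) pairs, least recently issued first
            (hypothetical representation, abstract time units).
    """
    now = 0
    finished = {}
    for name, duration in copies:
        now += duration
        finished[name] = now
    return finished


# Assumed workload: three large copies for the early stages of frame 5 were
# issued just before one small copy for the last stage of frame 4.
issued = [
    ("frame5/stage1", 4),
    ("frame5/stage2", 4),
    ("frame5/stage3", 4),
    ("frame4/last_stage", 1),
]
finished = completion_times(issued)

# Time the frame 4 copy spends waiting behind the frame 5 copies.
frame4_duration = 1
frame4_wait = finished["frame4/last_stage"] - frame4_duration
print(finished["frame4/last_stage"], frame4_wait)
```

Under these assumed numbers, the one-unit copy for the fourth frame finishes at time 13 after waiting 12 units behind the fifth frame's copies, which is the kind of delay that shows up as frame-rate jitter.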
As the foregoing illustrates, what is needed in the art is a more effective technique for scheduling memory copies submitted to a parallel processing subsystem for processing.