Field of the Invention
The present invention generally relates to general purpose computing and, more specifically, to techniques for assigning priorities to streams of work.
Description of the Related Art
A typical parallel processing subsystem, which may include one or more graphics processing units (GPUs), is capable of very high performance using a relatively large number of small, parallel execution threads on dedicated programmable hardware processing units. The specialized design of such parallel processing subsystems usually allows these subsystems to efficiently perform certain tasks, such as rendering 3-D scenes or computing the product of two matrices, using a high volume of concurrent computational and memory operations.
To fully realize the processing capabilities of advanced parallel processing subsystems, subsystem functionality may be exposed to application developers through one or more application programming interfaces (APIs) of calls and libraries. Among other things, doing so enables application developers to tailor their software application to optimize the way parallel processing subsystems function. In one approach to developing a software application, the software application developer may implement an algorithm by dividing the work included in the algorithm into streams of work components (e.g., computational and memory operations) that may be executed in parallel on the parallel processing subsystem. Within each stream, a sequence of work components executes in issue-order on the parallel processing subsystem. In contrast, work components included in different streams may run concurrently and may be interleaved.
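By way of illustration only, the ordering rule described above (issue-order within a stream, free interleaving across streams) may be sketched in Python as follows. The function name `respects_stream_order`, the stream labels, and the component names are hypothetical and do not correspond to any particular API. The sketch also assumes that an execution can be summarized as a single list recording the order in which work components start.

```python
from typing import Dict, List


def respects_stream_order(streams: Dict[str, List[str]], trace: List[str]) -> bool:
    """Return True if `trace` starts every component exactly once and keeps
    the components of each stream in their issue order.

    `streams` maps a hypothetical stream label to its components in issue order.
    `trace` lists component names in the order in which they start executing.
    """
    issued = [name for components in streams.values() for name in components]
    if sorted(issued) != sorted(trace):
        return False  # a component is missing, duplicated, or unknown

    position = {name: index for index, name in enumerate(trace)}
    for components in streams.values():
        indices = [position[name] for name in components]
        if indices != sorted(indices):
            return False  # two components of one stream ran out of issue order
    return True


streams = {"s0": ["a0", "a1", "a2"], "s1": ["b0", "b1"]}

# Interleaving across streams is allowed.
interleaved_ok = respects_stream_order(streams, ["a0", "b0", "a1", "b1", "a2"])

# Reordering within a stream is not allowed.
reordered_ok = respects_stream_order(streams, ["a1", "a0", "a2", "b0", "b1"])
```

In this sketch, `interleaved_ok` is `True` and `reordered_ok` is `False`, reflecting that only the relative order of work components within the same stream is constrained.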
In one approach to scheduling work components, a scheduler within the parallel processing subsystem allocates parallel processing subsystem resources in discrete time slices to work components included in concurrent streams. When allocating a particular parallel processing subsystem resource, the scheduler typically selects the appropriate work component in issue-order. In other words, the scheduler selects the work component that was issued least recently from the set of work components that may be successfully performed using the resource. Further, if more than one appropriate parallel processing subsystem resource is available, the scheduler typically executes the work component using the appropriate parallel processing subsystem resource that has been least recently used.
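The selection policy described above may be modeled, purely as a simplified sketch, in Python. The class names `WorkComponent` and `Resource`, the function `schedule_slice`, the resource kinds "compute" and "copy", and all instance names are assumptions introduced for illustration. The sketch further assumes that the pending list holds only work components that are eligible to run (for example, the oldest unfinished component of each stream) and that each resource serves one work component per time slice.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WorkComponent:
    name: str
    issue_seq: int   # global issue order; a lower value was issued earlier
    kind: str        # kind of resource required (hypothetical labels)


@dataclass
class Resource:
    name: str
    kind: str
    last_used: int = -1  # time slice in which the resource was last allocated


def schedule_slice(pending: List[WorkComponent],
                   resources: List[Resource],
                   now: int) -> List[Tuple[str, str]]:
    """Assign eligible work to resources for the time slice `now`.

    Work is considered in issue order; among the free resources able to
    perform a component, the least recently used one is chosen.
    """
    assignments = []
    free = list(resources)
    for work in sorted(pending, key=lambda w: w.issue_seq):
        candidates = [r for r in free if r.kind == work.kind]
        if not candidates:
            continue  # no suitable resource is free in this slice
        chosen = min(candidates, key=lambda r: r.last_used)
        chosen.last_used = now
        free = [r for r in free if r is not chosen]
        assignments.append((work.name, chosen.name))
    return assignments


resources = [
    Resource("sm0", "compute", last_used=3),
    Resource("sm1", "compute", last_used=1),
    Resource("ce0", "copy", last_used=2),
]
pending = [
    WorkComponent("k2", 2, "compute"),
    WorkComponent("k0", 0, "compute"),
    WorkComponent("m1", 1, "copy"),
    WorkComponent("k3", 3, "compute"),
]
result = schedule_slice(pending, resources, now=4)
# result == [("k0", "sm1"), ("m1", "ce0"), ("k2", "sm0")]; k3 waits for a later slice
```

Note that nothing in this policy considers how urgent a work component is; the only inputs are issue order and resource usage history.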
One drawback to this approach is that some work components are more sensitive to latency than others, and the execution of work components in strict issue-order on the least recently used parallel processing subsystem resources may cause software applications that include latency-sensitive work components to execute with unacceptable latency and throughput. For example, if a software application is performing video decoding and encoding using a pipelined workflow and the first few stages in the pipeline are occupying most of the parallel processing subsystem resources processing a fifth frame, then the processing of a fourth frame by the last stage in the pipeline could be delayed. Consequently, the overall latency of the fourth frame could cause jitter in frame rates.
Another drawback to the above approach is that some software applications may be sensitive to execution order because they include inter-stream dependencies between work components requiring varying execution times. For example, a software application performing high-performance simulation of large molecular systems (e.g., NAMD) may use parallel molecular dynamics algorithms that include work components whose required execution times vary dramatically. Often, such algorithms divide the work into multiple streams with inter-dependencies. For example, a first stream could include “halo” work components whose results are required by “dependent” work components included in a second stream. The first stream could also include “internal” work components whose results are not required by work components included in any other stream. Further, the “halo” work components could require much shorter execution times than the “internal” work components. If the “internal” work components occupy most of the subsystem resources, then the “halo” work components could get stalled (i.e., blocked until any “internal” work components preceding the “halo” components complete). Because “dependent” work components included in the second stream require the results from “halo” work components included in the first stream, the second stream could be blocked until the blocking “internal” work components included in the first stream complete execution. Consequently, overall throughput of the software application could be adversely impacted.
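The stall described above may be illustrated with a small Python model. The function names, the component names, and the execution times are hypothetical values chosen only for illustration and are not taken from NAMD or any measured workload. The model assumes that the work components of the first stream execute one after another in issue order, without overlap.

```python
from typing import Dict, List, Tuple

Component = Tuple[str, int]  # (name, execution time in arbitrary time units)


def completion_times(stream: List[Component]) -> Dict[str, int]:
    """Completion time of each component when the stream runs in issue order."""
    clock = 0
    done = {}
    for name, duration in stream:
        clock += duration
        done[name] = clock
    return done


def dependent_start(first_stream: List[Component], halo: str,
                    second_stream_ready_at: int = 0) -> int:
    """Earliest time a "dependent" component in the second stream can start.

    It cannot start before the named "halo" component in the first stream
    has completed.
    """
    halo_done = completion_times(first_stream)[halo]
    return max(halo_done, second_stream_ready_at)


# Two long "internal" components are issued ahead of one short "halo" component.
first_stream = [("internal_0", 40), ("internal_1", 40), ("halo_0", 5)]

start = dependent_start(first_stream, "halo_0")
stall = start - 5  # time lost relative to running the 5-unit halo work first
print(start, stall)  # prints: 85 80
```

In this example, the second stream remains blocked for 85 time units even though the "halo" work it depends on requires only 5 time units of execution.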
As the foregoing illustrates, what is needed in the art is a more effective technique to schedule work submitted to parallel processing subsystems.