Field of the Invention
Embodiments of the present invention generally relate to general purpose computing and, more specifically, to techniques for sharing priorities between streams of work and dynamic parallelism.
Description of the Related Art
A typical parallel processing subsystem, which may include one or more graphics processing units (GPUs), is capable of achieving very high performance by using a relatively large number of small, parallel execution threads on dedicated programmable hardware processing units. The specialized design of such parallel processing subsystems usually allows these subsystems to efficiently perform certain tasks, such as rendering 3-D scenes or computing the product of two matrices, using a high volume of concurrent computational and memory operations.
To fully realize the processing capabilities of advanced parallel processing subsystems, subsystem functionality may be exposed to application developers through one or more application programming interfaces (APIs) of calls and libraries. Among other things, doing so enables application developers to tailor a software application executing on a central processing unit (CPU) to optimize the way parallel processing subsystems function. In one approach to developing a software application, the software application developer may divide work included in the software application into streams of work components (e.g., computational and memory operations). Each stream may be executed concurrently on the parallel processing subsystem. More specifically, work components included in different streams may run concurrently and may be interleaved. In contrast, within each stream, a sequence of work components executes in issue-order on the parallel processing subsystem.
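The stream semantics described above can be illustrated with a minimal Python sketch. All names here (e.g., `interleave`) are hypothetical and serve only to illustrate the ordering guarantees: work components in different streams may interleave, while work components within a single stream execute in issue order.

```python
from collections import deque

# Hypothetical sketch: each stream is modeled as a FIFO of work components.
# Within a stream, components run in issue order; across streams, the
# scheduler may interleave freely (simple round-robin here).
def interleave(streams):
    order = []
    queues = [deque(s) for s in streams]
    while any(queues):
        for q in queues:
            if q:
                order.append(q.popleft())
    return order

stream_a = ["A1", "A2"]  # issue order within stream A is preserved
stream_b = ["B1", "B2"]  # issue order within stream B is preserved
print(interleave([stream_a, stream_b]))  # -> ['A1', 'B1', 'A2', 'B2']
```

The round-robin policy is only one possible interleaving; the point is that any valid schedule preserves per-stream issue order.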
The parallel processing subsystem may schedule the execution of the work components using a variety of techniques depending on the functionality included in the parallel processing subsystem. Two features that may be included in advanced parallel processing subsystems are support for prioritizing work components and preemption of currently executing computational work components. For example, a parallel processing subsystem that supports prioritization may be configured to schedule work components in priority-order. And a preemption-capable parallel processing subsystem may be configured to preempt a lower-priority computational work component executing on a parallel processing subsystem resource in favor of a higher-priority computational work component.
Typically, a parallel processing subsystem that includes prioritization functionality supports a limited set of priorities—referred to herein as a “set of valid device priorities.” In one approach to exposing prioritization capabilities, an API included in a software stack enables the software application to assign a desired stream priority to a stream. An API driver (also included in the software stack) then maps the desired stream priority to a device priority included in the set of valid device priorities. Further, the API driver may store the device priority in a memory resource associated with the stream. Subsequently, if the software application requests the launch of a work component within the stream, then the API driver may request that the parallel processing subsystem launch the work component with the device priority associated with the stream.
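One possible way for an API driver to map a requested stream priority onto the limited set of valid device priorities is simple clamping to the supported range, as in the following hypothetical sketch. The function names and the 0-to-3 priority range are assumptions for illustration, not part of any particular API:

```python
# Hypothetical sketch: the API driver maps an application's requested
# stream priority onto the device's limited set of valid device
# priorities by clamping to the supported range [device_lo, device_hi].
def map_stream_priority(requested, device_lo, device_hi):
    return max(device_lo, min(device_hi, requested))

# Memory resource associating each stream with its device priority.
stream_priorities = {}

def create_stream(name, requested_priority, device_lo=0, device_hi=3):
    stream_priorities[name] = map_stream_priority(
        requested_priority, device_lo, device_hi)

create_stream("video_decode", 7)          # request exceeds the valid set
print(stream_priorities["video_decode"])  # -> 3 (clamped to highest valid)
```

A subsequent launch request within the stream would then look up the stored device priority rather than the originally requested value.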
Advanced parallel processing subsystems may also support dynamic parallelism. Dynamic parallelism allows a “parent” work component executing on the parallel processing subsystem to launch a “child” work component on the parallel processing subsystem. The parallel processing subsystem may also enable the “parent” work component to optionally synchronize on the completion of the “child” work component. Further, the parallel processing subsystem may enable the “parent” work component to consume the output produced by the “child” work component. In some implementations, the parallel processing subsystem performs the launching, synchronization, and consumption of the results of a “child” work component without involving the CPU.
Some parallel processing subsystems support multiple levels of nested “child” launches, where each subordinate launch executes at a new level. In other words, the “parent” work component executing at a first level may launch a first “child” work component. The first “child” work component executing at a second level may then launch a second “child” work component. The second “child” work component executing at a third level may then launch a third “child” work component, and so on. Because of resource limitations, such as the memory required by the parallel processing subsystem to support each new level, the parallel processing subsystem will typically define a maximum nesting depth (N). Notably, the maximum nesting depth N is the maximum number of work components in the chain (and, therefore, the maximum number of levels). For example, if a “parent” work component launches a first “child” work component and then the first “child” work component launches a second “child” work component, then the nesting depth would be three (N=3). Any launch of a work component that would result in a “child” kernel executing at a deeper level than the maximum nesting depth will fail.
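The nesting-depth limit can be illustrated with the following hypothetical sketch, where `MAX_NESTING_DEPTH` and `try_launch` are illustrative names. A component at level L launches a child at level L+1, and any launch that would exceed the maximum nesting depth fails:

```python
MAX_NESTING_DEPTH = 3  # N: maximum number of work components in the chain

def try_launch(parent_level):
    # A work component executing at level L launches a child at level L + 1.
    child_level = parent_level + 1
    if child_level > MAX_NESTING_DEPTH:
        raise RuntimeError("launch failed: exceeds maximum nesting depth")
    return child_level

level = try_launch(1)      # parent at level 1 launches child at level 2
level = try_launch(level)  # child at level 2 launches grandchild at level 3
# try_launch(level) would now fail: level 4 exceeds N = 3
```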
Dynamic parallelism usually requires that a “parent” work component is able to synchronize on any “child” work components that the “parent” work component launches. However, if executing the “parent” work component were to completely occupy the parallel processing subsystem resources, then the “child” work component would be unable to execute. Consequently, the “parent” work component would be unable to synchronize on the “child” work component. To avoid synchronization problems, the parallel processing subsystem is typically configured to ensure that the “child” work component receives enough resources to fully execute. In particular, the parallel processing subsystem is typically configured to give preference to the “child” work component whenever there is a resource contention between the “child” work component and the “parent” work component.
To ensure preferential treatment for the “child” work component, the parallel processing subsystem typically uses one or more of the valid device priorities mentioned above. More specifically, the parallel processing subsystem assigns a “child” work component a device priority that is one higher than that of the “parent” work component. Because each “child” work component may also launch nested “child” work components, to support a maximum nesting depth of N, the parallel processing subsystem requires (N−1) valid device priorities for child work components.
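The relationship between the maximum nesting depth N and the number of reserved device priorities can be sketched as follows (hypothetical names, for illustration only). Children at levels 2 through N each need a device priority one higher than their parent, so (N−1) priorities must be set aside:

```python
# Hypothetical sketch: each "child" runs one device priority above its
# "parent", so children at levels 2..N consume (N - 1) distinct priorities.
def child_priority(parent_priority):
    return parent_priority + 1

def reserved_child_priorities(max_nesting_depth):
    return max_nesting_depth - 1

N = 4
p = 0                              # "parent" at the lowest device priority
p = child_priority(p)              # level-2 child -> priority 1
p = child_priority(p)              # level-3 child -> priority 2
p = child_priority(p)              # level-4 child -> priority 3
print(reserved_child_priorities(N))  # -> 3 priorities reserved for children
```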
In one approach to accommodating dynamic parallelism in conjunction with prioritizing streams, the API driver reserves (N−1) valid device priorities to support a fixed maximum nesting depth of N. Further, the API driver is configured to disregard the device priority associated with a stream when launching a “parent” work component within the stream. In particular, upon receiving a request to launch a “parent” work component, the API driver launches the “parent” work component at the lowest valid device priority.
One drawback to this approach to prioritizing work is that by reserving (N−1) valid device priorities, the number of valid device priorities available for prioritizing streams is reduced by (N−1). And reducing the number of device priorities available for prioritizing streams may reduce the ability of application developers to optimize the performance of software applications. For example, to tune a software algorithm that performs video decoding and encoding using a pipelined workflow with M stages, an application developer could strategically allocate the work components for each stage into M prioritized streams. More specifically, to reduce the likelihood that a particular frame is starved for resources by subsequent frames, the second stage could be prioritized higher than the first stage, the third stage could be prioritized higher than the second stage, and so on. If a parallel processing subsystem were to support fewer than M device priorities for prioritizing streams, then the parallel processing subsystem would be unable to fully support the stream prioritization requests included in the software application. Consequently, the overall latency of each frame could increase, making jitter in frame rates more likely than in a parallel processing subsystem that supported M device priorities for streams.
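The collision of pipeline stages onto too few device priorities can be illustrated with a hypothetical sketch. The clamping scheme below is an assumption for illustration: later stages request higher priorities, and when fewer priorities are available than stages, distinct stages collapse onto the same priority, defeating the intended ordering:

```python
# Hypothetical sketch: map M pipeline stages to device priorities, with
# later stages getting higher priority. When fewer priorities are
# available than stages, requests are clamped and stages collide.
def assign_stage_priorities(num_stages, available_priorities):
    return [min(stage, available_priorities - 1)
            for stage in range(num_stages)]

print(assign_stage_priorities(5, 5))  # -> [0, 1, 2, 3, 4]  all distinct
print(assign_stage_priorities(5, 3))  # -> [0, 1, 2, 2, 2]  stages collide
```

In the second case the last three stages share one priority, so a later frame's stage can no longer be favored over an earlier one, which is the starvation scenario described above.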
Another drawback to the above approach to prioritizing work is that indiscriminately launching all “parent” work components at the lowest device priority could adversely affect latency-sensitive work components included in a high priority stream. For instance, suppose that a stream “StreamA” associated with a high device priority were to include a work component “ChildlessA” and a work component “ParentA.” Further, suppose that the work component “ParentA” were configured to launch a “child” work component “ChildA.” Finally, suppose that the work component “ChildlessA” were not configured to launch any “child” work components. The API driver would launch the work component “ChildlessA” at the high device priority associated with the stream and the work component “ParentA” at the lowest device priority. Subsequently, the parallel processing subsystem would launch the “ChildA” work component at a device priority reserved for “child” priorities (i.e., one higher than the lowest device priority). Consequently, if “ParentA” were a highly latency-sensitive work component, then executing “ParentA” at the lowest device priority could increase latency and reduce the execution speed of the software application.
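The scenario above can be expressed as a small sketch (all names, including `launch_priority`, are hypothetical). Under the approach described, any work component that launches children is demoted to the lowest device priority regardless of its stream's priority:

```python
LOWEST, HIGH = 0, 3
stream_priority = {"StreamA": HIGH}

def launch_priority(stream, has_children):
    # Under the approach described, any "parent" work component is
    # launched at the lowest device priority; only childless work
    # components inherit their stream's device priority.
    return LOWEST if has_children else stream_priority[stream]

print(launch_priority("StreamA", has_children=False))  # ChildlessA -> 3
print(launch_priority("StreamA", has_children=True))   # ParentA   -> 0
# ChildA would then run at LOWEST + 1, a priority reserved for children.
```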
As the foregoing illustrates, what is needed in the art is a more effective way to prioritize work submitted to parallel processing subsystems that support dynamic parallelism.