Field of the Invention
The invention generally relates to scheduling and, more specifically, to techniques for improving performance in multi-threaded processing units.
Description of the Related Art
Typically, computer systems share hardware between multiple applications. In a central processing unit (CPU), an application is executed until an event causes a context switch to a different application. A context switch may be triggered by certain operating conditions, such as reaching a particular point in the execution flow that requires some action by a different process or hardware unit, the expiration of a time quantum allocated to a process, or the system needing to run a higher-priority process (i.e., pre-emption).
There are several common approaches to implementing a context switch. In one implementation, the CPU may stop processing at an instruction boundary, push context state onto a stack, and load a different context. In a different approach implemented by graphics processing units (GPUs), the GPU stops issuing new instructions and waits for the current work to complete processing through a pipeline (i.e., Wait-For-Idle or WFI). In this approach, the GPU is not required to save any context state because the already issued work is allowed to complete processing. In another approach implemented by GPUs, the GPU may stop processing of all work currently in the pipeline, write the context state to memory, and load a different context.
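By way of illustration only, the two GPU approaches described above can be sketched with a toy model. The names ToyPipeline, switch_wait_for_idle, switch_save_state, and cycles_per_item_saved are hypothetical and do not correspond to any actual hardware or driver interface. The model also assumes that in-flight work items progress in parallel, so the drain time equals the longest remaining item.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple


@dataclass
class ToyPipeline:
    """Hypothetical pipeline: each entry is the remaining cycle count of one in-flight work item."""
    in_flight: Deque[int] = field(default_factory=deque)

    def issue(self, cycles_remaining: int) -> None:
        self.in_flight.append(cycles_remaining)


def switch_wait_for_idle(pipe: ToyPipeline) -> Tuple[int, List[int]]:
    """WFI: stop issuing and let already issued work drain; no state is saved.

    Assumption: items progress in parallel, so drain time is the longest item.
    """
    drain_cycles = max(pipe.in_flight, default=0)
    pipe.in_flight.clear()
    return drain_cycles, []


def switch_save_state(pipe: ToyPipeline, cycles_per_item_saved: int) -> Tuple[int, List[int]]:
    """Halt all in-flight work, write its state to memory, and return the saved state.

    Assumption: saving costs a fixed number of cycles per in-flight item.
    """
    saved = list(pipe.in_flight)
    save_cycles = len(saved) * cycles_per_item_saved
    pipe.in_flight.clear()
    return save_cycles, saved
```

In this sketch, the WFI path returns an empty saved state because all issued work has completed, whereas the save path returns the state that would later need to be restored when the original context resumes.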
Typically, GPU context state is much larger than CPU context state. In addition, GPUs typically implement longer pipelines through which work takes hundreds or thousands of clock cycles to pass. Thus, both of the GPU approaches set forth above introduce a large context switch latency penalty. In the WFI approach, previously issued instructions are required to complete processing in the pipeline, which takes hundreds or thousands of clock cycles, before new instructions associated with the different context may be issued. In the approach that halts processing and writes the context state to memory, saving the large GPU context state may take hundreds or even thousands of clock cycles as well.
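The latency penalty described above can be estimated with a back-of-the-envelope calculation. The following is a minimal sketch only; the function names and the parameters pipeline_depth_cycles, state_bytes, and bytes_per_cycle are hypothetical, and the model ignores memory latency, contention, and the cost of loading the new context.

```python
def wfi_latency_cycles(pipeline_depth_cycles: int) -> int:
    """WFI: new work must wait for the pipeline to drain.

    Assumption: the wait is bounded by the pipeline depth in cycles.
    """
    return pipeline_depth_cycles


def save_latency_cycles(state_bytes: int, bytes_per_cycle: int) -> int:
    """Save approach: cycles needed to write the context state to memory.

    Assumption: a constant write rate; the result is rounded up to whole cycles.
    """
    if bytes_per_cycle <= 0:
        raise ValueError("bytes_per_cycle must be positive")
    return -(-state_bytes // bytes_per_cycle)  # ceiling division
```

For example, under the purely illustrative assumption of 64 KiB of context state written at 64 bytes per cycle, save_latency_cycles(64 * 1024, 64) evaluates to 1024 cycles, which is of the same order as draining a pipeline that is roughly a thousand cycles deep.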
Accordingly, what is needed in the art is a more effective approach for executing multi-threaded operations on GPUs that have large graphics pipelines.