1. Field of the Invention
Embodiments of the present invention relate generally to parallel processing and more specifically to a system and method for context migration across CPU threads.
2. Description of the Related Art
Typical parallel processing subsystems include at least one parallel processing unit (PPU) that may be configured to provide a volume of computational throughput that is impractical to achieve with a single processing unit. The PPU may be configured to incorporate a plurality of processing cores, each capable of executing one or more instances of a parallel program on a plurality of processing engines. Each executing instance of the parallel program, called a PPU thread, or simply “thread,” typically computes a portion of the required overall results.
In conventional application execution models, a user application may employ PPU threads to perform a portion of the computations required by the user application. A user application commonly includes one or more central processing unit (CPU) threads executing on the CPU, where each CPU thread that employs a PPU to perform computations must include an associated PPU context that is specifically coupled to the CPU thread. The PPU context may remain associated with the related CPU thread for the lifetime of the CPU thread, and may not coexist with a second PPU context associated with the same CPU thread. In other words, a one-to-one coupling is created between the CPU thread and the PPU context. This type of conventional execution model has certain benefits. One benefit is that a CPU thread may efficiently refer to and access data in its related PPU context through the simple one-to-one association between CPU thread and PPU context. In some well-known operating systems, the PPU context may reside in thread-local storage, providing an efficient, well-understood access methodology. An additional benefit arises because only one CPU thread may access a given PPU context at a time, thereby eliminating a synchronization step and improving overall performance.
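By way of illustration only, the one-to-one coupling described above may be sketched as follows. The sketch assumes a hypothetical `PPUContext` class and a hypothetical `current_context` helper, neither of which corresponds to any actual driver interface; it shows only how thread-local storage gives each CPU thread exactly one context without any locking.

```python
import threading


class PPUContext:
    """Hypothetical stand-in for a PPU context (assumption, not a real API)."""

    def __init__(self, owner_name):
        self.owner_name = owner_name


# Thread-local storage: each CPU thread sees its own independent attributes.
_tls = threading.local()


def current_context():
    """Return the calling CPU thread's context, creating it on first use."""
    ctx = getattr(_tls, "ctx", None)
    if ctx is None:
        ctx = PPUContext(threading.current_thread().name)
        _tls.ctx = ctx
    return ctx


def run_demo(num_threads=3):
    """Start several CPU threads and record the context each one observes."""
    seen = {}

    def worker():
        first = current_context()
        second = current_context()  # same thread, so the same context again
        seen[threading.current_thread().name] = (first, second)

    threads = [
        threading.Thread(target=worker, name=f"cpu-{i}")
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

In this sketch, no lock protects the context because no other CPU thread can reach it, which reflects the synchronization benefit noted above. A CPU thread likewise has no means of holding a second context.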
However, the conventional execution model also creates certain important inefficiencies. One such inefficiency is that an application that would benefit from incorporating multiple PPU contexts is required to incorporate and execute an additional CPU thread for each additional PPU context. Each additional CPU thread requires additional memory and execution overhead, and introduces additional programming complexity. Furthermore, under the conventional execution model, libraries that call the PPU to perform computations on behalf of user applications are required to create a “worker thread” for each additional PPU context and to delegate any PPU computation from the user application to the worker thread for processing. This delegation process increases application complexity. In both cases, overall performance suffers due to the additional execution overhead.
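By way of illustration only, the worker thread delegation described above may be sketched as follows. The `ContextWorker` class, the dictionary standing in for a PPU context, and the `scaled_sum` function are all hypothetical and are assumptions made for this sketch; the point is the extra CPU thread, queue, and hand-off that each additional context would require.

```python
import queue
import threading


class ContextWorker:
    """Hypothetical worker thread that owns one additional PPU context."""

    def __init__(self, name):
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._loop, name=name)
        self._thread.start()

    def _loop(self):
        # Stand-in for a PPU context bound to this worker thread (assumption).
        context = {"owner": threading.current_thread().name}
        while True:
            item = self._tasks.get()
            if item is None:  # shutdown sentinel
                break
            func, args, reply = item
            try:
                reply.put((True, func(context, *args)))
            except Exception as exc:  # pass failures back to the caller
                reply.put((False, exc))

    def submit(self, func, *args):
        """Delegate a computation to the worker and block for its result."""
        reply = queue.Queue(maxsize=1)
        self._tasks.put((func, args, reply))
        ok, value = reply.get()
        if not ok:
            raise value
        return value

    def close(self):
        """Stop the worker thread and wait for it to exit."""
        self._tasks.put(None)
        self._thread.join()


def scaled_sum(context, values, factor):
    """Example computation that runs against the worker's context."""
    return context["owner"], factor * sum(values)
```

A caller would use `worker = ContextWorker("ppu-worker-0")`, then `worker.submit(scaled_sum, [1, 2, 3], 2)`, then `worker.close()`. Every such call crosses two queues and a thread switch, which is the kind of execution overhead and complexity attributed above to the delegation process.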
As the foregoing illustrates, what is needed in the art is a technique for efficiently managing PPU contexts within multi-threaded parallel processing systems.