1. Field of the Invention
Embodiments of the present invention relate generally to parallel processing and more specifically to a system and method for dynamically spawning thread blocks within multi-threaded processing systems.
2. Description of the Related Art
Modern multi-processor systems, such as single-instruction multiple-data (SIMD) systems found in graphics processing units (GPUs), typically implement a relatively general programming model. The programming model conventionally includes a means for defining the specific programming instructions for a thread along with any system resources needed by the thread. One or more instances of the thread may execute concurrently in a thread block, where each thread may be allocated a portion of the overall workload through the use of a thread index. After the threads in a given thread block have completed execution, a process management subsystem is alerted and any further processing steps may be initiated by the process management subsystem. In some systems, the process management subsystem is part of a device driver. For example, after a first thread block has completed a first set of computations, the driver may be alerted to the completion of the first set of computations. The driver may subsequently spawn a second thread block to perform a second stage of computation that is based on the first set of computations.
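By way of illustration only, the conventional flow described above may be sketched in Python as follows. The sketch is a host-side model, not GPU code, and makes several assumptions: the names `run_thread_block`, `stage_one`, `stage_two`, and `driver` are hypothetical, a thread pool stands in for a thread block, and the two stages (squaring, then adding one) are arbitrary placeholder computations. It shows each thread selecting its portion of the workload from its thread index, and the driver spawning the second thread block only after the first has completed.

```python
from concurrent.futures import ThreadPoolExecutor


def run_thread_block(kernel, num_threads, *args):
    """Run `kernel` once per thread index; return only when every thread is done."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(kernel, tid, num_threads, *args)
                   for tid in range(num_threads)]
        for future in futures:
            future.result()  # wait for completion and surface any exception


def stage_one(tid, num_threads, src, dst):
    # The thread index selects this thread's share: tid, tid + num_threads, ...
    for i in range(tid, len(src), num_threads):
        dst[i] = src[i] * src[i]


def stage_two(tid, num_threads, src, dst):
    # Second stage consumes the results of the first stage.
    for i in range(tid, len(src), num_threads):
        dst[i] = src[i] + 1


def driver(data, num_threads=4):
    """Model of the driver: it is alerted when block one ends, then spawns block two."""
    intermediate = [0] * len(data)
    result = [0] * len(data)
    run_thread_block(stage_one, num_threads, data, intermediate)
    run_thread_block(stage_two, num_threads, intermediate, result)
    return result
```

In this model every transition between stages passes through `driver`, which mirrors the round trip through the process management subsystem described above.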
Programming models for multi-processor, multi-threaded systems, such as SIMD GPU systems, commonly permit the use of general programming constructs, including conditional operators. A conditional operator guides the execution of a given thread at run-time to follow one of two or more different paths through the programming instructions of the thread. Each path may have unique system resource requirements, such as a specific memory or register allocation. Conditional statements within the programming instructions of the thread are often evaluated dynamically, so the outcome is not known at compile time and the compiler is forced to allocate sufficient system resources to satisfy the most expensive possible dynamic execution path within the thread. As a result, when a thread block is spawned, each thread within the thread block needs to be allocated sufficient resources for the most expensive possible execution path within the thread.
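The worst-case allocation rule described above may be illustrated with a short Python sketch. The function names and the register counts used in the example are hypothetical and are chosen only to make the arithmetic concrete; they do not describe any particular compiler or device.

```python
def per_thread_allocation(path_requirements):
    """Resources reserved per thread: the most expensive path, since the
    path actually taken is not known until run-time."""
    return max(path_requirements.values())


def block_allocation(path_requirements, threads_per_block):
    """Every thread in the block receives the same worst-case allocation."""
    return threads_per_block * per_thread_allocation(path_requirements)


# Hypothetical requirements, in registers, for two conditional paths.
example_paths = {"return_path": 4, "iteration_path": 64}
registers_per_block = block_allocation(example_paths, threads_per_block=256)
```

Under these assumed numbers, each thread is allocated the 64 registers required by the expensive path even if it only ever takes the path that requires 4.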
Certain common multi-threaded algorithms are characterized as having dramatically different system resource requirements for one conditional execution path compared to another conditional execution path. For example, iterative convergence algorithms may require substantial system resources to perform a large or complex iteration computation for non-converged regions, whereas negligible system resources are required for regions that have previously converged and require no further computation. An iterative convergence algorithm typically allocates specific regions of the problem space to specific threads within the multi-threaded system. A conditional statement within a given thread guides the execution of the thread either to perform the iteration computation, if the thread is responsible for a non-converged region, or to return, if the thread is responsible for a converged region. The iteration computation may require substantial system resources, whereas the return path requires almost none. At run-time, however, every thread within the associated thread block needs to be spawned with sufficient system resources to perform the full iteration computation, leading to an inherent inefficiency in resource utilization.
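The conditional structure of such an algorithm, and the resulting inefficiency, may be modeled by the following Python sketch. All names are hypothetical, and the multiplication of a residual by a per-region rate is merely a stand-in for the expensive iteration computation. The sketch spawns one full-size block per pass, as in the conventional approach, and records how many threads in each pass actually take the iteration path.

```python
def iteration_kernel(tid, residuals, rates, tol):
    """One thread's pass over its region; returns True if it did real work."""
    if residuals[tid] < tol:
        return False  # converged region: the thread returns immediately
    residuals[tid] *= rates[tid]  # stand-in for the expensive iteration path
    return True


def run_passes(residuals, rates, tol=0.01, max_passes=100):
    """Spawn a full-size block each pass; count the threads that did work."""
    residuals = list(residuals)  # do not modify the caller's data
    active_per_pass = []
    for _ in range(max_passes):
        active = sum(iteration_kernel(tid, residuals, rates, tol)
                     for tid in range(len(residuals)))
        if active == 0:
            break  # every region has converged
        active_per_pass.append(active)
    return active_per_pass


def utilization(active_per_pass, threads_per_block):
    """Fraction of spawned threads that took the iteration path."""
    spawned = threads_per_block * len(active_per_pass)
    return sum(active_per_pass) / spawned if spawned else 0.0
```

For example, with four regions whose residuals start at 1.0 and hypothetical rates of 0.25, 0.25, 0.25, and 0.5, three regions converge after four passes while the slowest needs seven. The last three passes each spawn four fully provisioned threads to serve a single non-converged region.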
Iterative convergence algorithms frequently converge much of the overall problem space quickly and then spend many iteration cycles attempting to converge small remaining regions. This leads to a common scenario in which many of the computational resources of a multi-processor system are allocated to threads that are not performing any useful work.
As the foregoing illustrates, what is needed in the art is a technique for more efficiently utilizing resources within a multi-threaded processing system.