Field of the Invention
Embodiments of the present invention relate generally to computer processing and, more specifically, to tree-based thread management.
Description of the Related Art
In conventional computer processing systems, to execute a program within a particular processing device, a compiler first translates an associated software application text file into an optimized sequence of machine instructions. Typically, the software application text file is written in a general-purpose programming language (e.g., C++), and the machine instructions are targeted to the selected processing device. In particular, the machine instructions may be targeted toward a parallel processing unit (PPU) that is capable of very high performance using a relatively large number of small, parallel execution threads on dedicated programmable hardware processing engines.
In some PPUs, single-instruction, multiple-thread (SIMT) techniques are used to support parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines within the PPU. In operation, the PPU may internally organize such threads into groups of related threads known as “thread groups” or “warps.” Each thread in a warp concurrently executes the same program on different data and is assigned to a different processing engine. Individual threads included in a SIMT warp begin executing at the same program address. However, conditional instructions in the program may cause different threads within a warp to follow divergent execution paths.
Many PPUs manage the execution of the threads at the granularity of the warp, using a hardware-based call return stack (CRS) in a push-pop manner. Consequently, each warp is associated with a single active program counter, and only one path within a warp executes at a time; any other divergent paths are buried in the CRS. In particular, the PPU serializes the execution of divergent paths across each warp, disabling the threads that are not included in the currently executing path. If a particular instruction is embedded within a divergent path, then only the threads that are active during the execution of the divergent path will encounter the instruction. After all the paths have finished sequentially executing, the threads converge to a single execution path.
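The push-pop handling of divergent paths described above can be illustrated with a minimal software sketch. This is not the hardware mechanism of any particular PPU; the `Warp` structure, the `Mask` type, and the `run_branch` function are hypothetical names introduced only for illustration, and the sketch assumes a warp of at most 32 threads with one mask bit per thread.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model: bit i set means thread i of the warp is enabled.
using Mask = std::uint32_t;

struct Warp {
    Mask active;                     // threads enabled on the executing path
    std::vector<Mask> stack;         // software stand-in for the CRS
    std::vector<std::string> trace;  // order in which paths were issued
};

// Models `if (cond) A; else B;` for one warp. `taken` holds the threads for
// which the condition is true. Only one path is active at any time.
void run_branch(Warp& w, Mask taken) {
    const Mask entry = w.active;
    const Mask then_mask = entry & taken;
    const Mask else_mask = entry & ~taken;

    w.stack.push_back(entry);      // mask to restore at reconvergence
    w.stack.push_back(else_mask);  // deferred path, buried in the stack

    // First path: threads on the else path are disabled.
    w.active = then_mask;
    if (w.active != 0) w.trace.push_back("then:" + std::to_string(w.active));

    // Pop the deferred path: threads on the then path are disabled.
    w.active = w.stack.back();
    w.stack.pop_back();
    if (w.active != 0) w.trace.push_back("else:" + std::to_string(w.active));

    // Pop the reconvergence mask: all entry threads run together again.
    w.active = w.stack.back();
    w.stack.pop_back();
    assert(w.active == entry);
}
```

For example, a four-thread warp with `active` equal to `0xF` and `taken` equal to `0x5` issues the then path for threads 0 and 2, then the else path for threads 1 and 3, and only afterwards resumes with all four threads enabled.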
One limitation of this approach to thread management is that serializing divergent paths may distort the semantics of the original software application text file and produce undesirable consequences. For instance, some instructions (e.g., barrier, spinlock, etc.) control program execution flow based on a specified condition that is evaluated per warp for one or more warps. Since the PPU is configured to execute only a single path per warp, such instructions typically gate forward progress through the program for all of the threads in the warp. In particular, upon execution by a particular thread, the instruction may be configured to evaluate the specified condition against all of the threads included in the warp. In such a scenario, certain threads may never receive the opportunity to satisfy the condition and the program may deadlock.
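The deadlock scenario can be sketched with a toy model of a two-thread warp contending for a single spinlock. The function `warp_finishes` and its parameters are hypothetical and exist only for illustration. The sketch assumes that thread 0 already holds the lock when the warp diverges, that the path chosen first keeps the warp until it completes, and that a bounded step count stands in for "never makes progress."

```cpp
#include <cassert>

// Hypothetical model of a two-thread warp and one spinlock.
// Returns true if both threads finish within `max_steps` issue slots.
bool warp_finishes(bool spin_path_first, int max_steps) {
    assert(max_steps > 0);
    bool lock_held = true;      // thread 0 acquired the lock before diverging
    bool holder_done = false;   // thread 0 has released the lock
    bool spinner_done = false;  // thread 1 has acquired and released the lock

    for (int step = 0; step < max_steps; ++step) {
        if (holder_done && spinner_done) return true;

        // Serialized execution: only one path is enabled per issue slot, and
        // the path that was chosen first runs until it completes.
        const bool run_spinner = spin_path_first ? !spinner_done : holder_done;

        if (run_spinner) {
            // Thread 1 tests the lock; it only proceeds once the lock is free.
            if (!lock_held) spinner_done = true;
            // Otherwise it spins while thread 0 stays disabled and therefore
            // cannot reach the release instruction.
        } else {
            // Thread 0 leaves its critical section and releases the lock.
            lock_held = false;
            holder_done = true;
        }
    }
    return holder_done && spinner_done;
}
```

Under these assumptions, `warp_finishes(false, 1000)` returns true because the lock holder runs first and releases the lock, whereas `warp_finishes(true, 1000)` returns false because the spinning path monopolizes the warp and the holder is never enabled.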
Accordingly, what is needed in the art is a more effective technique to manage threads and groups of threads in parallel architectures.