1. Field of the Invention
The invention relates generally to compiler systems and, more specifically, to a method for convergence analysis based on thread variance analysis.
2. Description of the Related Art
Certain computer systems include a parallel processing subsystem that may be configured to concurrently execute plural program threads that are instantiated from a common program. Such systems are referred to in the art as having single instruction multiple thread (SIMT) parallelism. CUDA is a programming model known in the art that implements SIMT execution on parallel processing subsystems. An application program written for CUDA may include sequential C language programming statements, and calls to a specialized application programming interface (API) used for configuring and managing parallel execution of program threads. A function within a CUDA application that is destined for concurrent execution on a parallel processing subsystem is referred to as a “thread program” or “kernel.” An instance of a thread program is referred to as a thread, and a set of concurrently executing threads is organized as a thread block. A set of thread blocks may further be organized into a grid. Each thread is identified by an instance of an implicitly defined set of index variables configured to store thread identity information for the thread. Each thread may access its instance of the index variables and act independently with respect to other threads based on the thread identity information residing in the index variables.
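By way of illustration only, the relationship between a thread program, a thread block, a grid, and the implicitly defined index variables may be sketched as follows. The kernel name `scale_kernel`, the host helper `scale_on_device`, and the block size of 128 threads are hypothetical choices made for this sketch and do not appear in the description above; error checking of the CUDA runtime calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical thread program (kernel). Every thread runs this same function,
// but each thread sees its own values of the built-in index variables.
__global__ void scale_kernel(const float* in, float* out, int n, float factor)
{
    // blockIdx, blockDim, and threadIdx are implicitly defined by CUDA.
    // Together they give each thread a unique position within the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads whose index falls past the end of the data do nothing.
    if (i < n) {
        out[i] = in[i] * factor;
    }
}

// Hypothetical host helper: copies data to the device, launches a grid of
// thread blocks, and copies the result back. Runtime error checks omitted.
std::vector<float> scale_on_device(const std::vector<float>& host_in, float factor)
{
    int n = static_cast<int>(host_in.size());
    std::vector<float> host_out(n);
    if (n == 0) {
        return host_out;
    }

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, host_in.data(), bytes, cudaMemcpyHostToDevice);

    // Assumed launch configuration: 128 threads per block, and enough
    // blocks in the grid to cover all n elements.
    int threads_per_block = 128;
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    scale_kernel<<<blocks, threads_per_block>>>(d_in, d_out, n, factor);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(host_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

In this sketch, a call such as `scale_on_device(values, 2.0f)` launches one thread per element, and each thread selects its element solely from the thread identity information in its index variables.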
One consequence of acting independently is that one set of threads may execute one branch of a conditional statement, while another set of threads executes a different branch of the same conditional statement. In such a scenario, the two different sets of threads execute divergent paths that need to converge at some point later during execution. Synchronization barrier operations in divergent portions of the thread program may lead to incorrect behavior, including deadlock. Conventional techniques for compiling thread programs are not able to detect divergent execution scenarios that may lead to incorrect execution behavior. Instead, conventional compilers depend on explicit source code directives and on an assumption that a thread program design is correct by construction, an assumption that does not always hold. For example, a synchronization barrier may be executed in one branch of a conditional statement, but not in a different branch, preventing the synchronization barrier from ever unblocking and a related thread block from ever converging and completing. In scenarios where a divergence error such as this is present in the thread program design, the thread program may compile without error, but then function incorrectly at runtime.
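The divergence error described above may be illustrated with the following sketch, offered under stated assumptions rather than as a definitive example. The kernel names, the helper `swap_pairs_on_device`, and the block size of 64 are hypothetical. The first kernel places the barrier `__syncthreads()` inside a branch taken only by even-numbered threads; such a kernel would typically compile without error, yet its runtime behavior is undefined and may include a hang, so the sketch deliberately never launches it. The second kernel performs the same data exchange with the barrier hoisted out of the conditional so that every thread in the block reaches it.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 64;  // assumed block size for this sketch

// FAULTY (never launched here): the barrier sits in a divergent branch.
// Odd-numbered threads never reach __syncthreads(), so the barrier may
// never release and the thread block may never converge.
__global__ void divergent_barrier_kernel(int* data)
{
    __shared__ int tile[kBlockSize];
    int t = threadIdx.x;
    tile[t] = data[t];
    if (t % 2 == 0) {
        __syncthreads();          // reached by even threads only
        data[t] = tile[t + 1];
    }
}

// CORRECTED: every thread reaches the barrier before any thread reads a
// neighbor's value. Divergence after the barrier is harmless.
__global__ void convergent_barrier_kernel(int* data)
{
    __shared__ int tile[kBlockSize];
    int t = threadIdx.x;
    tile[t] = data[t];
    __syncthreads();              // reached by all threads in the block
    if (t % 2 == 0) {
        data[t] = tile[t + 1];    // even thread takes its right neighbor
    } else {
        data[t] = tile[t - 1];    // odd thread takes its left neighbor
    }
}

// Hypothetical host helper: swaps adjacent pairs using the corrected kernel.
// Expects exactly kBlockSize elements; returns an empty vector otherwise.
// Runtime error checks are omitted for brevity.
std::vector<int> swap_pairs_on_device(const std::vector<int>& host_in)
{
    if (host_in.size() != static_cast<size_t>(kBlockSize)) {
        return std::vector<int>();
    }
    size_t bytes = kBlockSize * sizeof(int);
    int* d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, host_in.data(), bytes, cudaMemcpyHostToDevice);

    convergent_barrier_kernel<<<1, kBlockSize>>>(d_data);

    std::vector<int> host_out(kBlockSize);
    cudaMemcpy(host_out.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return host_out;
}
```

The two kernels differ only in where the barrier sits relative to the conditional statement, which is why a compiler that does not analyze divergence has no basis for rejecting the faulty one.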
In scenarios where a thread program design provides for correct operation, certain sections of the thread program may execute identically over an arbitrary number of threads. Such sections of the thread program are referred to as thread invariant, and produce identical results over an arbitrarily large thread block or number of thread blocks because each thread performs an identical sequence of computations on an identical set of inputs. Conventional compilers are not able to detect which sections of a thread program are thread invariant, and are therefore required to schedule all portions of the thread program to execute in parallel, leading to inefficient utilization of resources within the parallel processing subsystem.
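As a minimal sketch of the distinction drawn above, the following hypothetical kernel contains one thread invariant computation and one computation that varies per thread. The names `add_bias_kernel` and `add_bias_on_device`, the parameters `a` and `b`, and the block size of 128 are assumptions introduced only for illustration. The value `bias` depends solely on kernel parameters, so every thread computes the same result from the same inputs, whereas the index `i` and the element it selects depend on the thread identity information.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel mixing thread invariant and thread variant work.
__global__ void add_bias_kernel(const float* in, float* out, int n, float a, float b)
{
    // Thread invariant: depends only on kernel parameters, so every thread
    // in every block redundantly computes the identical value.
    float bias = a * a + b;

    // Thread variant: depends on the index variables, so each thread
    // obtains a different value and touches a different element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] + bias;
    }
}

// Hypothetical host helper. Runtime error checks are omitted for brevity.
std::vector<float> add_bias_on_device(const std::vector<float>& host_in, float a, float b)
{
    int n = static_cast<int>(host_in.size());
    std::vector<float> host_out(n);
    if (n == 0) {
        return host_out;
    }

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, host_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads_per_block = 128;  // assumed launch configuration
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    add_bias_kernel<<<blocks, threads_per_block>>>(d_in, d_out, n, a, b);

    cudaMemcpy(host_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

A compiler able to identify `bias` as thread invariant could, in principle, arrange for it to be computed once rather than once per thread; the sketch itself performs no such optimization and only marks where the opportunity lies.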
As the foregoing illustrates, what is needed in the art is a technique for more efficiently managing execution divergence in thread programs.