As those skilled in the pertinent art are aware, applications, or programs, may be executed in parallel to increase their performance. Data parallel programs carry out the same process concurrently on different data. Task parallel programs carry out different processes concurrently on the same data. Static parallel programs are programs having a degree of parallelism that can be determined before they execute. In contrast, the parallelism achievable by dynamic parallel programs can be determined only as they execute. Whether a program is data or task parallel, or static or dynamic parallel, it may be executed in a pipeline, which is often the case for graphics programs.
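The distinction between data parallelism and task parallelism drawn above may be illustrated by the following minimal sketch. The function names square_all and sum_and_max are hypothetical and do not appear in this disclosure, and the OpenMP pragmas are an assumed notation chosen only for illustration. A compiler without OpenMP support ignores the pragmas and runs the same code sequentially with identical results.

```c
#include <assert.h>
#include <stddef.h>

/* Data parallel: the same operation (squaring) is applied
 * concurrently to different elements. Iterations are independent. */
void square_all(const int *in, int *out, size_t n)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] * in[i];
    }
}

/* Task parallel: two different operations (sum and maximum) are
 * carried out concurrently on the same input data. Each section
 * writes only its own result variable, so the two do not conflict. */
void sum_and_max(const int *in, size_t n, long *sum, int *max)
{
    assert(n >= 1);      /* the maximum of an empty array is undefined */

    long s = 0;
    int m = in[0];

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (size_t i = 0; i < n; ++i) {
                s += in[i];
            }
        }
        #pragma omp section
        {
            for (size_t i = 1; i < n; ++i) {
                if (in[i] > m) {
                    m = in[i];
                }
            }
        }
    }

    *sum = s;
    *max = m;
}
```

Both functions are static parallel in the sense used above: the number of loop iterations and the number of sections are known before the parallel work begins.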
A SIMT processor is particularly adept at executing data parallel programs. A control unit in the SIMT processor creates groups of threads of execution and schedules them for execution, during which all threads in the group execute the same instruction concurrently. In one particular processor, each group, or “warp,” has 32 threads, corresponding to 32 execution pipelines, or lanes, in the SIMT processor.
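The grouping of threads into warps described above may be sketched as simple index arithmetic. The following is an illustrative sketch only, assuming threads are numbered consecutively from zero and assigned to warps in order. The names WARP_SIZE, warps_needed, warp_of, lane_of and thread_id are hypothetical and are not taken from any particular processor or programming interface.

```c
#include <assert.h>

#define WARP_SIZE 32u   /* threads per warp in the particular processor above */

/* Number of warps needed to hold n_threads threads.
 * The last warp may be only partly occupied. */
unsigned warps_needed(unsigned n_threads)
{
    return n_threads / WARP_SIZE + (n_threads % WARP_SIZE != 0u);
}

/* Warp to which a given thread belongs (assumed consecutive assignment). */
unsigned warp_of(unsigned tid)
{
    return tid / WARP_SIZE;
}

/* Lane (execution pipeline) that a given thread occupies within its warp. */
unsigned lane_of(unsigned tid)
{
    return tid % WARP_SIZE;
}

/* Inverse mapping: recover the thread number from its warp and lane. */
unsigned thread_id(unsigned warp, unsigned lane)
{
    assert(lane < WARP_SIZE);
    return warp * WARP_SIZE + lane;
}
```

Under these assumptions, thread 70 falls in warp 2, lane 6, and a group of 33 threads occupies two warps, the second of which has 31 idle lanes.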
A fork-join data parallel program starts with a single-threaded main program. The program is in a sequential phase or region at this stage. At some point during the execution of the main program, the main, or “master,” thread encounters a sequence of parallel phases or regions. Each parallel region has an independent data set and can be executed by multiple threads concurrently. The number of concurrent tasks in each parallel region is determined when the parallel region starts and does not change during the parallel region. When a parallel region is encountered, the main thread forks a team of threads (called worker threads) to execute the parallel region in parallel. The program then enters the parallel region. If a worker thread encounters a new parallel region, the new parallel region will be serialized, i.e., the new parallel region will be executed by the encountering worker thread itself. The master thread waits until the parallel region finishes. Upon exiting the parallel region, the worker threads join with the master thread, which then resumes the execution of the main program, at which point the program enters a sequential region.
Table 1, below, sets forth an example of a fork-join data parallel program.
TABLE 1
An Example of a Fork-join Data Parallel Program

extern ext( );
thread_main( )
{
  foo( );
  ext( );
  #pragma parallel loop
  for (...) {
    foo( );
    bar( );
  }
  #pragma parallel loop
  for (...) {
    ext( );
  }
  bar( );
}
bar( )
{
  #pragma parallel loop
  for (...) {
    ...
  }
}
foo( )
{
  ...
}
For purposes of understanding Table 1 and the remainder of this disclosure, the terms “foo” and “bar” are arbitrary names of functions. Any function can therefore be substituted for “foo” or “bar.”
The fork-join data parallel model is commonly used in parallel programming. For example, the OpenMP standard adopts this model as its basic thread execution model. The OpenACC standard uses this model for the worker threads in a group called a “gang.”
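By way of illustration only, the fork-join behavior described above may be sketched in C with OpenMP pragmas. The function names row_sum and matrix_sum are hypothetical and do not appear in Table 1. The sketch assumes an OpenMP implementation in which nested parallelism is inactive, as is commonly the default, so that the inner parallel region is executed by the encountering worker thread alone. The result is the same whether or not the nested region is serialized, and a compiler without OpenMP support simply runs the code sequentially.

```c
#include <assert.h>

/* Sums one row. Like bar( ) in Table 1, this function contains its own
 * parallel region. When called from a worker thread, the region is
 * nested and (under the stated assumption) is serialized. */
static long row_sum(const int *row, int cols)
{
    long s = 0;

    #pragma omp parallel for reduction(+:s)
    for (int j = 0; j < cols; ++j) {
        s += row[j];
    }
    return s;
}

/* Sums a rows-by-cols matrix stored row by row. */
long matrix_sum(const int *m, int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);

    /* Sequential region: only the master thread runs here. */
    long total = 0;

    /* Fork: the master thread creates a team of worker threads, and the
     * loop iterations are divided among them. The iteration count is
     * fixed when the region starts. */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < rows; ++i) {
        total += row_sum(m + (long)i * cols, cols);
    }
    /* Join: the workers synchronize at the end of the loop and the
     * master thread alone continues. */

    /* Sequential region again. */
    return total;
}
```

The reduction clause gives each thread a private partial sum that is combined at the join, which keeps the iterations independent of one another as the model requires.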