As those skilled in the pertinent art are aware, applications may be executed in parallel to increase their performance. Data parallel applications carry out the same process concurrently on different data. Task parallel applications carry out different processes concurrently on the same data. Static parallel applications are applications having a degree of parallelism that can be determined before they execute. In contrast, the parallelism achievable by dynamic parallel applications can only be determined as they are executing. Whether the application is data or task parallel, or static or dynamic parallel, it may be executed in a pipeline, as is often the case for graphics applications.
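The distinction between data parallelism and task parallelism may be illustrated by the following minimal C++ sketch. It is an illustration only and is not drawn from any particular system; the names double_all, summarize and Summary are hypothetical, and standard std::thread objects stand in for whatever threads of execution a given processor provides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Data parallel: every thread carries out the same process (doubling)
// on a different slice of the data.
std::vector<int> double_all(std::vector<int> data, std::size_t num_threads) {
    assert(num_threads > 0);
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        threads.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i) data[i] *= 2;
        });
    }
    for (auto& th : threads) th.join();
    return data;
}

// Hypothetical result type for the task parallel example.
struct Summary {
    long sum;
    int max;
};

// Task parallel: two threads carry out different processes (a sum and a
// maximum) concurrently on the same, read-only data. Each thread writes
// a distinct member of the result, so the writes do not conflict.
Summary summarize(const std::vector<int>& data) {
    Summary s{0, 0};
    std::thread sum_thread([&] {
        s.sum = std::accumulate(data.begin(), data.end(), 0L);
    });
    std::thread max_thread([&] {
        s.max = data.empty() ? 0 : *std::max_element(data.begin(), data.end());
    });
    sum_thread.join();
    max_thread.join();
    return s;
}
```

In this sketch the degree of parallelism is passed in as num_threads before the work begins, which corresponds to the static case; a dynamic parallel application would instead decide how many threads to create while it is executing.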
Certain computing systems, such as single-instruction, multiple-data (SIMD) processors, are particularly adept at executing data parallel applications. A pipeline control unit in a SIMD processor creates groups of threads of execution and schedules them for execution, during which all threads in a group execute the same instruction concurrently. In one particular processor, each group has 32 threads, corresponding to 32 execution pipelines, or lanes, in the SIMD processor.
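The grouping of threads into lanes may be sketched as follows, assuming the group width of 32 from the particular processor mentioned above. This is a software model only, not a description of any actual pipeline control unit; the names kLanes, slot_of, groups_needed, threads_in_group and execute_instruction are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Group width taken from the 32-lane example; other processors may differ.
constexpr std::size_t kLanes = 32;

// Position of a thread: which group it belongs to and which lane it occupies.
struct ThreadSlot {
    std::size_t group;
    std::size_t lane;
};

// Assumed mapping: consecutive thread indices fill consecutive lanes.
inline ThreadSlot slot_of(std::size_t thread_id) {
    return {thread_id / kLanes, thread_id % kLanes};
}

// Number of groups needed to hold total_threads threads (rounded up).
inline std::size_t groups_needed(std::size_t total_threads) {
    return (total_threads + kLanes - 1) / kLanes;
}

// Number of occupied lanes in a given group; the last group may be partial.
inline std::size_t threads_in_group(std::size_t total_threads, std::size_t group) {
    assert(group < groups_needed(total_threads));
    const std::size_t remaining = total_threads - group * kLanes;
    return remaining < kLanes ? remaining : kLanes;
}

// Models one group executing one instruction: the same operation is applied
// in every lane, each lane holding its own data. The loop runs the lanes one
// after another; in hardware they would execute concurrently.
template <typename Op>
void execute_instruction(std::array<int, kLanes>& lane_registers, Op op) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        lane_registers[lane] = op(lane_registers[lane], lane);
    }
}
```

Under this assumed mapping, 100 threads would occupy four groups, the last of which has only four of its 32 lanes in use.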
Consider a fork-join parallel programming model such as OpenMP or OpenACC implemented on a parallel processing computing system. In this model, some parts of a program's code are executed by only one thread (a “master” thread) while other parts are executed by multiple threads in parallel (“worker” threads). Execution starts with only the master thread active. At a work creation construct, execution is forked when the master thread activates worker threads and assigns each worker an “execution task,” such as a certain number of iterations of a loop. Worker threads then typically execute their assigned tasks in parallel. Once the worker threads are finished, they deactivate, and execution is joined when the master thread resumes execution of the remainder of the program code. The period of program execution when only one thread is active will be referred to herein as the sequential region or phase, and the period of program execution when more than one thread is active will be referred to herein as the parallel region or phase.
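The fork-join sequence described above may be sketched in C++ as follows. The sketch uses std::thread rather than OpenMP or OpenACC, so it is an analogy to those models and not an implementation of either; the function name sum_of_squares and the block-wise assignment of loop iterations are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Returns the sum of i*i for i in [0, n), with the loop iterations divided
// among num_workers worker threads.
unsigned long long sum_of_squares(std::size_t n, std::size_t num_workers) {
    assert(num_workers > 0);

    // Sequential region: only the master thread is active.
    std::vector<unsigned long long> partial(num_workers, 0);
    const std::size_t chunk = (n + num_workers - 1) / num_workers;

    // Fork: the master activates the workers and assigns each one an
    // execution task, here a contiguous block of loop iterations.
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < num_workers; ++w) {
        const std::size_t begin = std::min(n, w * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        workers.emplace_back([&partial, w, begin, end] {
            unsigned long long acc = 0;
            for (std::size_t i = begin; i < end; ++i) {
                acc += static_cast<unsigned long long>(i) * i;
            }
            partial[w] = acc;  // each worker writes only its own slot
        });
    }

    // Join: the master waits until every worker has finished.
    for (auto& worker : workers) worker.join();

    // Sequential region resumes: the master combines the workers' results.
    unsigned long long total = 0;
    for (unsigned long long p : partial) total += p;
    return total;
}
```

In this sketch the master thread only waits during the parallel region; whether the master also takes a share of the work is a design choice that varies between programming models and is not fixed by the description above.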
In many fork-join models, including OpenMP and OpenACC, data objects allocated in the sequential region can be accessed in the parallel region. Accordingly, parallel processor architectures provide memory for storing the data objects to which multiple threads may gain access during their execution. This memory may be characterized by many properties, including size, latency and volatility, each of which carries its own advantages and disadvantages.