Field of the Invention
The present invention relates generally to compiler programs and, more specifically, to optimization techniques applied when compiling a program written for execution by a multi-threaded processor so that it can be executed by a general purpose single-threaded processor.
Description of the Related Art
Modern graphics processing systems typically include a multi-core graphics processing unit (GPU) configured to execute applications in a multi-threaded manner. For example, NVIDIA's GeForce® 8 GPU has 128 processing cores, with each core having its own floating point unit (FPU) and a set of 1024 registers. Each cluster of 8 processing cores also has 16 KB of shared memory supporting parallel data access. Such an architecture is able to support up to 12,288 concurrent threads, with each thread having its own stack, registers (i.e., a subset of the 1024 registers in a processing core), program counter and local memory.
With such an architecture, multiple instances of a single program (or portions of a program) can be executed in parallel, with each instance having been allocated a thread on a cluster of processing cores such that the threads can simultaneously operate on the same data in shared local memory. NVIDIA's CUDA™ (Compute Unified Device Architecture) software platform provides a C programming language environment that enables developers to write programs that leverage the parallel processing capabilities of GPUs such as the GeForce 8.
Local variables of a CUDA program executing simultaneously on multiple threads of a thread block are handled within each thread's own stack, registers, and local memory, while global variables of the program utilize the memory shared across the processing cores of a cluster. To avoid data races, where the simultaneously executing threads of the thread block manipulate the same global variables in shared memory, developers writing programs with CUDA explicitly insert synchronization barrier instructions (using the __syncthreads() API) throughout their code. These barriers partition the code into sections and enforce an ordering constraint on shared memory operations across the simultaneously executing threads. In particular, a CUDA compiler ensures that all shared memory references occurring prior to a synchronization barrier instruction are completed before any shared memory references after the synchronization barrier.
The execution of concurrent threads of a thread block by a GPU can be simulated on conventional single-threaded general purpose central processing units (CPUs) by compiling or otherwise transforming a pre-existing CUDA program into a single sequence of instructions that can be sequentially executed by a single thread. Importantly, the single sequence of instructions simulates the simultaneous execution of instructions by concurrent threads of a thread block. However, such a transformation must address several memory access and parallel processing challenges. In particular, to avoid data corruption, local variables for each concurrent thread of a thread block (which would have had their own local memory locations in a thread running in a GPU environment) may need to be allocated their own memory locations within the framework of a single-threaded system.
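The memory allocation challenge described above can be illustrated with a minimal C sketch. It is an illustration under stated assumptions, not the transformation of any particular compiler: the kernel being simulated, the function name simulate_block, the constant BLOCK_SIZE, and the variable names are all hypothetical. The assumed kernel has each thread read a value into a local variable v, store it to shared memory, wait at a synchronization barrier, and then combine a value written by another thread with its own v. In the single-threaded version, the code on each side of the barrier becomes a loop over simulated thread IDs, and because v is used on both sides of the barrier, it is given one memory location per simulated thread.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 8 /* hypothetical number of threads in the simulated thread block */

/*
 * Single-threaded simulation of one thread block of a hypothetical kernel:
 *
 *     v = data[tid];
 *     shared[tid] = v;
 *     (synchronization barrier)
 *     data[tid] = shared[BLOCK_SIZE - 1 - tid] - v;
 *
 * All names here are assumptions made for illustration.
 */
void simulate_block(int *data)
{
    int shared[BLOCK_SIZE]; /* stands in for the block's shared memory */
    int v[BLOCK_SIZE];      /* local variable 'v': one slot per simulated thread */

    assert(data != NULL);

    /* Section before the barrier, executed for every simulated thread. */
    for (int tid = 0; tid < BLOCK_SIZE; ++tid) {
        v[tid] = data[tid];
        shared[tid] = v[tid];
    }

    /*
     * The barrier is honored by finishing the first loop before starting
     * the second: every write to 'shared' above completes before any read
     * of 'shared' below.
     */
    for (int tid = 0; tid < BLOCK_SIZE; ++tid) {
        /* Each simulated thread still sees its own value of 'v'. */
        data[tid] = shared[BLOCK_SIZE - 1 - tid] - v[tid];
    }
}
```

If v were kept as a single scalar instead of an array, the second loop would see only the value left behind by the last simulated thread, which is the kind of data corruption the paragraph above refers to. The cost of this naive approach is that every such local variable grows by a factor of the thread block size, which is why efficient allocation matters.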
As the foregoing illustrates, what is needed in the art is a method for efficiently allocating memory for local variables in a program intended to be executed in a multi-threaded environment to enable that program to be executed in a single-threaded environment.