1. Field of the Invention
The invention relates generally to compiler systems and, more specifically, to a method for transforming a multithreaded program for general execution.
2. Description of the Related Art
Certain computer systems include a parallel processing subsystem that may be configured to concurrently execute plural program threads that are instantiated from a common program. Such systems are referred to in the art as having single program, multiple data (SPMD) parallelism. CUDA is a programming model known in the art that implements SPMD execution on parallel processing subsystems. An application program written for CUDA may include sequential C language programming statements and calls to a specialized application programming interface (API) used for configuring and managing parallel execution of program threads. A function within a CUDA application that is destined for concurrent execution on a parallel processing subsystem is referred to as a “kernel” function. An instance of the kernel is referred to as a thread, and a set of concurrently executing threads is organized as a thread block. A set of thread blocks may further be organized into a grid. Each thread is identified by an implicitly defined set of index variables. Each thread may access its own instance of the index variables and act independently of other threads based on the index variables. For example, CUDA defines a 3-tuple of index variables for thread position within a block, and a 2-tuple of index variables for thread block position within a grid.
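By way of illustration only, the index variables described above may be sketched in plain C. The sketch assumes a row-major flattening of the 3-tuple and 2-tuple into a single element index, and it reads the 2-tuple as locating the thread's block within the grid. The type names Index3 and Index2 and all function names are hypothetical stand-ins introduced for this example; they are not CUDA identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the implicitly defined index variables. */
typedef struct { unsigned x, y, z; } Index3; /* thread position within a block */
typedef struct { unsigned x, y; } Index2;    /* block position within a grid  */

/* Flatten a thread's 3-tuple into an offset within its block (row-major). */
size_t thread_offset_in_block(Index3 thread_idx, Index3 block_dim)
{
    assert(thread_idx.x < block_dim.x);
    assert(thread_idx.y < block_dim.y);
    assert(thread_idx.z < block_dim.z);
    return ((size_t)thread_idx.z * block_dim.y + thread_idx.y) * block_dim.x
           + thread_idx.x;
}

/* Flatten a block's 2-tuple into an offset within the grid (row-major). */
size_t block_offset_in_grid(Index2 block_idx, Index2 grid_dim)
{
    assert(block_idx.x < grid_dim.x);
    assert(block_idx.y < grid_dim.y);
    return (size_t)block_idx.y * grid_dim.x + block_idx.x;
}

/* Combine both offsets into one index that is unique across the grid. */
size_t global_thread_index(Index2 block_idx, Index2 grid_dim,
                           Index3 thread_idx, Index3 block_dim)
{
    size_t threads_per_block =
        (size_t)block_dim.x * block_dim.y * block_dim.z;
    return block_offset_in_grid(block_idx, grid_dim) * threads_per_block
           + thread_offset_in_block(thread_idx, block_dim);
}

/* Body of an illustrative kernel: each logical thread scales only the one
 * element selected by its own index variables, independently of all others. */
void scale_kernel_body(float *data, float factor,
                       Index2 block_idx, Index2 grid_dim,
                       Index3 thread_idx, Index3 block_dim)
{
    size_t i = global_thread_index(block_idx, grid_dim, thread_idx, block_dim);
    data[i] *= factor;
}
```

In this sketch, a grid of 2 by 2 blocks with 4 by 2 by 1 threads per block covers 32 elements, and no two threads are assigned the same element.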
Based on a specific set of index variables, a given thread may independently access memory or other system resources with variable latency, leading to certain threads advancing further in execution than other threads. However, certain algorithms require coherent state among different threads at certain synchronization points before processing may advance. To enable proper synchronization among threads, CUDA provides synchronization barriers, whereby if any thread calls a certain synchronization primitive, all threads within a related group of concurrent threads must call the same synchronization primitive before any thread may advance past the synchronization primitive. In this way, related threads at different stages of execution may synchronize their execution stage before advancing.
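The barrier behavior described above may be sketched, as an analogy only, using a POSIX thread barrier in C. The sketch assumes one operating system thread per logical thread and a group of eight threads; pthread_barrier_wait stands in for the CUDA synchronization primitive and is not that primitive. BLOCK_SIZE, ThreadArgs, thread_body, and run_block are hypothetical names introduced for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* request POSIX barriers under strict -std */
#endif
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define BLOCK_SIZE 8 /* hypothetical number of threads in the group */

typedef struct {
    int index;                  /* this thread's index within the group   */
    int *shared;                /* scratch array visible to every thread  */
    int *result;                /* one output slot per thread             */
    pthread_barrier_t *barrier; /* barrier shared by the whole group      */
} ThreadArgs;

static void *thread_body(void *p)
{
    ThreadArgs *a = p;

    /* Phase 1: each thread publishes a value derived from its index. */
    a->shared[a->index] = a->index * a->index;

    /* No thread advances until all BLOCK_SIZE threads have arrived here. */
    pthread_barrier_wait(a->barrier);

    /* Phase 2: reading a neighbor's value is safe only after the barrier. */
    a->result[a->index] = a->shared[(a->index + 1) % BLOCK_SIZE];
    return NULL;
}

/* Run one group of threads to completion; returns 0 on success. */
int run_block(int result[BLOCK_SIZE])
{
    pthread_t threads[BLOCK_SIZE];
    ThreadArgs args[BLOCK_SIZE];
    int shared[BLOCK_SIZE];
    pthread_barrier_t barrier;

    if (pthread_barrier_init(&barrier, NULL, BLOCK_SIZE) != 0)
        return -1;

    for (int i = 0; i < BLOCK_SIZE; ++i) {
        args[i] = (ThreadArgs){ i, shared, result, &barrier };
        /* A missing thread would leave the others waiting forever. */
        if (pthread_create(&threads[i], NULL, thread_body, &args[i]) != 0)
            abort();
    }
    for (int i = 0; i < BLOCK_SIZE; ++i)
        pthread_join(threads[i], NULL);

    pthread_barrier_destroy(&barrier);
    return 0;
}
```

If the barrier call were removed, a thread that runs ahead could read its neighbor's slot before the neighbor has written it, which is the incoherent state the synchronization point is meant to prevent.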
In certain scenarios, a user may wish to execute an existing SPMD application, such as a CUDA application, on a general purpose central processing unit (CPU) rather than on a parallel processing subsystem. Unfortunately, conventional CPUs are typically configured to execute only a limited number of independent concurrent threads, and conventional operating systems that support execution of a larger number of threads typically map each thread to an independent process, requiring burdensome context switches to perform thread synchronization at synchronization barriers. Therefore, directly mapping the threads of a CUDA program to a set of equivalent threads in a general purpose processing environment represents an unacceptably inefficient approach to executing a CUDA program on a general purpose CPU.
As the foregoing illustrates, what is needed in the art is a technique for efficiently executing an SPMD application on a general purpose CPU.