Recent trends indicate a significant increase in the use of GPUs (graphics processing units) for general-purpose computing (GPGPU). That is, GPUs are increasingly used for computing not necessarily related to computer graphics, such as physics simulation, video transcoding, and other data-parallel computing. Furthermore, the introduction of on-chip shared memory in GPUs has led to marked performance improvements for widely used compute-intensive algorithms such as all-prefix sum (scan), histogram computation, convolution, Fast Fourier Transform (FFT), and more. Microsoft Corporation offers the DirectX™ HLSL (High Level Shading Language) Compute Shader as a software API (application programming interface) to access and utilize shared memory capabilities. Note that DirectX, HLSL, and Compute Shader will be referred to as examples, with the understanding that comments and discussion directed thereto are equally applicable to other GPU programming frameworks such as CUDA (Compute Unified Device Architecture), OpenCL (Open Computing Language), etc. These will be referred to generically as “compute shaders”.
A complete software platform should provide efficient software rasterization of a compute shader (or the like) on CPUs as a fallback when GPU hardware is not an option, or when the software platform is used in a headless VM (virtual machine) scenario, without the need to implement both GPU and CPU hardware solutions. That is, it is sometimes desirable to execute shader language code on a CPU rather than a GPU. However, mapping GPU-centric compute shaders onto CPUs efficiently is non-trivial, primarily because of thread synchronization, which is enforced by thread barriers (or syncs).
While the efficiency of scalar shader code is important, discussion herein relates to efficiently mapping onto CPUs (as opposed to GPUs) the parallelism found in compute shaders. Compute shaders may expose parallelism in different ways. For example, the DirectCompute™ Dispatch call defines a grid of thread blocks to expose parallelism at a coarse level, which is trivial to map onto CPU threads. Each thread block is an instance of a compute shader program that is executed by multiple shader threads (a shader is analogous to a kernel in CUDA, for example). The shader threads of a block may share data via a shared memory that is common to the threads in the block but private to the thread block. The threads of each thread block may be synchronized via barriers so that shared memory can be accessed without data races. GPUs typically execute compute shaders via hardware thread contexts, in groups of threads (warps or wavefronts), and each context may legally execute the program until it encounters a barrier, at which point the context must wait for all other contexts to reach the same barrier. Hardware context switching in GPUs is fast and heavily pipelined. CPUs, however, have no such hardware support, which makes it difficult to execute compute shaders efficiently on CPUs.
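To make the thread-block and barrier model concrete, the following is a minimal C++ sketch of how one very simple shader might be run on a CPU. It is an illustration under stated assumptions, not necessarily the transformation described in this document. The example shader (shown in the leading comment), the names `kBlockSize`, `run_block`, and `dispatch`, and the choice to split the shader body into one loop per barrier-delimited phase are all hypothetical. The approach shown only works as written because the barrier sits at the top level of the shader rather than inside divergent control flow.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical HLSL shader being emulated (each block reverses its 64 elements):
//
//   groupshared float tile[64];
//   [numthreads(64, 1, 1)]
//   void CSMain(uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID) {
//       tile[gtid.x] = input[gid.x * 64 + gtid.x];
//       GroupMemoryBarrierWithGroupSync();
//       output[gid.x * 64 + gtid.x] = tile[63 - gtid.x];
//   }

constexpr std::size_t kBlockSize = 64;  // shader threads per thread block

// Executes every shader thread of one thread block on a single CPU thread.
void run_block(const std::vector<float>& in, std::vector<float>& out,
               std::size_t block) {
    // Stands in for shared memory: visible to all shader threads of this
    // block, private to the block.
    std::array<float, kBlockSize> tile{};
    const std::size_t base = block * kBlockSize;

    // Phase 1: the code before the barrier, run for every shader thread.
    for (std::size_t t = 0; t < kBlockSize; ++t) {
        tile[t] = in[base + t];
    }

    // The barrier is implicit here: the loop above has finished, so every
    // shader thread has written its slot before any thread reads one.

    // Phase 2: the code after the barrier, run for every shader thread.
    for (std::size_t t = 0; t < kBlockSize; ++t) {
        out[base + t] = tile[kBlockSize - 1 - t];
    }
}

// Emulates a one-dimensional Dispatch over in.size() / kBlockSize blocks.
std::vector<float> dispatch(const std::vector<float>& in) {
    assert(in.size() % kBlockSize == 0);  // assumption: no partial blocks
    std::vector<float> out(in.size());
    const std::size_t num_blocks = in.size() / kBlockSize;
    // Blocks share no state, so each iteration could be handed to a
    // separate CPU thread; a serial loop keeps the sketch short.
    for (std::size_t b = 0; b < num_blocks; ++b) {
        run_block(in, out, b);
    }
    return out;
}
```

Calling `dispatch` on a 128-element input processes two thread blocks, each of which reverses its own 64-element slice. The design point is that no per-shader-thread context switch is needed: the barrier is replaced by the boundary between two ordinary loops, at the cost of requiring the barrier's position to be known when the loops are generated.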
Techniques discussed below relate to transforming a compute shader program into an equivalent CPU program that delivers acceptable performance on CPUs.