Recent trends indicate a significant increase in the use of GPUs (graphics processing units) for general-purpose computing (GPGPU). That is, GPUs are increasingly used for computing not necessarily related to computer graphics, such as physics simulation, video transcoding, and other data-parallel computing. Furthermore, the introduction of on-chip shared memory in GPUs has led to marked performance improvements for widely used compute-intensive algorithms such as all-prefix sum (scan), histogram computation, convolution, Fast Fourier Transform (FFT), physics simulations, and more. Microsoft Corporation offers the DirectX™ HLSL (High Level Shading Language)™ Compute Shader as a software API (application programming interface) to access and utilize shared memory capabilities. Note that DirectX, HLSL, and Compute Shader will be referred to as examples, with the understanding that comments and discussion directed thereto are equally applicable to other shading languages such as CUDA (Compute Unified Device Architecture), OpenCL (Open Computing Language), etc. These will be referred to generically as “compute shaders”.
A complete software platform should provide efficient software rasterization of a compute shader (or the like) on CPUs. This provides a fallback when GPU hardware is not an option, or when the software platform is used in a headless VM (Virtual Machine) scenario, without the need to implement both GPU and CPU hardware solutions. That is, it is sometimes desirable to execute shader language code on a CPU rather than a GPU. However, mapping GPU-centric compute shaders onto CPUs efficiently is non-trivial, primarily due to thread synchronization, which is enforced by thread barriers (or syncs).
To address this problem, techniques have been developed to partition a compute shader into maximal-size regions, called thread loops, thus allowing compute shader code to be mapped efficiently to CPUs despite the presence of thread barriers. For that technique, see commonly assigned U.S. patent application Ser. No. 13/398,798, titled “RASTERIZATION FOR COMPUTE SHADERS”, filed Feb. 16, 2012, and incorporated by reference herein. While thread loop transformations are helpful, the resulting thread loops may themselves be subjected to optimizations that improve their efficiency when running on a CPU.
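As an informal illustration of the general idea only, and not of the method of the referenced application, the following C++ sketch shows how a small, hypothetical shader with one thread barrier might be run on a CPU as two thread loops. The shader itself, the group size `kGroupSize`, and the function name `RunGroupOnCpu` are assumptions made for this example. The code before the barrier becomes one loop over all thread indices, the code after it becomes a second loop, and the barrier is satisfied simply because the first loop completes before the second begins.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical thread-group size; a real compute shader declares its own.
constexpr std::size_t kGroupSize = 8;

// CPU execution of one thread group of a hypothetical two-phase shader:
//   shared[t] = in[t];
//   <thread barrier>
//   out[t] = shared[t] + shared[(t + 1) % kGroupSize];
std::vector<float> RunGroupOnCpu(const std::vector<float>& in) {
    assert(in.size() == kGroupSize);

    // Stands in for the GPU's on-chip group shared memory.
    std::array<float, kGroupSize> shared{};
    std::vector<float> out(kGroupSize);

    // Thread loop 1: the region before the barrier, run once per thread.
    for (std::size_t t = 0; t < kGroupSize; ++t) {
        shared[t] = in[t];
    }

    // The barrier needs no explicit code here: loop 1 has finished for
    // every thread index before loop 2 starts.

    // Thread loop 2: the region after the barrier, run once per thread.
    // Each "thread" reads a neighbor's value, which is why the barrier
    // was required in the original shader.
    for (std::size_t t = 0; t < kGroupSize; ++t) {
        out[t] = shared[t] + shared[(t + 1) % kGroupSize];
    }
    return out;
}
```

Calling `RunGroupOnCpu` with the eight inputs 1 through 8 yields, for each element, the sum of that element and its cyclic successor. Running the two regions as a single fused loop would be incorrect in this sketch, because a thread could read a neighbor's shared value before that neighbor had written it.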
Techniques discussed below relate to optimizing thread loop configuration and execution.