The desire to use a graphics processing unit (GPU) for general computation has become much more pronounced recently due to the GPU's performance per unit power and/or cost. GPUs, however, have traditionally operated in a constrained programming environment available primarily for the acceleration of graphics. With the advent of multi-vendor supported OpenCL® and DirectCompute®, standard APIs and supporting tools, the use of GPUs has been extended beyond traditional graphics applications. Although OpenCL and DirectCompute are a promising start, many hurdles remain to creating an environment and ecosystem that allows the combination of a CPU and a GPU to be used as fluidly as the CPU for most programming tasks.
Existing computing systems often include multiple processing devices. For example, some computing systems include both a CPU and a GPU on separate chips (e.g., the CPU might be located on a motherboard and the GPU might be located on a graphics card) or in a single chip package. Both of these arrangements, however, still include significant challenges associated with (i) separate memory systems, (ii) efficient scheduling, (iii) programming model, (iv) compiling to multiple target instruction set architectures (ISAs), and (v) providing quality of service (QoS) guarantees between processes, all while minimizing power consumption.
In another example, computational inefficiencies may arise in existing computing systems when each of the threads in a wavefront, executing the same software program on a GPU, encounters conditional code. Some threads may evaluate the condition as true and other threads may evaluate the condition as false. When conditional sections of code become more complex or include loops with different iteration counts per thread, the threads in the wavefront diverge and may never reconverge, which can cause significant performance loss. In a worst-case scenario, for example, a processor may execute only one thread at a time instead of a full wavefront. Thus, if the wavefront includes 64 threads, the processor operates at only 1/64 of peak performance.
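The 1/64 figure above can be checked with a small model. The following Python sketch is illustrative only and is not taken from any particular GPU or API: it assumes a 64-lane wavefront that issues one instruction step at a time in lockstep, with inactive threads masked off, and it measures the fraction of lane slots that do useful work. The names `lane_utilization`, `masked_loop`, and `fully_serialized` are hypothetical helpers introduced for this example.

```python
from fractions import Fraction

WAVEFRONT_SIZE = 64  # assumption: matches the 64-thread example in the text


def lane_utilization(active_masks):
    """Return the fraction of lane slots that did useful work.

    active_masks holds one list of booleans per issued step; True means
    the thread in that lane was active during the step.
    """
    issued = len(active_masks) * WAVEFRONT_SIZE
    if issued == 0:
        return Fraction(1)  # nothing was issued, so nothing was wasted
    useful = sum(sum(mask) for mask in active_masks)
    return Fraction(useful, issued)


def masked_loop(iteration_counts):
    """Model a loop whose trip count differs per thread.

    The wavefront keeps stepping until the slowest thread finishes;
    threads that have already finished are masked off.
    """
    steps = max(iteration_counts)
    return [[count > step for count in iteration_counts] for step in range(steps)]


def fully_serialized():
    """Model the worst case: exactly one thread is active per step."""
    return [[lane == turn for lane in range(WAVEFRONT_SIZE)]
            for turn in range(WAVEFRONT_SIZE)]


# Every thread runs 8 iterations: no divergence.
print(lane_utilization(masked_loop([8] * WAVEFRONT_SIZE)))  # 1

# Thread i runs i + 1 iterations: the wavefront waits for the slowest thread.
print(lane_utilization(masked_loop(list(range(1, WAVEFRONT_SIZE + 1)))))  # 65/128

# One thread at a time, as in the worst case described above.
print(lane_utilization(fully_serialized()))  # 1/64
```

Under these assumptions, a loop whose trip count varies per thread already wastes nearly half of the lane slots, and full serialization reproduces the 1/64 of peak performance stated above. Real hardware may reconverge threads or schedule them differently, so the model indicates the direction of the effect rather than predicting measured performance.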