1. Field of the Invention
The present invention is generally directed to computing systems. More particularly, the present invention is directed to improving the execution of code by a processor.
2. Background Art
The desire to use a graphics processing unit (GPU) for general computation has become much more pronounced recently due to the GPU's exemplary performance per unit power and/or cost. The computational capabilities of GPUs, generally, have grown at a rate exceeding that of the corresponding central processing unit (CPU) platforms. This growth, coupled with the explosion of the mobile computing market (e.g., notebooks, mobile smart phones, tablets, etc.) and its necessary supporting server/enterprise systems, has been used to provide a desired quality of user experience. Consequently, the combined use of CPUs and GPUs for executing workloads with data parallel content is becoming a volume technology.
However, GPUs have traditionally operated in a constrained programming environment, available primarily for the acceleration of graphics. These constraints arose from the fact that GPUs did not have as rich a programming ecosystem as CPUs. Their use, therefore, has been mostly limited to two-dimensional (2D) and three-dimensional (3D) graphics and a few leading-edge multimedia applications, which are already accustomed to dealing with graphics and video application programming interfaces (APIs).
With the advent of multi-vendor supported OpenCL® and DirectCompute®, standard APIs and supporting tools, the use of GPUs has been extended beyond traditional graphics applications. Although OpenCL and DirectCompute are a promising start, there are many hurdles remaining to creating an environment and ecosystem that allows the combination of a CPU and a GPU to be used as fluidly as the CPU for most programming tasks.
Existing computing systems often include multiple processing devices. For example, some computing systems include both a CPU and a GPU on separate chips (e.g., the CPU might be located on a motherboard and the GPU might be located on a graphics card) or in a single chip package. Both of these arrangements, however, still include significant challenges associated with (i) separate memory systems, (ii) efficient scheduling, (iii) programming model, (iv) compiling to multiple target instruction set architectures (ISAs), and (v) providing quality of service (QoS) guarantees between processes, all while minimizing power consumption.
For example, since processes cannot be efficiently identified and/or preempted in existing computing systems, a rogue process can occupy the GPU hardware for arbitrary amounts of time. This diminishes the user's QoS.
In other cases, the ability to context switch off the hardware is severely constrained, occurring at very coarse granularity and only at a very limited set of points in a program's execution. This constraint exists because saving the necessary architectural and microarchitectural states for restoring and resuming a process is not supported. Lack of support for precise exceptions prevents a faulted job from being context switched out and restored at a later point, resulting in lower hardware usage because the faulted threads occupy hardware resources that sit idle during fault handling.
In another example, computational inefficiencies can arise in existing computing systems when the threads in a wavefront, all executing the same program on a GPU, encounter conditional code, as some threads may evaluate the condition as true, while others may evaluate the condition as false. When conditional sections of code become more complex or include loops with different iteration counts per thread, the threads in the wavefront diverge and may never reconverge, which can cause significant performance loss. In a worst-case scenario, for example, a processor may execute only one thread at a time instead of a full wavefront. Thus, if the wavefront includes 64 threads, the processor can operate at only 1/64 of peak performance.
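By way of illustration only, the arithmetic behind the 1/64 figure can be sketched with a simplified model. The model assumes that the processor executes each distinct control path once, serially, with only the threads on that path enabled. The function name `lane_utilization`, the path labels, and the unit cost per path are hypothetical and are not taken from any particular hardware or API.

```python
from collections import Counter

WAVEFRONT_SIZE = 64  # threads per wavefront, matching the example in the text


def lane_utilization(paths, cost=lambda path: 1):
    """Return the fraction of lane-cycles that perform useful work.

    paths[i] is a hypothetical label for the control path taken by thread i.
    Assumed model: each distinct path is executed once, serially, and only
    the threads on that path are enabled while it runs.
    """
    lanes_on_path = Counter(paths)
    # Cycles spent by the whole wavefront: one pass per distinct path.
    total_cycles = sum(cost(p) for p in lanes_on_path)
    # Lane-cycles in which a thread is actually enabled.
    useful_lane_cycles = sum(n * cost(p) for p, n in lanes_on_path.items())
    return useful_lane_cycles / (len(paths) * total_cycles)


uniform = [0] * WAVEFRONT_SIZE                    # no divergence
two_way = [i % 2 for i in range(WAVEFRONT_SIZE)]  # a simple if/else split
fully_diverged = list(range(WAVEFRONT_SIZE))      # every thread on its own path

print(lane_utilization(uniform))         # 1.0
print(lane_utilization(two_way))         # 0.5
print(lane_utilization(fully_diverged))  # 0.015625, i.e. 1/64
```

Under this model, utilization with equal-cost paths is simply the reciprocal of the number of distinct paths, which is why a fully diverged wavefront of 64 threads operates at 1/64 of peak. Actual hardware behavior depends on the reconvergence mechanism and is not captured by this sketch.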
Other attempts at addressing this problem have been prohibitively expensive, computationally inefficient, and/or overly complex.