The desire to use a graphics processing unit (GPU) for general computation has become much more pronounced recently due to the GPU's exemplary performance per unit power and/or cost. The computational capabilities of GPUs, generally, have grown at a rate exceeding that of the corresponding central processing unit (CPU) platforms. This growth, coupled with the explosion of the mobile computing market (e.g., notebooks, smartphones, tablets, etc.) and its necessary supporting server/enterprise systems, has been used to provide a desired quality of user experience. Consequently, the combined use of CPUs and GPUs for executing workloads with data parallel content is becoming a volume technology.
However, GPUs have traditionally operated in a constrained programming environment, available primarily for the acceleration of graphics. These constraints arose from the fact that GPUs did not have as rich a programming ecosystem as CPUs. Their use, therefore, has been mostly limited to two-dimensional (2D) and three-dimensional (3D) graphics and a few leading-edge multimedia applications, which are already accustomed to dealing with graphics and video application programming interfaces (APIs).
With the advent of multi-vendor supported OpenCL® and DirectCompute®, standard APIs and supporting tools, the use of GPUs has been extended beyond traditional graphics applications. Although OpenCL and DirectCompute are a promising start, many hurdles remain to creating an environment and ecosystem that allows the combination of a CPU and a GPU to be used as fluidly as the CPU for most programming tasks.
Existing computing systems often include multiple processing devices. For example, some computing systems include both a CPU and a GPU on separate chips (e.g., the CPU might be located on a motherboard and the GPU might be located on a graphics card) or in a single chip package. Both of these arrangements, however, still face significant challenges associated with (i) efficient scheduling, (ii) providing quality of service (QoS) guarantees between processes, (iii) programming model, (iv) compiling to multiple target instruction set architectures (ISAs), and (v) separate memory systems, all while minimizing power consumption.
Existing multiprocessor computing systems often have insufficient memory coherency and, consequently, are unable to efficiently utilize the separate memory systems. For example, the CPU cannot effectively write to a block of memory and then access that memory from the GPU device unless the GPU explicitly synchronizes or flushes its caches. Otherwise, the write will not be made visible to the GPU device. This is because a GPU is optimized for a weak consistency memory model, in which load commands may be reordered after other load commands and store commands may be reordered after other store commands.
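The visibility problem described above can be illustrated with a minimal sketch, offered only as an analogy: it uses two CPU threads and C++ atomics rather than a real CPU/GPU pair, and the names `SharedBlock`, `produce`, `consume`, and `run_handoff` are hypothetical. Under a weakly ordered model, the writer has to publish its stores explicitly (here with a release store) and the reader has to synchronize explicitly (here with an acquire load) before the written data is guaranteed to be visible. An actual heterogeneous system would rely on device-specific cache flushes or API-level synchronization instead.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a block of memory shared by two processing devices.
struct SharedBlock {
    std::vector<std::uint32_t> data;
    std::atomic<bool> ready{false};
    explicit SharedBlock(std::size_t n) : data(n, 0) {}
};

// Writer side: fill the block with 1..n, then publish it with a release store
// so the data writes cannot be reordered after the flag write.
void produce(SharedBlock& block) {
    for (std::size_t i = 0; i < block.data.size(); ++i) {
        block.data[i] = static_cast<std::uint32_t>(i + 1);
    }
    block.ready.store(true, std::memory_order_release);
}

// Reader side: wait on the flag with an acquire load, which guarantees that
// every write made before the matching release store is visible afterwards.
std::uint64_t consume(const SharedBlock& block) {
    while (!block.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    std::uint64_t sum = 0;
    for (std::uint32_t v : block.data) {
        sum += v;
    }
    return sum;
}

// Runs one writer and one reader concurrently and returns the reader's sum.
std::uint64_t run_handoff(std::size_t n) {
    SharedBlock block(n);
    std::uint64_t result = 0;
    std::thread consumer([&] { result = consume(block); });
    std::thread producer([&] { produce(block); });
    producer.join();
    consumer.join();
    return result;
}
```

If the release/acquire pair were replaced with relaxed operations, the language would no longer guarantee that the reader observes the filled block, which mirrors the unsynchronized case the paragraph describes.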
In addition, in existing multiprocessor computing systems the CPU is forced to statically split memory locations between two different memory heaps: one is private to the CPU and the other is shared coherently with the CPU. As a result of statically splitting memory locations between two memory heaps, existing multiprocessor computing systems are unable to efficiently utilize the separate memory systems.
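The cost of a static split can be sketched as follows. This is an illustrative model only, not a description of any particular system: `Heap` and `SplitMemory` are hypothetical types that track byte counts rather than real allocations, and the split point is assumed to be fixed at construction time.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-capacity heap that only tracks how many bytes are in use.
struct Heap {
    std::size_t capacity;
    std::size_t used;

    explicit Heap(std::size_t cap) : capacity(cap), used(0) {}

    // Succeeds only if the request fits in this heap's remaining space.
    bool allocate(std::size_t bytes) {
        if (bytes > capacity - used) {
            return false;
        }
        used += bytes;
        return true;
    }

    std::size_t free_bytes() const { return capacity - used; }
};

// Memory statically divided into a private heap and a shared heap.
// Assumes private_bytes <= total; the boundary never moves after construction.
struct SplitMemory {
    Heap private_heap;
    Heap shared_heap;

    SplitMemory(std::size_t total, std::size_t private_bytes)
        : private_heap(private_bytes), shared_heap(total - private_bytes) {}
};
```

In this model, once `private_heap` is full a further private allocation fails even when `shared_heap` still has free bytes, because the boundary between the two heaps cannot shift to follow demand.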