1. Field of the Invention
The present invention is generally directed to computing systems. More particularly, the present invention is directed to processing system calls from a graphics processing unit (GPU) to a central processing unit (CPU).
2. Background Art
The desire to use a GPU for general computation has become much more pronounced recently due to the GPU's exemplary performance per unit power and/or cost. The computational capabilities of GPUs have generally grown at a rate exceeding that of the corresponding CPU platforms. This growth, coupled with the explosion of the mobile computing market and the server/enterprise systems needed to support it, has been used to provide a desired quality of user experience. Consequently, the combined use of CPUs and GPUs for executing workloads with data-parallel content is becoming a volume technology.
However, GPUs have traditionally operated in a constrained programming environment, available only for the acceleration of graphics. These constraints arose from the fact that GPUs did not have as rich a programming ecosystem as CPUs. Their use, therefore, has been mostly limited to two-dimensional (2D) and three-dimensional (3D) graphics and a few leading edge multimedia applications, which are already accustomed to dealing with graphics and video application programming interfaces (APIs).
With the advent of multi-vendor supported OpenCL™ and DirectCompute®, standard APIs and supporting tools, the use of GPUs has been extended beyond traditional graphics applications. Although OpenCL and DirectCompute are a promising start, many hurdles remain to creating an environment and ecosystem that allows the combination of CPU and GPU to be used as fluidly as the CPU for most programming tasks.
Existing computing systems often include multiple processing devices. For example, some computing systems include both a CPU and a GPU on separate chips (e.g., the CPU might be located on a motherboard and the GPU might be located on a graphics card) or in a single chip package. Both of these arrangements, however, still include significant challenges associated with (i) separate memory systems, (ii) efficient scheduling, (iii) providing quality of service (QoS) guarantees between processes, (iv) programming model, and (v) compiling to multiple target instruction set architectures (ISAs)—all while minimizing power consumption.
For example, the discrete-chip arrangement forces system and software architects to utilize chip-to-chip interfaces for each processor to access memory. While these external interfaces (e.g., chip-to-chip) negatively affect memory latency and power consumption for cooperating heterogeneous processors, the separate memory systems (i.e., separate address spaces) and driver-managed shared memory create overhead that becomes unacceptable for fine-grain offload.
In computing systems that include a CPU and a GPU, some commands cannot be executed efficiently on the GPU and require the CPU's attention. In a conventional system, the CPU splits a GPU program into separate phases, with CPU communication occurring during phase transitions. The GPU executes identified programs during a phase. For example, in a first phase the GPU executes a set of instructions that the CPU identified for execution during that phase. When the GPU completes execution of the first phase, the CPU executes any commands required of it during the phase transition, prior to the start of the next phase. After the CPU communication is completed, the GPU starts executing the commands for the second phase, and so on.
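The host-side control flow of the conventional phased approach described above can be illustrated with a minimal sketch. This is not an implementation from any particular system; all names (run_phase_on_gpu, service_cpu_requests, phased_execution) are hypothetical, and the GPU kernel launch is simulated with an ordinary loop:

```python
# Hypothetical sketch of the conventional phased approach: the CPU
# pre-splits the GPU program into phases, launches each phase, and
# services CPU-side requests only at phase boundaries.

def run_phase_on_gpu(phase, work_items):
    """Stand-in for a GPU kernel launch: every work item executes the
    phase's instructions and may record a request for the CPU."""
    results = []
    cpu_requests = []
    for item in work_items:
        results.append((phase, item))      # the work item's "computation"
        if item % 2 == 0:                  # assume some items need the CPU
            cpu_requests.append((phase, item))
    return results, cpu_requests

def service_cpu_requests(requests, log):
    """The CPU handles requests accumulated during the finished phase;
    only after this does the next phase begin."""
    for req in requests:
        log.append(("cpu_handled", req))

def phased_execution(num_phases, work_items):
    log = []
    for phase in range(num_phases):
        # Every phase relaunches the full work-group layout, even for
        # work items that made no CPU call in earlier phases.
        results, requests = run_phase_on_gpu(phase, work_items)
        log.extend(("gpu", r) for r in results)
        service_cpu_requests(requests, log)  # phase-transition overhead
    return log
```

Note that all CPU servicing for a phase completes before any work of the next phase starts, which is precisely the coarse-grained synchronization the phased model imposes.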
However, there are various problems associated with such a phased approach. Splitting a GPU program into phases is non-trivial and may not always be possible. There is also significant overhead associated with starting each phase, which slows down this approach. Further, work items that did not call the CPU still have to be launched, since each phase must use the same work-group layout, which results in additional processing to track the status of the work items.