1. Field of the Invention
The present invention is generally directed to computing systems. More particularly, the present invention is directed to improving utilization of resources within a processor or computing system.
2. Background Art
The desire to use a graphics processing unit (GPU) for general computation has become much more pronounced recently due to the GPU's exemplary performance per unit power and/or cost. The computational capabilities for GPUs, generally, have grown at a rate exceeding that of the corresponding central processing unit (CPU) platforms. This growth, coupled with the explosion of the mobile computing market (e.g., notebooks, mobile smart phones, tablets, etc.) and its necessary supporting server/enterprise systems, has been used to provide a desired quality of user experience. Consequently, the combined use of CPUs and GPUs for executing workloads with data parallel content is becoming a volume technology.
However, GPUs have traditionally operated in a constrained programming environment, available primarily for the acceleration of graphics. These constraints arose from the fact that GPUs did not have as rich a programming ecosystem as CPUs. Their use, therefore, has been mostly limited to two-dimensional (2D) and three-dimensional (3D) graphics and a few leading edge multimedia applications, which are already accustomed to dealing with graphics and video application programming interfaces (APIs).
With the advent of multi-vendor supported OpenCL® and DirectCompute®, standard APIs and supporting tools, the use of GPUs has been extended beyond traditional graphics applications. Although OpenCL and DirectCompute are a promising start, there are many hurdles remaining to creating an environment and ecosystem that allows the combination of a CPU and a GPU to be used as fluidly as the CPU for most programming tasks.
Existing computing systems often include multiple processing devices. For example, some computing systems include both a CPU and a GPU on separate chips (e.g., the CPU might be located on a motherboard and the GPU might be located on a graphics card) or in a single chip package. Both of these arrangements, however, still include significant challenges associated with (i) separate memory systems, (ii) efficient scheduling, (iii) providing quality of service (QoS) guarantees between processes, (iv) programming model, and (v) compiling to multiple target instruction set architectures (ISAs)—all while minimizing power consumption.
For example, the discrete chip arrangement forces system and software architects to utilize chip to chip interfaces for each processor to access memory. While these external interfaces (e.g., chip to chip) negatively affect memory latency and power consumption for cooperating heterogeneous processors, the separate memory systems (i.e., separate address spaces) and driver managed shared memory create overhead that becomes unacceptable for fine grain offload.
Attributes of both the discrete and single chip arrangements can also limit the types of workloads that will result in efficient execution on the GPU.
One example includes the limitations of single task launch systems. In this example, physics or artificial intelligence workloads with insufficient data parallel operations between serial reduction steps may not be able to overcome the communications overhead associated with execution on the GPU. This limitation can exist because the CPU can quickly require the results of the data parallel operations performed by the GPU. However, the high overhead of compute offload to the GPU (e.g., dispatching work to the GPU, the latency incurred in reporting results, and the possibility of having to wait for previously-issued work) often results in unacceptable performance. Unless offload overheads can be reduced, or multiple offloads can be concurrently scheduled, the opportunity to accelerate fine grain data parallel workloads will be lost.
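By way of illustration only, the trade-off between offload overhead and the amount of data parallel work can be sketched with a simple timing model. The model, the function names, and the numeric values below are assumptions chosen for illustration; they are not taken from any particular system. The sketch treats each offload as a fixed overhead plus the parallel work divided by an assumed GPU speedup.

```python
def offload_time(parallel_work_us, gpu_speedup, overhead_us):
    """Time for one offload: fixed overhead plus the accelerated parallel work.

    All three parameters are hypothetical model inputs (times in microseconds).
    """
    if gpu_speedup <= 0:
        raise ValueError("gpu_speedup must be positive")
    return overhead_us + parallel_work_us / gpu_speedup


def offload_pays_off(parallel_work_us, gpu_speedup, overhead_us):
    """True when offloading is faster than running the same work on the CPU."""
    return offload_time(parallel_work_us, gpu_speedup, overhead_us) < parallel_work_us


def break_even_work(gpu_speedup, overhead_us):
    """Amount of CPU-time work at which offload and CPU execution tie.

    Solves overhead + w / s = w for w, giving w = overhead * s / (s - 1).
    """
    if gpu_speedup <= 1:
        raise ValueError("no break-even point unless gpu_speedup > 1")
    return overhead_us * gpu_speedup / (gpu_speedup - 1)


if __name__ == "__main__":
    # Assumed values: 10x GPU speedup, 50 us of fixed offload overhead.
    for work_us in (20, 1000):
        print(work_us, offload_pays_off(work_us, gpu_speedup=10, overhead_us=50))
    print(break_even_work(gpu_speedup=10, overhead_us=50))
```

Under these assumed values, a fine grain step of 20 microseconds is slower when offloaded, because the fixed overhead alone exceeds the CPU execution time, while a 1000 microsecond step benefits. Reducing the overhead term lowers the break-even point, which is the effect described above.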
In another example of single task launch system limitations, consider a workload that requires heavy register resources and also has a long executing kernel. Here, the large register usage creates sparse occupancy of the GPU compute units and, therefore, limits the latency hiding attributes the GPU normally offers. In current systems, workloads that require heavy register resources are typically larger than the average workload. These larger workloads can prevent the cooperative mixing of a different kernel with small resource requirements. In yet another example, a low latency kernel can also fail to fully occupy the GPU's computing resources. In this scenario, some workgroups finish before all the workgroup slots can be occupied, leaving the GPU only partially utilized. Unless at least these limitations of single task launch systems can be overcome, the GPU will suffer from the inability to hide the latency of such bulky workloads.
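The relationship between register usage and occupancy described above can be illustrated with a short calculation. This is a simplified sketch under stated assumptions: the register file size, workgroup size, and number of workgroup slots are hypothetical values, and real hardware applies additional limits (e.g., local memory and allocation granularity) that are ignored here.

```python
def resident_workgroups(register_file_size, workgroup_size,
                        registers_per_item, workgroup_slots):
    """Workgroups that fit on one compute unit at the same time.

    The count is bounded both by the (assumed) number of workgroup slots
    and by how many workgroups' registers fit in the register file.
    """
    registers_per_workgroup = workgroup_size * registers_per_item
    fit_by_registers = register_file_size // registers_per_workgroup
    return min(workgroup_slots, fit_by_registers)


def occupancy(register_file_size, workgroup_size,
              registers_per_item, workgroup_slots):
    """Fraction of workgroup slots that are actually occupied."""
    resident = resident_workgroups(register_file_size, workgroup_size,
                                   registers_per_item, workgroup_slots)
    return resident / workgroup_slots


if __name__ == "__main__":
    # Hypothetical compute unit: 16384 registers, 8 workgroup slots,
    # 64 work-items per workgroup.
    light = occupancy(16384, 64, registers_per_item=16, workgroup_slots=8)
    heavy = occupancy(16384, 64, registers_per_item=128, workgroup_slots=8)
    print(light, heavy)
```

With these assumed numbers, the kernel using 16 registers per work-item fills all eight slots, whereas the kernel using 128 registers per work-item leaves only two workgroups resident. Fewer resident workgroups means fewer candidates to switch to while one waits on memory, which is why heavy register usage limits latency hiding.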