1. Technical Field
The present invention relates to graphics processing, and more particularly to energy-aware workload consolidation on a graphics processing unit (GPU).
2. Description of the Related Art
Heterogeneous clusters, which combine diverse computing components for domain-specific acceleration, can provide substantial improvements in performance or energy efficiency. Optimizing the usage of these computing components is vital to leveraging a heterogeneous cluster.
A general purpose graphics processing unit (GPGPU) is an important computing component of a state-of-the-art heterogeneous cluster. There are many reports of using a GPU to achieve large speedups for data parallel applications, such as support vector machine (SVM) training (172×) and the K-nearest neighbor algorithm (470×). However, some applications do not achieve good speedup on the GPU and have performance that is only comparable to, or even worse than, that on a CPU. Enterprise workloads such as search, data mining and analytics, and so forth are examples; they typically involve a large number of users who simultaneously use applications hosted on clusters of commodity computers. The execution time of these kernels on the GPU is comparable to, or in some cases greater than, the execution time on a multicore CPU. For example, an encryption workload with an input file of 12 KB performs 18.33% worse on a GPU (NVIDIA Tesla C1060) than on a CPU (Intel Xeon E5520 Quad-core). A sorting workload with 6144 input elements is only 3.69× faster on the GPU than on the CPU. The main reason for this poor performance on the GPU is low GPU hardware resource utilization; GPU computation power is thus not fully leveraged. The encryption workload mentioned above uses only 3 streaming multiprocessors (SMs), and the sorting workload uses only 6 SMs. A typical GPGPU consumes more than 200 W at peak, about twice the peak power requirement of a CPU. Running applications with such low hardware utilization on power-hungry GPUs is not energy efficient.
There are other scenarios where GPU resources are unevenly utilized. For example, an encryption workload with an input file of 184 KB uses 45 thread blocks. These blocks could be unevenly distributed on a GPU such as a Tesla C1060, which has 30 SMs. Lightly-loaded SMs finish earlier and have to wait for the remaining SMs to finish. On a modern GPU that does not support clock gating, these lightly-loaded SMs stay active even while doing nothing. These SMs can be released by the applications and stay idle, which wastes energy.
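By way of a non-limiting illustration, the imbalance described above can be sketched in a few lines of Python. The sketch assumes a simple round-robin assignment of thread blocks to SMs, blocks of equal duration, and one block executing per SM at a time; these are simplifying assumptions and not a description of the actual hardware scheduler. The function names `blocks_per_sm` and `idle_fraction` are hypothetical.

```python
def blocks_per_sm(num_blocks: int, num_sms: int) -> list[int]:
    """Assign thread blocks to SMs round-robin (assumed policy, for illustration only)."""
    base, extra = divmod(num_blocks, num_sms)
    # The first `extra` SMs each receive one additional block.
    return [base + 1 if sm < extra else base for sm in range(num_sms)]


def idle_fraction(loads: list[int]) -> float:
    """Fraction of SM time spent waiting, assuming every block takes equal time."""
    longest = max(loads)
    if longest == 0:
        return 0.0
    # All SMs remain occupied until the most heavily loaded SM finishes.
    return 1 - sum(loads) / (longest * len(loads))


# 45 thread blocks on a 30-SM GPU, the figures given in the text above.
loads = blocks_per_sm(45, 30)
heavy = loads.count(2)   # SMs that run two blocks
light = loads.count(1)   # SMs that run one block and then wait
wasted = idle_fraction(loads)
print(heavy, light, wasted)
```

Under these assumptions, half of the SMs carry two blocks and half carry one, so one quarter of the total SM time is spent waiting for the more heavily loaded SMs to finish.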
Given a batch of workload instances run on a GPU whose execution configurations are fixed, a recent study establishes that consolidating the computation workload of multiple kernels is beneficial. Here, workload consolidation refers to merging multiple workloads (either homogeneous, i.e., the same workloads with different input data; or heterogeneous, i.e., different workloads with different input data) to concurrently execute on a GPU. Workload consolidation is different from the multi-kernel execution feature offered by NVIDIA's Fermi GPU architecture. For example, while multiple kernels can run concurrently on Fermi GPUs, they have to belong to the same context.
One prior art approach demonstrated the performance benefit of merging two workloads into one kernel. In that approach, the decision to merge workloads is based on whether each kernel has enough data parallelism to fully utilize GPU resources. While this basic decision-making criterion is valid, there are many other reasons for losing performance with task consolidation. Consolidation carries the risk of losing performance due to contention for shared resources, such as GPU global memory bandwidth, shared memory, the register file, and constant memory. Moreover, not every set of two or more underutilized workloads can be merged, because the consolidated kernel has an increased power requirement. Since energy consumption is the product of power and execution time, the performance improvement must be high enough to offset the added power in order to achieve energy efficiency. To verify whether a consolidated workload is energy efficient, one has to develop the code for the consolidated workload and execute the code on a GPU.
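By way of a non-limiting illustration, the energy criterion stated above (energy as the product of power and execution time) can be sketched as follows. The sketch assumes that the unconsolidated baseline runs the workloads one after another, so that the baseline energy is the sum of the individual energies. The class `Run`, the function names, and all power and time figures are hypothetical and are not measurements from any GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One kernel execution: average power in watts, execution time in seconds."""
    avg_power_w: float
    time_s: float

    @property
    def energy_j(self) -> float:
        # Energy (joules) = power (watts) x time (seconds).
        return self.avg_power_w * self.time_s


def consolidation_saves_energy(separate: list[Run], consolidated: Run) -> bool:
    """True if the consolidated kernel uses less energy than running each workload alone."""
    baseline_j = sum(run.energy_j for run in separate)
    return consolidated.energy_j < baseline_j


def break_even_time_s(separate: list[Run], consolidated_power_w: float) -> float:
    """Longest consolidated execution time that still does not exceed the baseline energy."""
    return sum(run.energy_j for run in separate) / consolidated_power_w


# Hypothetical figures: two underutilized workloads, each 100 W for 2 s (400 J total).
separate = [Run(100.0, 2.0), Run(100.0, 2.0)]
good = consolidation_saves_energy(separate, Run(150.0, 2.5))  # 375 J consolidated
bad = consolidation_saves_energy(separate, Run(150.0, 3.0))   # 450 J consolidated
limit = break_even_time_s(separate, 150.0)
print(good, bad, limit)
```

With these hypothetical figures, the consolidated kernel draws more power than either workload alone, so it saves energy only if it completes in less than about 2.67 s. Contention for shared resources that stretches the execution time beyond that point makes consolidation a net loss.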