Heterogeneous systems are becoming increasingly common in most market segments, including mobile phones, tablets, laptops, desktops, and servers. These systems typically incorporate one or more types of specialized processing cores along with the more general-purpose Central Processing Unit (CPU) cores. The specialized processing cores may include, for example, cores in Graphics Processing Units (GPUs), fixed-function hardware cores in Systems on a Chip (SoCs), small cores in SoCs, and specialized cores in servers. While the specialized cores are generally well-suited to perform their domain-specific tasks, they may also be used to perform other, more general-purpose tasks. Simultaneously utilizing these specialized cores along with the CPU cores often results in significant improvements in performance and energy efficiency, making it an attractive option for an application developer trying to maximize benefits from the hardware.
Finding a good partitioning of work between the cores (e.g., load-balancing), however, is generally a complex problem. The division of work between the CPU and a GPU, for example, has been the subject of numerous studies. Existing techniques typically fall into three broad categories, each of which may have associated drawbacks:
(1) Offline training: A runtime scheduling algorithm is trained offline on an input data set (e.g., in a training run), and the information obtained is subsequently used during the real runtime execution. The success of this approach depends to a large extent on how accurately the training reflects what occurs during the real runtime execution. Moreover, the training must be repeated for each new platform.
(2) Use of a performance model: Accurate performance models are difficult to construct, particularly for irregular workloads (e.g., where the distribution of the work can vary significantly between processors), since runtime behavior is highly dependent on characteristics of the input data.
(3) Extension of standard work-stealing with restrictions on stealing: Since the GPU typically cannot initiate communication with the CPU, addressing the problem of load imbalance may be limited to extensions of work-stealing in which work is pushed to the GPUs. Such approaches incur overhead on CPU execution, since the CPU has to act on behalf of the GPU workers or threads.
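By way of illustration only, the sensitivity of category (1) to the training input can be sketched as follows. This is a minimal sketch and is not taken from any of the existing techniques referred to above; the function names (trained_split, partition, makespan) and the throughput figures are hypothetical and chosen purely for the example. A split ratio is derived from throughputs measured in a training run and then applied, unchanged, to a real run.

```python
from __future__ import annotations


def trained_split(cpu_rate: float, gpu_rate: float) -> float:
    """Fraction of the work to assign to the GPU, derived from
    throughputs (items per second) measured in an offline training run."""
    return gpu_rate / (cpu_rate + gpu_rate)


def partition(n_items: int, gpu_fraction: float) -> tuple[int, int]:
    """Split n_items into (cpu_count, gpu_count) using a fixed ratio."""
    gpu_count = round(n_items * gpu_fraction)
    return n_items - gpu_count, gpu_count


def makespan(cpu_count: int, gpu_count: int,
             cpu_rate: float, gpu_rate: float) -> float:
    """Completion time when the CPU and GPU process their shares concurrently."""
    return max(cpu_count / cpu_rate, gpu_count / gpu_rate)


# Hypothetical training run: CPU at 100 items/s, GPU at 300 items/s.
fraction = trained_split(cpu_rate=100.0, gpu_rate=300.0)
cpu_n, gpu_n = partition(1000, fraction)

# If the real run matches the training run, both devices finish together.
matched = makespan(cpu_n, gpu_n, cpu_rate=100.0, gpu_rate=300.0)

# If an irregular input halves the GPU throughput, the trained split
# leaves the CPU idle while the GPU finishes its oversized share.
mismatched = makespan(cpu_n, gpu_n, cpu_rate=100.0, gpu_rate=150.0)

# Split that would have balanced the real run, for comparison.
ideal = makespan(*partition(1000, trained_split(100.0, 150.0)),
                 cpu_rate=100.0, gpu_rate=150.0)
```

Under these assumed figures, the trained split is balanced only while the real run resembles the training run; once the GPU throughput changes, the fixed split takes longer to complete than a split chosen for the actual rates.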
Although the following Detailed Description will proceed with reference being made to illustrative embodiments, many alternatives, modifications, and variations thereof will be apparent to those skilled in the art.