1. Field of the Invention
This invention relates to computing systems, and more particularly, to automatically migrating the execution of work units between multiple heterogeneous cores.
2. Description of the Relevant Art
The parallelization of tasks is used to increase the throughput of computer systems. To this end, compilers may extract parallel tasks from program code to execute in parallel on the system hardware. With a single-core architecture, a single core may include deep pipelines configured to perform multi-threading. To further increase parallel execution on the hardware, a multi-core architecture may include multiple general-purpose cores. This type of architecture may be referred to as a homogeneous multi-core architecture, and it may provide higher instruction throughput than a single-core architecture.
Some software applications may not divide readily into parallel tasks. In addition, specific tasks may not execute efficiently on a general-purpose core. Particular instructions for a computationally intensive task may consume a disproportionate share of a shared resource, which delays the deallocation of the shared resource. Examples of such specific tasks may include cryptography, video graphics rendering and garbage collection.
To overcome the performance limitations of conventional general-purpose cores, a computer system may offload specific tasks to special-purpose hardware. This hardware may include a single instruction multiple data (SIMD) parallel architecture, a field-programmable gate array (FPGA), and other specialized cores. An architecture with different types of cores may be referred to as a heterogeneous multi-core architecture. Depending on the scheduling of tasks, this type of architecture may provide higher instruction throughput than a homogeneous multi-core architecture.
In many cases, particular software applications have data parallelism in which the execution of each work item, or parallel function call, is data dependent within itself. For example, a first work item may be data independent from a second work item, and each of the first and the second work items is scheduled on a separate path within a core with a SIMD micro-architecture. However, the number of instructions executed within each of the first and the second work items may be data-dependent. A conditional test implemented as a branch instruction may pass for the first work item but fail for the second work item, depending on the data for each work item.
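The data-dependent behavior described above may be illustrated with a minimal sketch. The function name `work_item`, the threshold test, and the amount of extra work on the passing path are all hypothetical choices made for illustration and are not drawn from any particular system. Two work items that are independent of each other execute different numbers of instructions because the conditional test passes for one and fails for the other.

```python
def work_item(value, threshold=10):
    """Run one hypothetical work item.

    Returns (result, instructions_executed). The instruction count is a
    simplified model: one for the conditional test, plus one per loop step.
    """
    executed = 1                # the conditional test itself
    result = value
    if value > threshold:       # the branch passes only for some data
        for _ in range(4):      # extra work performed only on the passing path
            result = result * 2 + 1
            executed += 1
    return result, executed


# Two data-independent work items with different input data.
first = work_item(42)   # test passes: the extra work is executed
second = work_item(3)   # test fails: only the test is executed
```

In this sketch the first work item executes five modeled instructions and the second executes one, although both run the same code.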
The efficiency of parallel execution may be reduced as the second work item halts execution and waits while the first work item continues with its ongoing execution. The inefficiency grows when only a few work items continue execution due to passed tests whereas most of the work items are idle due to failed tests. Even after an efficient, functionality-matching assignment of the work items by an OS scheduler in a heterogeneous multi-core architecture, system performance may still be reduced due to the data-dependent behavior of particular software applications.
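The loss of efficiency may be estimated with a simplified model. The sketch below assumes that all lanes of a SIMD group advance in lockstep, so the group occupies the hardware until its longest-running work item finishes; this assumption, the function name `simd_lane_utilization`, and the sample instruction counts are hypothetical and serve only to illustrate the effect.

```python
def simd_lane_utilization(instruction_counts):
    """Fraction of lane-cycles doing useful work under a lockstep model.

    instruction_counts holds the number of instructions each work item
    in one SIMD group actually needs to execute.
    """
    if not instruction_counts:
        return 1.0
    cycles = max(instruction_counts)    # the group finishes with its slowest lane
    if cycles == 0:
        return 1.0
    total_slots = cycles * len(instruction_counts)
    useful_slots = sum(instruction_counts)
    return useful_slots / total_slots


# Every work item takes the same path: no lane sits idle.
balanced = simd_lane_utilization([5, 5, 5, 5])

# One work item passes the test and runs long; seven fail and wait.
divergent = simd_lane_utilization([100, 1, 1, 1, 1, 1, 1, 1])
```

Under this model the balanced group is fully utilized, whereas the divergent group uses only 107 of 800 lane-cycles, which corresponds to the case in which a few work items continue execution while most are idle.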