Over the past decade, computer architects have sought to design processors that deliver ever-higher performance. However, because power consumption grows nearly cubically with processor frequency, the clock frequency of a single processor cannot be raised beyond a certain value. This so-called “power wall” is one of the most critical constraints in processor development. Because feature sizes continue to shrink, the number of transistors on a single chip is expected to double in the next few years. Accordingly, manufacturers now place multiple cores on a single chip. Compared to a uniprocessor chip running at an extremely high frequency, a multi-core design can deliver better performance at lower power consumption. Chip multiprocessors (CMPs) have therefore become mainstream products for delivering high performance while limiting overall power consumption.
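As a rough illustration of why the nearly cubic frequency-power relation favors multiple cores, consider the following back-of-the-envelope comparison. It is a sketch under stated assumptions, not a measured result: the relation is taken to be exactly cubic, the proportionality constant $k$ is hypothetical, and the workload is assumed to parallelize perfectly across two cores so that two cores at frequency $f/2$ match the throughput of one core at frequency $f$.

```latex
\[
P_{\mathrm{single}}(f) \approx k f^{3}
\]
\[
P_{\mathrm{dual}}\!\left(\frac{f}{2}\right) \approx 2k\left(\frac{f}{2}\right)^{3} = \frac{1}{4}\,k f^{3}
\]
```

Under these idealized assumptions, the dual-core configuration delivers the same aggregate throughput at about one quarter of the power. Real workloads rarely scale perfectly, and static power is ignored here, so the actual saving would be smaller.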
CMPs trim the power budget by integrating multiple processor cores on the same chip instead of raising the clock frequency of a single core. Most CMPs currently on the market, such as the Intel i7 960 processor, replicate cores of the same type; this architectural homogeneity simplifies many design issues. Recently, computer architects have been developing heterogeneous CMPs, which combine different types of cores for better energy efficiency. One example is the Cell microprocessor co-developed by Sony, Toshiba, and IBM. The Cell is composed of eight Synergistic Processor Elements (SPEs) and one Power Processor Element (PPE): the PPE, the more advanced processor unit, works as a controller, while the high throughput comes from parallel execution on the eight SPEs. Alternatively, placing a number of cores with an identical instruction set architecture (ISA) but different hardware configurations on the same chip provides another type of heterogeneity. In such cases, programs may be dynamically mapped to the most suitable core(s) according to their resource requirements.
In a heterogeneous CMP, a program scheduler is responsible for program-to-core assignment at runtime. To increase energy efficiency, the scheduler should be aware of the differences among the integrated cores and among program behaviors, and should assign each job (i.e., program) to the most suitable core accordingly. Such functionality is not available in most state-of-the-art schedulers, which were designed for homogeneous architectures and therefore cannot achieve optimal efficiency on a heterogeneous system. Strategies have been proposed to address this problem, such as round-robin scheduling, sampling-based dynamic scheduling, and latency-aware scheduling; however, each suffers from drawbacks. A round-robin scheduler, which periodically migrates jobs among cores, can execute inefficiently in certain periods because it rotates jobs without determining which assignment is optimal. Sampling-based dynamic scheduling introduces substantial overhead because it must force migrations in order to check scheduling conditions. Latency-aware scheduling categorizes programs as processor-bound or memory-bound by estimating their last-level cache miss penalties at runtime, and assigns programs to different types of cores according to that categorization. However, the last-level cache miss rate is not a good indicator of a program's energy efficiency. Accordingly, current technologies and proposed strategies do not provide optimal scheduling based on the differences among cores and the behaviors of programs.
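To make the contrast between two of these strategies concrete, the following is a minimal Python sketch, not an implementation of any particular published scheduler. All names (`ProgramSample`, `stall_fraction`, `assign_core`, `round_robin_step`), the 0.3 threshold, the per-miss penalty, and the choice to send memory-bound programs to the small core are assumptions made for illustration; the text above states only that programs are categorized by estimated last-level cache miss penalty and mapped to different core types.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical cutoff: share of cycles lost to last-level cache (LLC) misses
# at or above which a program is treated as memory-bound. Not from the source.
MEMORY_BOUND_THRESHOLD = 0.3


@dataclass
class ProgramSample:
    """Runtime counters for one program (illustrative fields only)."""
    name: str
    total_cycles: int
    llc_misses: int
    miss_penalty_cycles: int  # assumed average stall cycles per LLC miss


def stall_fraction(sample: ProgramSample) -> float:
    """Estimate the share of cycles spent waiting on LLC misses."""
    if sample.total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    stalled = sample.llc_misses * sample.miss_penalty_cycles
    return min(1.0, stalled / sample.total_cycles)


def classify(sample: ProgramSample) -> str:
    """Latency-aware categorization: processor-bound or memory-bound."""
    if stall_fraction(sample) >= MEMORY_BOUND_THRESHOLD:
        return "memory-bound"
    return "processor-bound"


def assign_core(sample: ProgramSample) -> str:
    """Assumed mapping: memory-bound -> small core, processor-bound -> big core."""
    return "small" if classify(sample) == "memory-bound" else "big"


def round_robin_step(jobs_by_core: List[str]) -> List[str]:
    """Rotate jobs by one core position, ignoring program behavior entirely."""
    return jobs_by_core[-1:] + jobs_by_core[:-1]
```

For example, a sample with 1,000,000 cycles, 2,000 LLC misses, and a 200-cycle penalty has a stall fraction of 0.4 and would be sent to the small core under these assumptions, whereas `round_robin_step` would move it regardless of its behavior. The sketch also shows the limitation noted above: the decision rests entirely on cache-miss behavior, which says nothing directly about energy efficiency.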