The present invention relates to the field of data processing. More particularly, this invention relates to the configuration of a processor core for carrying out data processing operations.
Heterogeneous multicore systems—composed of multiple cores with varying capabilities, performance, and energy characteristics—have emerged as a promising approach to increasing energy efficiency and alleviating serial bottlenecks. The big.LITTLE technology provided by ARM Limited, Cambridge, UK is one example. This technology combines a set of Cortex-A15 (“big”) cores with Cortex-A7 (“LITTLE”) cores to create a heterogeneous processor. The Cortex-A15 is a 3-way out-of-order device with deep pipelines (15-25 stages). Conversely, the Cortex-A7 is a narrow in-order processor with a relatively short pipeline (8-10 stages). The Cortex-A15 has 2-3× higher performance, but the Cortex-A7 is 3-4× more energy efficient. Such systems reduce energy consumption by identifying phase changes in an application and migrating execution to the most efficient core that meets its current performance requirements. Known designs select the best core by briefly sampling performance on each. However, every time the application migrates between cores, its current state must be explicitly transferred or rebuilt on the new core. This state transfer incurs large overheads that limit migration between cores to a granularity of tens to hundreds of millions of instructions. To mitigate these effects, the decision to migrate applications is made at the granularity of operating system time slices.
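The sampling-based selection policy described above can be illustrated with a minimal sketch. The core names follow the big.LITTLE example; the migration cost, IPC values, and selection rule are illustrative assumptions, not taken from any real scheduler.

```python
# Hypothetical sketch of coarse-grained, sampling-based core selection:
# at each OS time slice, the thread's performance is briefly sampled on
# each core, and execution then commits to the most efficient core that
# meets a performance target. All numbers below are illustrative.

BIG, LITTLE = "Cortex-A15", "Cortex-A7"
MIGRATION_COST = 5_000_000  # cycles lost per switch (assumed); this cost is
                            # why migration is limited to coarse granularity

def choose_core(sampled_ipc, perf_target):
    """Pick the most energy-efficient core whose sampled IPC meets the target."""
    if sampled_ipc[LITTLE] >= perf_target:
        return LITTLE   # the little core is sufficient and more efficient
    return BIG          # otherwise fall back to the big core

# Example: a memory-bound phase where the little core keeps up.
samples = {BIG: 1.1, LITTLE: 0.9}
assert choose_core(samples, perf_target=0.8) == LITTLE
assert choose_core(samples, perf_target=1.0) == BIG
```

Because each sample itself requires a costly migration, the policy only pays off when phases are long relative to `MIGRATION_COST`, which is the limitation the present discussion is concerned with.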
R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen, “Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction,” in Proc. of the 36th Annual International Symposium on Microarchitecture, December 2003, pp. 81-92, considers migrating thread context between out-of-order and in-order cores for the purpose of reducing power. At coarse granularities of 100M instructions, one or more of the inactive cores are sampled by switching the thread to each core in turn. Each switch involves flushing dirty L1 data to a shared L2 cache, which is both slow and energy-consuming.
Rather than relying on sampling the performance on both cores, K. Van Craeynest, A. Jaleel, L. Eeckhout, P. Narvaez, and J. Emer, “Scheduling heterogeneous multi-cores through performance impact estimation (pie),” in Proceedings of the 39th International Symposium on Computer Architecture, ser. ISCA '12, 2012, pp. 213-224 proposes a coarse-grained mechanism that relies on measures of CPI, MLP, and ILP to predict the performance on the inactive core. On the other hand, K. K. Rangan, G.-Y. Wei, and D. Brooks, “Thread motion: fine-grained power management for multi-core systems,” in Proc. of the 36th Annual International Symposium on Computer Architecture, 2009, pp. 302-313 examines a CMP with clusters of in-order cores sharing L1 caches. While the cores are identical architecturally, varied voltage and frequency settings create performance and power heterogeneity. A simple performance model is made possible by having exclusively in-order cores, and thread migration is triggered every 1000 cycles by a history-based (last value) predictor.
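The history-based (last-value) predictor used to trigger migration every 1000 cycles in the Rangan et al. design can be sketched as follows. The class structure, IPC threshold, and decision rule are illustrative assumptions.

```python
# Sketch of a last-value predictor driving fine-grained thread migration.
# The predictor assumes the next interval's performance will equal the
# last observed value; all thresholds here are illustrative.

INTERVAL = 1000  # cycles between scheduling decisions, per Rangan et al.

class LastValuePredictor:
    """Predict that the next interval's IPC equals the last interval's."""
    def __init__(self):
        self.last_ipc = None

    def predict(self):
        return self.last_ipc

    def update(self, observed_ipc):
        self.last_ipc = observed_ipc

def should_run_on_fast_core(predictor, ipc_threshold=1.0):
    # High predicted IPC suggests the thread can exploit a fast
    # (high voltage/frequency) core; low predicted IPC suggests it is
    # stalled and a slower, more efficient setting suffices.
    ipc = predictor.predict()
    return ipc is not None and ipc >= ipc_threshold

p = LastValuePredictor()
p.update(1.4)                 # measured over the previous 1000-cycle interval
assert should_run_on_fast_core(p)
p.update(0.3)                 # the thread entered a stall-heavy phase
assert not should_run_on_fast_core(p)
```

Such a simple model is workable in that design precisely because all cores are architecturally identical in-order cores; predicting performance across out-of-order and in-order cores, as in the PIE work, requires richer measures such as CPI, MLP, and ILP.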
Another class of work targets the acceleration of bottlenecks to thread parallelism. Segments of code constituting bottlenecks are annotated by the compiler and scheduled at runtime to run on a big core. M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt, “Accelerating critical section execution with asymmetric multi-core architectures,” in 17th International Conference on Architectural Support for Programming Languages and Operating Systems, 2009, pp. 253-264 describes a detailed architecture targeting critical sections, and J. A. Joao, M. Suleman, O. Mutlu, and Y. N. Patt, “Bottleneck identification and scheduling in multithreaded applications,” in 20th International Conference on Architectural Support for Programming Languages and Operating Systems, 2012, pp. 223-234 generalizes this work to identify the most critical bottlenecks at runtime. G. Patsilaras, N. K. Choudhary, and J. Tuck, “Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era,” ACM Trans. Archit. Code Optim., vol. 8, no. 4, pp. 28:1-28:21, January 2012 proposes building separate cores, one that targets MLP and the other that targets ILP. The system then uses the L2 cache miss rate to determine when an application has entered a memory-intensive phase and maps it to the MLP core. When the cache misses decrease, the system migrates the application back to the ILP core.
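The miss-rate-driven mapping rule of the Patsilaras et al. approach can be sketched as a simple phase detector. The MPKI thresholds and hysteresis are illustrative assumptions; the source does not specify these values.

```python
# Minimal sketch of L2-miss-rate-driven phase mapping: rising L2 misses
# per kilo-instruction (MPKI) indicate a memory-intensive phase suited to
# the MLP core; when misses subside, execution returns to the ILP core.
# Thresholds are assumed for illustration.

MLP_CORE, ILP_CORE = "mlp", "ilp"
ENTER_MPKI, EXIT_MPKI = 10.0, 2.0  # hysteresis thresholds (assumed)

def next_core(current, l2_mpki):
    if current == ILP_CORE and l2_mpki > ENTER_MPKI:
        return MLP_CORE   # memory-intensive phase detected
    if current == MLP_CORE and l2_mpki < EXIT_MPKI:
        return ILP_CORE   # memory pressure has subsided
    return current        # hysteresis: otherwise stay put

core = ILP_CORE
for mpki in [1.0, 12.0, 8.0, 1.5]:   # a trace of measured L2 MPKI
    core = next_core(core, mpki)
assert core == ILP_CORE  # back on the ILP core after the phase passes
```

The two-threshold hysteresis avoids oscillating between cores when the miss rate hovers near a single cutoff.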
Other work studies the benefits of heterogeneity in real systems. M. Annavaram, E. Grochowski, and J. Shen, “Mitigating Amdahl's law through EPI throttling,” in Proceedings of the 32nd Annual International Symposium on Computer Architecture, 2005, pp. 298-309 shows the performance benefits of heterogeneous multi-cores for multithreaded applications on a prototype with different frequency settings per core. Y. Kwon, C. Kim, S. Maeng, and J. Huh, “Virtualizing performance asymmetric multi-core systems,” in Proc. of the 38th Annual International Symposium on Computer Architecture, 2011, pp. 45-56 motivates asymmetry-aware hypervisor thread schedulers, studying cores with various voltage and frequency settings. D. Koufaty, D. Reddy, and S. Hahn, “Bias scheduling in heterogeneous multi-core architectures,” in Proc. of the 5th European Conference on Computer Systems, 2010, pp. 125-138 discovers an application's big- or little-core bias by monitoring stall sources, and uses this bias to give preference to OS-level thread migrations that move a thread to the core it prefers. In that work, a heterogeneous multi-core prototype is produced by throttling the instruction retirement rate of some cores down to one instruction per cycle.
Other designs propose allowing a thread to adapt (borrow, lend, or combine) hardware resources, and still other designs allow dynamic voltage/frequency scaling (DVFS). Alternatively, asymmetry can be introduced by dynamically adapting a core's resources to its workload. Prior work has suggested adapting out-of-order structures such as the issue queue (see R. Bahar and S. Manne, “Power and energy reduction via pipeline balancing,” Proc. of the 28th Annual International Symposium on Computer Architecture, vol. 29, no. 2, pp. 218-229, 2001), as well as other structures such as ROBs, LSQs, and caches (see: D. Ponomarev, G. Kucuk, and K. Ghose, “Reducing power requirements of instruction scheduling through dynamic allocation of multiple datapath resources,” in Proc. of the 34th Annual International Symposium on Microarchitecture, December 2001, pp. 90-101; R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas, “Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures,” in Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture, 2000, pp. 245-257; and D. Albonesi, R. Balasubramonian, S. Dropsbo, S. Dwarkadas, E. Friedman, M. Huang, V. Kursun, G. Magklis, M. Scott, G. Semeraro, P. Bose, A. Buyuktosunoglu, P. Cook, and S. Schuster, “Dynamically tuning processor resources with adaptive processing,” IEEE Computer, vol. 36, no. 12, pp. 49-58, December 2003).
R. Kumar, N. Jouppi, and D. Tullsen, “Conjoined-core chip multiprocessing,” in Proc. of the 37th Annual International Symposium on Microarchitecture, 2004, pp. 195-206 explored how a pair of adjacent cores can share area-expensive structures, while keeping the floorplan in mind. H. Homayoun, V. Kontorinis, A. Shayan, T.-W. Lin, and D. M. Tullsen, “Dynamically heterogeneous cores through 3d resource pooling,” in Proc. of the 18th International Symposium on High-Performance Computer Architecture, 2012, pp. 1-12 examined how micro-architectural structures can be shared across 3D stacked cores. These techniques are limited by the structures they adapt and cannot, for instance, switch from an out-of-order core to an in-order core during periods of low ILP.
E. Ipek, M. Kirman, N. Kirman, and J. Martinez, “Core fusion: Accommodating software diversity in chip multiprocessors,” in Proc. of the 34th Annual International Symposium on Computer Architecture, 2007, pp. 186-197 and C. Kim, S. Sethumadhavan, M. S. Govindan, N. Ranganathan, D. Gulati, D. Burger, and S. W. Keckler, “Composable lightweight processors,” in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, 2007, pp. 381-394 describe techniques to compose or fuse several cores into a larger core. While these techniques provide a fair degree of flexibility, a core constructed in this way is generally expected to have a datapath that is less energy efficient than if it were originally designed as an indivisible core of the same size.
DVFS approaches reduce the voltage and frequency of the core to improve the core's energy efficiency at the expense of performance. However, when targeted at memory-bound phases, this approach can be effective at reducing energy with minimal impact on performance. Similar to traditional heterogeneous multicore systems, the overall effectiveness of DVFS suffers from coarse-grained scheduling intervals in the millisecond range. In addition, providing independent DVFS settings for more than two cores is costly in terms of both area and energy. Despite these limitations, DVFS is still widely used in production processors today, and has for example been incorporated into the above-mentioned ARM big.LITTLE heterogeneous multicore system. Two competing techniques, fast on-chip regulators (see W. Kim, M. S. Gupta, G.-Y. Wei, and D. Brooks, “System level analysis of fast, per-core DVFS using on-chip switching regulators,” in Proc. of the 14th International Symposium on High-Performance Computer Architecture, 2008, pp. 123-134 and W. Kim, D. Brooks, and G.-Y. Wei, “A fully-integrated 3-level DCDC converter for nanosecond-scale DVFS,” IEEE Journal of Solid-State Circuits, vol. 47, no. 1, pp. 206-219, January 2012) and dual voltage rails (see T. N. Miller, X. Pan, R. Thomas, N. Sedaghati, and R. Teodorescu, “Booster: Reactive core acceleration for mitigating the effects of process variation and application imbalance in low-voltage chips,” in Proc. of the 18th International Symposium on High-Performance Computer Architecture, vol. 0, 2012, pp. 1-12 and R. Dreslinski, “Near threshold computing: From single core to manycore energy efficient architectures,” Ph.D. dissertation, University of Michigan, 2011), have recently been proposed to enable fine-grained DVFS and promise improved transition latencies.
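The energy-versus-performance trade-off underlying DVFS can be illustrated with a back-of-the-envelope sketch. It uses the standard approximation that dynamic power scales with C·V²·f; the specific voltage and frequency values are illustrative, not drawn from any cited design.

```python
# Back-of-the-envelope sketch of the DVFS trade-off: dynamic power scales
# roughly with C * V^2 * f, so running slower at a lower voltage reduces
# the energy spent per instruction. All values are illustrative.

def dynamic_energy_per_op(voltage, freq, capacitance=1.0, ops_per_cycle=1.0):
    power = capacitance * voltage**2 * freq       # dynamic power, C*V^2*f
    throughput = ops_per_cycle * freq             # ops per second
    return power / throughput                     # simplifies to C*V^2 per op

high = dynamic_energy_per_op(voltage=1.0, freq=2.0e9)   # fast operating point
low = dynamic_energy_per_op(voltage=0.8, freq=1.2e9)    # slow operating point
assert low < high   # the lower voltage/frequency point is more efficient
```

During a memory-bound phase the frequency reduction costs little performance, since the core is waiting on memory anyway, which is why DVFS targeted at such phases can save energy with minimal slowdown.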
Despite these varied advances in the technology, the applicant considers that there remains the opportunity to improve on the prior art.