Improving the performance of computer or other processing systems generally improves overall throughput and/or provides a better user experience. One technique for increasing the number of instructions a system can process is to increase the number of processors in the system. Implementing multiprocessing (MP) systems, however, typically requires more than merely interconnecting processors in parallel. For example, tasks or programs may need to be divided so that they can execute across parallel processing resources, and memory consistency mechanisms may be needed.
As logic elements continue to shrink due to advances in fabrication technology, integrating multiple processors into a single component becomes more practical, and in fact a number of current designs implement multiple processors on a single component or chip.
Chip multiprocessors (CMPs) hold the prospect of delivering long-term performance scalability while dramatically reducing design complexity compared to monolithic wide-issue processors. Complexity is reduced by designing and verifying a single, relatively simple core, and then replicating it. Performance is scaled by integrating larger numbers of cores on the die and harnessing increasing levels of thread-level parallelism (TLP) with each new technology generation.
Unfortunately, high-performance parallel programming constitutes a tedious, time-consuming, and error-prone effort.
In that respect, the shift of complexity from hardware to software in ordinary CMPs is one of the most serious hurdles to their success. In the short term, on-chip integration of a modest number of relatively powerful (and relatively complex) cores may yield high utilization when running multiple sequential workloads, temporarily avoiding the complexity of parallelization. However, although sequential codes are likely to remain important, they alone are not sufficient to sustain long-term performance scalability. Consequently, harnessing the full potential of CMPs in the long term makes the adoption of parallel programming very attractive.
To amortize the cost of parallelization, many programmers choose to parallelize their applications incrementally. Typically, the most promising loops/regions in a sequential execution of the program are identified through profiling. A subset of these regions is then parallelized, and the rest of the application is left as “future work.” Over time, more effort is spent on portions of the remaining code. We call these evolving workloads. As a result of this “pay-as-you-go” approach, the complexity (and cost) associated with software parallelization is amortized over a greater time span. In fact, some of the most common shared-memory programming models in use today (for example, OpenMP) are designed to facilitate the incremental parallelization of sequential codes. We envision a diverse landscape of software in different stages of parallelization, from purely sequential, to fully parallel, to everything in between. As a result, it will remain important to efficiently support sequential as well as parallel code, whether standalone or as regions within the same application at run time. This requires a level of flexibility that is hard to attain in ordinary CMPs.
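As a minimal sketch of this incremental style, assume a hypothetical program whose profile points at a dot-product loop. The C fragment below annotates only that loop with an OpenMP directive and leaves the rest of the code sequential. The function names `dot_product` and `scale_in_place` are illustrative assumptions and do not come from any particular application.

```c
#include <stddef.h>

/* Hot region (assumed to have been identified by profiling):
 * this is the first loop chosen for parallelization. */
double dot_product(const double *a, const double *b, size_t n)
{
    double sum = 0.0;

    /* With OpenMP enabled (for example, -fopenmp on GCC or Clang), the
     * iterations are divided among threads and the per-thread partial
     * sums are combined by the reduction clause. Without OpenMP the
     * pragma is ignored and the loop runs sequentially. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Cold region: left sequential for now, i.e. the "future work"
 * portion of an evolving workload. */
void scale_in_place(double *a, size_t n, double factor)
{
    for (size_t i = 0; i < n; i++) {
        a[i] *= factor;
    }
}
```

When compiled without OpenMP support, the directive has no effect and the program remains the original sequential code. An application in this state therefore contains both a parallel region and a sequential region that must each run efficiently.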
Asymmetric chip multiprocessors (ACMPs) attempt to provide this flexibility by offering cores with varying degrees of sophistication and computational capability. The number and the complexity of the cores are fixed at design time. The hope is to match the demands of a variety of sequential and parallel workloads by executing them on an appropriate subset of these cores. Recent studies of the impact of performance asymmetry on explicitly parallelized applications have found that asymmetry hurts parallel application scalability and renders the applications' performance less predictable unless relatively sophisticated software changes are introduced. Hence, while ACMPs may deliver increased performance on sequential codes, they may do so at the expense of parallel performance, requiring a high level of software sophistication to maximize their potential.
Instead of trying to find the right design trade-off between complex and simple cores (as ACMPs do), there is a need for a CMP that provides the flexibility to dynamically synthesize the right mix of simple and complex cores based on application requirements.