This document relates to migration of execution in a multiple processor computing system.
The steady increases in processor performance obtainable from increasing clock frequencies have largely come to a halt in recent years, at least in part because there is no cost-effective way to dissipate the heat generated by processors with extremely high clock frequencies. As a result, recent development efforts have favored multi-core parallelism. Commodity processors with four or eight cores on a single die have become common, and existing technology permits including many more processors on a single die. General-purpose single-die multiprocessors with as many as 64 cores are already commercially available. Even larger multi-core processors have been built, and it is possible that dies will soon include thousands of cores.
One major concern for multi-core processor designers is the design of a scalable memory subsystem for the processor. Increasing the number of concurrent threads requires a large aggregate memory bandwidth, but off-chip memory bandwidth grows with package pin density, which scales much more slowly than on-die transistor density. Thus, off-chip memory bandwidth is severely constrained by the number of pins on an integrated circuit package. This constraint is known as the off-chip memory bandwidth wall. To address this problem, many multi-core processors integrate large private and shared caches on chip. The hope is that large caches can hold the working sets of the active threads, thereby reducing the number of off-chip memory accesses. Private caches, however, require a mechanism for maintaining coherency between caches, and shared caches do not generally scale beyond a few cores.
Since shared caches do not scale, the distribution of many private caches close to processor cores is a preferred option in large-scale multi-core processors. In some approaches, each processor core is associated with a local cache, and all other caches are considered remote caches. Accessing remote cache lines is significantly slower than accessing local cache lines. The caches store data that is accessed by threads running on a core that is connected to the cache. In practice, this means that some form of memory coherence or other memory access control is generally needed. Memory coherence can create the illusion of a shared memory, but scaling memory coherence algorithms to multi-core processors that include thousands of cores presents significant problems.
Some multi-core processors use bus-based cache coherence, which provides the illusion of a single, consistent memory space. However, bus-based cache coherence does not generally scale beyond a few cores. Other multi-core processors use directory-based cache coherence. Directory-based cache coherence is not subject to some of the limitations of buses, but can require complex states and protocols for efficiency even in relatively small multi-core processors. Furthermore, directory-based protocols can contribute significantly to the already costly delays of accessing off-chip memory because data replication limits the efficient use of cache resources. Additionally, directory-based protocols that have one large directory are often slow and consume large amounts of power. Finally, the area costs of keeping directory entries can be a large burden: if most of the directory is kept in off-chip memory, accesses will be too slow, but if the directory is stored in a fast on-chip memory, evictions from the directory cause thrashing in the per-core caches, also decreasing performance.
The abundance of interconnect bandwidth included with on-chip multi-core processors provides an opportunity for optimization. Existing electrical on-chip interconnect networks offer terabits per second of cross-section bandwidth with latencies growing with the diameter of the network (i.e., as the square root of the core count in meshes), and emerging 3D interconnect technologies enable high-bandwidth, low-latency on-chip networks. Optical interconnect technology, which offers high point-to-point bandwidth at little latency and with low power, is fast approaching miniaturization comparable to silicon circuits, with complete ring lasers no larger than 20 μm². Multi-core architectures featuring an on-chip optical interconnect have been proposed, but have so far been based on traditional cache-coherent memory architectures.
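The square-root relationship between core count and mesh latency noted above can be illustrated with a short calculation. The following is a minimal sketch, assuming a square two-dimensional mesh without wraparound links and taking hop count as a proxy for latency; the function name mesh_diameter is hypothetical and is not drawn from any particular system.

```python
import math


def mesh_diameter(core_count: int) -> int:
    """Return the worst-case hop count between two cores.

    Assumption (illustrative only): cores form a square k x k 2D mesh
    with no wraparound links, so the farthest pair sits at opposite
    corners, (k - 1) hops apart in each of the two dimensions.
    """
    side = math.isqrt(core_count)
    if side < 1 or side * side != core_count:
        raise ValueError("core_count must be a positive perfect square")
    return 2 * (side - 1)


if __name__ == "__main__":
    # Quadrupling the core count roughly doubles the diameter.
    for cores in (16, 64, 256, 1024):
        print(cores, mesh_diameter(cores))
```

Under these assumptions, a 64-core mesh has a diameter of 14 hops and a 1024-core mesh has a diameter of 62 hops, so a sixteen-fold increase in core count raises the worst-case hop count by roughly a factor of four.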