1. Field of the Invention
The present invention relates to multiprocessor non-uniform cache architecture systems and, more particularly, to cache line placement prediction for multiprocessor non-uniform cache architecture systems.
2. Description of the Related Art
As semiconductor technology advances, growing wire delays are becoming a dominant factor in overall cache access latencies. A Non-Uniform Cache Architecture (“NUCA”) system comprises multiple cache portions, wherein different cache portions have different access latencies due to different distances from an accessing processor. The time required to access a data item from a non-uniform cache largely depends on where the data item is located in the non-uniform cache, rather than on the actual time used to retrieve the data item from a data array.
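By way of illustration only, the dependence of access time on location can be sketched with a simple model in which latency is a fixed data-array access time plus a wire delay proportional to distance. The constants and the function name `access_latency` below are hypothetical and chosen only for the example; they do not describe any particular system.

```python
# Illustrative latency model. All constants are assumptions for this sketch,
# not measurements of a real non-uniform cache.
ARRAY_ACCESS_CYCLES = 3   # assumed time to read the data array itself
WIRE_CYCLES_PER_HOP = 4   # assumed one-way wire delay per unit of distance


def access_latency(distance_hops: int) -> int:
    """Cycles to access a cache portion `distance_hops` away.

    The request and the reply each cross the same distance, so the wire
    delay is counted twice.
    """
    return ARRAY_ACCESS_CYCLES + 2 * WIRE_CYCLES_PER_HOP * distance_hops


# Latency for cache portions at distances 0, 1, 2 and 3 from the processor.
latencies = [access_latency(d) for d in range(4)]
print(latencies)
```

Under these assumed constants, the wire delay already exceeds the data-array access time for a cache portion one hop away, which is the sense in which location dominates access time.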
Designing a NUCA system requires numerous architectural issues to be considered. For example, questions for consideration may include: (1) How to map memory addresses to different cache portions; (2) How to connect cache portions with each other; (3) How to search a memory address in a non-uniform cache; (4) Where to place a cache line when it is brought into a non-uniform cache from the memory; and (5) How to ensure coherence if data of a cache line can be replicated in multiple cache portions. Decisions on these and other architectural issues may have a profound impact on the overall performance and complexity of the NUCA system.
It is generally desirable to allow a cache line to migrate from one cache portion to another to reduce cache access latencies. For example, in a uni-processor NUCA system, a promotion scheme may allow a cache line to gradually migrate toward the processor each time the cache line is accessed (see “An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches”, in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, by C. Kim, D. Burger and S. Keckler). It should be understood that, in a multiprocessor NUCA system, the promotion scheme can often cause a cache line to “ping-pong” among multiple cache portions if the cache line is accessed by multiple processors.
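The “ping-pong” effect can be illustrated with a minimal simulation. The sketch below assumes a hypothetical linear arrangement of four cache portions, with one processor adjacent to portion 0 and another adjacent to portion 3, and a promotion scheme that moves a cache line one portion closer to the accessing processor on each access. The names `promote` and `simulate` and the layout are assumptions made for the example, not a description of the cited scheme's implementation.

```python
def promote(position: int, cpu_position: int) -> int:
    """Move the cache line one portion closer to the accessing processor."""
    if position < cpu_position:
        return position + 1
    if position > cpu_position:
        return position - 1
    return position  # already in the closest portion; no migration


def simulate(accesses, start, cpu_positions):
    """Return the cache line's position after each access in `accesses`."""
    position = start
    trace = [position]
    for cpu in accesses:
        position = promote(position, cpu_positions[cpu])
        trace.append(position)
    return trace


# Assumed layout: cpu0 sits next to portion 0, cpu1 next to portion 3.
cpu_positions = {"cpu0": 0, "cpu1": 3}

# One processor accessing the line: it settles in the closest portion.
single = simulate(["cpu0"] * 4, start=2, cpu_positions=cpu_positions)

# Two processors alternating: the line oscillates and never settles.
shared = simulate(["cpu0", "cpu1"] * 3, start=2, cpu_positions=cpu_positions)

print(single)
print(shared)
```

In the single-processor trace the line stops migrating once it reaches portion 0, whereas in the shared trace every access triggers a migration, which consumes fabric bandwidth without leaving the line close to either processor.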
Referring now to FIG. 1, an exemplary multiprocessor NUCA system 100 includes a number of central processing unit (“CPU”) cores. As shown, the CPU cores share one non-uniform cache 105 that is partitioned into multiple cache portions. Given a CPU core, different cache portions with varying physical distances from the CPU core may have different cache access latencies because of varying communication delays. Although not so labeled in FIG. 1, each CPU core may have one or more local cache portions and one or more remote cache portions. A local cache portion refers to a cache portion that is physically closer to the corresponding CPU core than the corresponding remote cache portions. It should be understood that directory information (not shown), typically including cache tags, coherence states and LRU bits, can be maintained in a centralized location or distributed with corresponding cache portions.
Referring now to FIG. 2, an exemplary multiprocessor NUCA system 200 is shown. The system 200 comprises two CPU cores, CPU core 0 and CPU core 1, sharing one non-uniform cache. The non-uniform cache comprises two cache portions, cache portion 0 and cache portion 1, wherein each cache portion can be further partitioned into two cache slices. From the perspective of CPU core 0, cache portion 0 is local and cache portion 1 is remote because cache portion 0 is physically closer to CPU core 0 than cache portion 1; likewise, from the perspective of CPU core 1, cache portion 1 is local and cache portion 0 is remote because cache portion 1 is physically closer to CPU core 1 than cache portion 0. A communication fabric allows a CPU core to access either of the two cache portions, and can be used to migrate data from one cache portion to another if necessary.
Referring now to FIG. 3, another exemplary multiprocessor NUCA system 300 is shown. The system comprises four CPU cores, CPU core 0, CPU core 1, CPU core 2 and CPU core 3, sharing one non-uniform cache. The non-uniform cache comprises four cache portions, cache portion 0, cache portion 1, cache portion 2 and cache portion 3, which are local to CPU core 0, CPU core 1, CPU core 2 and CPU core 3, respectively. The CPU cores and cache portions can communicate with each other via a communication fabric.
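The local and remote relationships described for FIG. 2 and FIG. 3 can be summarized in a short sketch. It assumes, as in both figures, that cache portion i is the portion local to CPU core i; the function name `classify_portions` is hypothetical and used only for illustration.

```python
def classify_portions(core: int, num_cores: int) -> dict:
    """Label each cache portion as local or remote from one core's view.

    Assumes one cache portion per CPU core, with portion i local to core i.
    """
    return {
        portion: "local" if portion == core else "remote"
        for portion in range(num_cores)
    }


# Two-core arrangement (as in FIG. 2), seen from CPU core 0 and CPU core 1.
two_core_view0 = classify_portions(0, num_cores=2)
two_core_view1 = classify_portions(1, num_cores=2)

# Four-core arrangement (as in FIG. 3), seen from CPU core 2.
four_core_view2 = classify_portions(2, num_cores=4)

print(two_core_view0, two_core_view1, four_core_view2)
```

The same cache portion is therefore local to one CPU core and remote to every other, so whether a given placement is favorable depends on which CPU core accesses the cache line.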