Many of today's processors are implemented as multi-core processors, in which multiple or many cores are present on a single semiconductor die. Often, each core includes a first-level cache and is associated with other cache levels that store frequently or recently accessed data. One possible cache hierarchy for multi-core chips has one or more levels of private cache per core, and a distributed tag directory (TD) to maintain coherence between the different cores' private caches. To reduce off-die accesses to shared data, the TD may support cache-to-cache transfers between different cores' private caches. However, in such a hierarchy concurrent reads for the same cache line are serialized: while a cache-to-cache transfer for a line is pending, later requests for that line must wait, so the throughput of handling requests for those shared lines is limited by the latency of pending cache-to-cache transfers. In contrast, a shared cache hierarchy, in which one or more cache levels are shared by multiple cores, may directly respond to read requests for data being read-shared by other cores, because a shared cache by its nature can hold a copy of read-shared lines. The line never moves to a pending state as in the private cache situation above, so the throughput of such read requests is limited only by the shared cache request throughput.
Still further, application performance in a private cache hierarchy may be limited by this throughput if the application uses many threads and the cores on which those threads run frequently miss to the same cache line. A number of applications exhibit this behavior, and thus perform worse with private caches than with a shared cache.
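
The throughput contrast described above can be illustrated with a toy timing model. This is only a sketch under simplifying assumptions: the function names, the cycle counts, and the idea that a shared cache accepts one new read per fixed issue interval are all hypothetical and are not taken from any particular design.

```python
def private_cache_finish_times(num_readers, transfer_latency):
    """Finish time (in cycles) of each read when reads are serialized.

    Assumption: the line is pending for the full duration of each
    cache-to-cache transfer, so reader k waits for the k transfers
    queued ahead of it plus its own.
    """
    return [(k + 1) * transfer_latency for k in range(num_readers)]


def shared_cache_finish_times(num_readers, access_latency, issue_interval):
    """Finish time (in cycles) of each read served by a shared cache.

    Assumption: the shared cache already holds the line and can start
    a new read every `issue_interval` cycles, so reads overlap instead
    of waiting for one another to complete.
    """
    return [k * issue_interval + access_latency for k in range(num_readers)]


if __name__ == "__main__":
    readers = 4
    # Hypothetical latencies, chosen only to show the trend.
    print(private_cache_finish_times(readers, transfer_latency=50))
    print(shared_cache_finish_times(readers, access_latency=20, issue_interval=1))
```

In this model the last private-cache reader finishes after a time that grows with the number of readers multiplied by the transfer latency, whereas the last shared-cache reader is delayed only by the issue interval of the cache, which mirrors the argument that one case is latency-bound and the other throughput-bound.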