Commercial CPUs, such as CPUs based on the x86 architecture, usually adopt a hierarchical caching structure between the CPU core(s) and the main memory (e.g., a dynamic random-access memory (DRAM)). The caching structure may include multiple levels of caches, ranging from fast but small (i.e., low storage capacity) lower level caches to slow but large (i.e., high storage capacity) higher level caches. Data accesses (e.g., loading/reading data from the memory or storing/writing data to the memory) are performed by the CPU through the multi-level caches to reduce latency and improve data access speed. For example, when the CPU performs a read/load operation, the CPU first seeks a copy of the data in the lowest level cache. If the copy is found (also referred to as a “hit”), the CPU fetches the copy from the lowest level cache without reaching out to the higher level caches or the main memory. If the copy is not found in the lowest level cache (also referred to as a “miss”), the CPU seeks the copy in the next higher level cache, and repeats the process through all levels of caches. If the data is not found in any level of cache, the CPU then fetches the data from the main memory.
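The read/load sequence described above may be illustrated with a minimal Python sketch. The class name `CacheLevel`, the function `load`, and all latency values are hypothetical and chosen only for illustration; they are not taken from any particular CPU. The sketch also omits cache fills and evictions, which real hardware performs.

```python
from typing import Dict, List, Optional, Tuple


class CacheLevel:
    """One level of a hypothetical cache hierarchy (illustrative only)."""

    def __init__(self, name: str, latency_cycles: int) -> None:
        self.name = name
        self.latency_cycles = latency_cycles  # assumed cost of probing this level
        self.lines: Dict[int, int] = {}       # address -> cached copy of the data

    def lookup(self, address: int) -> Optional[int]:
        """Return the cached copy on a hit, or None on a miss."""
        return self.lines.get(address)


def load(
    address: int,
    levels: List[CacheLevel],
    main_memory: Dict[int, int],
    memory_latency_cycles: int = 200,  # assumed DRAM cost, not a measured value
) -> Tuple[int, str, int]:
    """Probe the levels from lowest to highest, then fall back to main memory.

    Returns (value, name of the level that supplied it, total cycles spent).
    """
    cycles = 0
    for level in levels:
        cycles += level.latency_cycles
        value = level.lookup(address)
        if value is not None:  # hit: stop without probing higher levels
            return value, level.name, cycles
    # Miss in every level: fetch the data from main memory.
    cycles += memory_latency_cycles
    return main_memory[address], "DRAM", cycles
```

Under these assumed latencies, a load that hits in a second-level cache pays only for the two levels it probed, whereas a load that misses in every level pays for all probes plus the main memory access.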
Among the multiple levels of caches, the last-level cache (LLC) is usually at the third level in the x86 architecture, so it may also be referred to as an L3 cache. An LLC is usually a few tens of megabytes (MBs) in size and has a relatively long access latency compared to the lower level caches (e.g., level-one (L1) and level-two (L2) caches). In existing commercial (e.g., off-the-shelf) multi-core CPUs, the LLC is usually split into smaller slices that are distributed among, and interconnected across, multiple CPU cores. For example, each slice is attached to one core. A single piece of data is stored in a distributed manner across multiple slices, and thus across multiple cores. This design allows the entire LLC capacity to be shared by multiple cores, but it also increases the LLC hit latency and the power consumption. This is because a CPU core has to undergo multiple hops to fetch portions of the same piece of data from LLC slices attached to other cores, regardless of whether the data is private to a single core or not. The multiple hops add latency and consume power, and the accumulated effect is significant in a data center environment where hundreds of thousands of CPUs operate around the clock.
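The hop overhead of a sliced LLC may be illustrated with a small Python model. Everything in it is an assumption made for illustration: the 64-byte line size, the modulo mapping in `slice_for` (commercial CPUs may use a different mapping that is not described here), and the bidirectional ring assumed by `ring_hops`. The model only counts hops; it does not estimate cycles or energy.

```python
LINE_SIZE = 64  # bytes per cache line (assumed)


def slice_for(address: int, num_slices: int) -> int:
    """Hypothetical mapping: consecutive cache lines go to consecutive slices."""
    return (address // LINE_SIZE) % num_slices


def ring_hops(core: int, slice_id: int, num_slices: int) -> int:
    """Shortest distance on an assumed bidirectional ring, one slice per core."""
    distance = abs(core - slice_id)
    return min(distance, num_slices - distance)


def total_hops(core: int, base: int, size: int, num_slices: int) -> int:
    """Sum the hops a core needs to touch every cache line of one buffer.

    `base` is the buffer's start address and `size` its length in bytes
    (size must be positive).
    """
    first_line = base // LINE_SIZE
    last_line = (base + size - 1) // LINE_SIZE
    return sum(
        ring_hops(core, slice_for(line * LINE_SIZE, num_slices), num_slices)
        for line in range(first_line, last_line + 1)
    )
```

In this model, a 512-byte buffer used only by core 0 of an eight-core CPU is spread over all eight slices, so `total_hops(0, 0, 512, 8)` counts hops to seven remote slices even though no other core ever uses the data.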