An arithmetic processing device (CPU: central processing unit, or processor) has a plurality of cores, a last level cache (hereinafter, LLC) which is nearest to the main memory in the memory hierarchy and is shared by the plurality of cores, and a memory controller. The memory hierarchy includes, for example, a level 1 cache (L1 cache) provided inside each core, and a level 2 cache (L2 cache) provided outside the cores, shared by the plurality of cores, and connected to the main memory outside the processor. In this case, the L2 cache, being nearest to the main memory, corresponds to the LLC. Alternatively, if the memory hierarchy has an L1 cache and an L2 cache inside each core and, outside the cores, a level 3 cache (L3 cache) which is shared by the plurality of cores and connected to the main memory outside the processor, then the L3 cache, being nearest to the main memory, corresponds to the LLC.
In either hierarchical structure, if a cache miss occurs in the LLC, the LLC issues a fetch request to the memory controller of the processor managing the data; the memory controller accesses the main memory, reads out the data, and returns a data response to the requesting LLC. The requesting LLC registers the read data in the cache (a cache fill) and also sends a data response to the core.
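The miss-handling flow described above can be sketched as a small software model. This is only an illustrative sketch: the class and method names (`LLC`, `MemoryController`, `fetch`, `load`) are hypothetical, and tag/index decoding and replacement are omitted.

```python
class MainMemory:
    """Backing store accessed by the memory controller on an LLC miss."""
    def __init__(self, data):
        self.data = data  # address -> value

    def read(self, addr):
        return self.data[addr]


class MemoryController:
    """Fetches data from main memory on behalf of the LLC."""
    def __init__(self, memory):
        self.memory = memory

    def fetch(self, addr):
        return self.memory.read(addr)


class LLC:
    """Last level cache: on a miss, issues a fetch request to the
    memory controller, fills the line, and responds to the core."""
    def __init__(self, controller):
        self.controller = controller
        self.lines = {}  # addr -> data (tag/index detail omitted)

    def load(self, addr):
        if addr in self.lines:              # LLC hit
            return self.lines[addr]
        data = self.controller.fetch(addr)  # miss: fetch request
        self.lines[addr] = data             # cache fill
        return data                         # data response to the core


mem = MainMemory({0x100: 42})
llc = LLC(MemoryController(mem))
print(llc.load(0x100))  # miss: fetched from main memory -> 42
print(llc.load(0x100))  # hit: served from the LLC -> 42
```

The second access hits because the first miss filled the line, matching the fill-then-respond sequence described above.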
Cache capacity is tending to increase. Specifically, the number of cores integrated on a CPU chip is increasing with process miniaturization, and with the growing number of cores (and threads), the associativity (number of ways) of a set associative cache also rises. Along with this, the capacity of the LLC shared by the plurality of cores increases. Consequently, high-end processor chips are tending to grow in size as performance improves, despite the reduction in area achieved by miniaturization.
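The role of associativity (the number of ways) mentioned above can be illustrated with a minimal set associative cache model. This is a hypothetical sketch, not any particular processor's design: the address is split into a set index and a tag, each set holds up to `ways` lines, and LRU replacement is assumed.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Minimal set associative cache model (illustrative only).
    Each set holds up to `ways` lines, with LRU replacement."""
    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        """Returns True on a hit, False on a miss (with fill)."""
        index = addr % self.num_sets        # set index from the address
        tag = addr // self.num_sets
        lines = self.sets[index]
        if tag in lines:                    # hit: refresh LRU order
            lines.move_to_end(tag)
            return True
        if len(lines) >= self.ways:         # set full: evict the LRU line
            lines.popitem(last=False)
        lines[tag] = addr                   # fill
        return False

# More ways let more lines that map to the same set coexist,
# which is one reason LLC capacity grows with the core count.
cache = SetAssociativeCache(num_sets=4, ways=2)
print(cache.access(0))   # miss (fill, way 1 of set 0)
print(cache.access(4))   # miss (same set as 0, fills way 2)
print(cache.access(0))   # hit
```

With only one way, the third access would miss, since address 4 would have evicted address 0; raising the associativity avoids such conflict misses as more threads share the cache.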
In these circumstances, if an LLC configuration is adopted which gives every core of a many-core processor equal access to the LLC, then the data access path to the LLC becomes long, owing to the large chip size and the large-capacity LLC, and the hit latency of the LLC increases.
Therefore, rather than a single LLC shared by all of the cores, a configuration has been proposed in which the LLC is divided into a plurality of caches, each shared by one of a plurality of core groups. In a configuration of this kind, the LLC shared by each core group has a smaller capacity, the physical distance from a core to the LLC in its core group is shorter, the control is simpler, and high-speed access becomes possible. In other words, compared to a single large-capacity LLC configuration which permits equal access from all of the cores, a configuration in which a limited number of cores in each of a plurality of core groups share a small-capacity LLC achieves shorter LLC hit latency.
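The division of the LLC among core groups can be sketched as follows. The sketch is hypothetical (the class name `PartitionedLLC` and the fixed core-to-group mapping are assumptions for illustration); each group of cores is statically assigned its own smaller LLC slice.

```python
class PartitionedLLC:
    """Illustrative model of an LLC divided among core groups:
    each group of cores shares its own smaller LLC slice."""
    def __init__(self, num_groups, cores_per_group):
        self.cores_per_group = cores_per_group
        # one small cache (modeled as a dict) per core group
        self.slices = [{} for _ in range(num_groups)]

    def slice_for(self, core_id):
        """Map a core to the LLC slice of its core group."""
        return self.slices[core_id // self.cores_per_group]

    def load(self, core_id, addr, fetch):
        llc_slice = self.slice_for(core_id)
        if addr not in llc_slice:        # miss in the group's slice
            llc_slice[addr] = fetch(addr)
        return llc_slice[addr]

# 16 cores in 4 groups: cores 0-3 share slice 0, cores 4-7 share
# slice 1, and so on; each slice is smaller and physically closer
# to its cores than one monolithic LLC would be.
llc = PartitionedLLC(num_groups=4, cores_per_group=4)
value = llc.load(core_id=5, addr=0x200, fetch=lambda a: a * 2)
print(value)  # -> 1024, fetched into slice 1 only
```

Because each slice serves only its own core group, lookups search a smaller structure over a shorter path, which is the source of the shorter hit latency described above.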