1. Technical Field
The present invention relates in general to the field of computers, and in particular to clustered shared-memory multiprocessors. More particularly, the present invention relates to an efficient region coherence protocol for clustered shared-memory multiprocessor systems.
2. Description of the Related Art
To reduce global bandwidth requirements within a computer system, many modern shared-memory multiprocessor systems are clustered. The processors are divided into groups called symmetric multiprocessing nodes (SMP nodes), such that processors within the same SMP node may share a physical cabinet, a circuit board, a multi-chip module, or a chip, thereby enabling low-latency, high-bandwidth communication between processors in the same SMP node. Two-level cache coherence protocols exploit this clustering configuration to conserve global bandwidth by first broadcasting memory requests for a line of data from a processor to the local SMP node, and only sending memory requests to other SMP nodes if necessary (e.g., if it is determined from the responses to the first broadcast that the requested line is not cached on the local SMP node). While this type of two-level cache coherence protocol reduces the computer system global bandwidth requirements, memory requests that must eventually be broadcast to other SMP nodes are delayed by the checking of the local SMP node first for the requested line, causing the computer system to consume more SMP node bandwidth and power. It is important for performance, scalability, and power consumption to first send memory requests to the appropriate portion of the shared-memory computer system where the cached data is most likely to be found.
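The request flow of such a two-level cache coherence protocol may be sketched as follows. This is a minimal illustrative model only, not an implementation taken from any particular system; the names `SMPNode` and `two_level_request` are hypothetical, and each SMP node is assumed to be reducible to the set of line addresses its processors currently cache.

```python
from dataclasses import dataclass, field


@dataclass
class SMPNode:
    """Hypothetical model of an SMP node: the line addresses cached by its processors."""
    cached_lines: set = field(default_factory=set)


def two_level_request(line, local, remotes):
    """Return (where_found, broadcasts) for a request issued on the local SMP node.

    The first broadcast goes to the local SMP node only. A second broadcast
    to the other SMP nodes is sent only if the local responses show that the
    requested line is not cached locally.
    """
    broadcasts = 1  # first broadcast: local SMP node
    if line in local.cached_lines:
        return "local", broadcasts

    broadcasts += 1  # second broadcast: other SMP nodes (the delayed case)
    for index, node in enumerate(remotes):
        if line in node.cached_lines:
            return f"remote{index}", broadcasts
    return "memory", broadcasts
```

In this sketch, a request satisfied on the local SMP node costs one broadcast, while a request for a line held elsewhere costs two, which corresponds to the added delay and local bandwidth consumption described above.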
There have been prior proposals for improved request routing in two-level cache coherence protocols for clustered multiprocessor systems, such as Power6 systems by IBM Corporation. For example, the In and Ig “pseudo invalid” states in the coherence protocols of such systems are used to predict whether a requested line of data is cached on the local SMP node or on other SMP nodes. However, there are several limitations to using these states.
First, a line of data must be brought into a processor cache and subsequently must be taken away by intervention to reach one of these states. These states only optimize subsequent requests by the processor to reacquire the cache line of data (temporal locality) and do not optimize the initial access to the data since a line request is sent to all processors. Second, they do not exploit spatial locality beyond the cache line, so a processor must collect and store information for each such line of the cache. Third, these states take up space in the processor's cache hierarchy, displacing valid data. Fourth, these states only help if they can remain in the cache hierarchy long enough, before being replaced by valid data, for the data to be accessed again. Finally, additional states must be added to handle additional levels of hierarchy in the system interconnect (for example where a separate hierarchical level exists for processors on a single chip, on a module, on a board, on an SMP node, or on a cabinet), thereby increasing cache coherence protocol complexity.
Use of these “pseudo invalid” states does not exploit spatial locality beyond the line of data requested, and does not define a region coherence protocol.
There have been prior proposals for Region Coherence Arrays (RCAs), which optimize global bandwidth by keeping track of the regions from which the processor is caching lines and whether other processors are caching lines from those regions. However, these proposals are for multiprocessor systems that are not clustered; that is, systems having a single, flat interconnect of processors. As such, these proposals for RCAs are suboptimal for clustered multiprocessor systems having hierarchical interconnects, since they cannot exploit cases where data is shared, for example, only by processors on the same SMP node. Furthermore, these proposals include RCAs that, in response to external requests, invalidate regions from which the processor is no longer caching lines. This dynamic self-invalidation makes it easier for other processors to obtain exclusive access to regions; however, the processor receiving the request discards useful information that could have been used to optimize subsequent requests.
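The behavior of such a prior, flat-interconnect RCA may be sketched as follows. This is an illustrative model under stated assumptions, not the structure of any specific prior proposal: the class and method names are hypothetical, the 4096-byte region size is assumed, and each entry is reduced to a count of locally cached lines plus a single flag indicating whether other processors may be caching lines from the region.

```python
REGION_BYTES = 4096  # assumed region size; not specified by the prior proposals


class RegionCoherenceArray:
    """Hypothetical per-processor RCA for a single, flat interconnect."""

    def __init__(self):
        # region number -> [locally cached line count, others-may-cache flag]
        self.entries = {}

    @staticmethod
    def region_of(addr):
        return addr // REGION_BYTES

    def line_allocated(self, addr, others_cache_region):
        """Record a newly cached line and what the responses said about the region."""
        entry = self.entries.setdefault(self.region_of(addr), [0, False])
        entry[0] += 1
        entry[1] = others_cache_region

    def line_evicted(self, addr):
        entry = self.entries.get(self.region_of(addr))
        if entry and entry[0] > 0:
            entry[0] -= 1

    def needs_broadcast(self, addr):
        """A broadcast can be skipped only if the region is known to be unshared."""
        entry = self.entries.get(self.region_of(addr))
        return entry is None or entry[1]

    def external_request(self, addr):
        """Handle another processor's request; return True if lines are cached here."""
        region = self.region_of(addr)
        entry = self.entries.get(region)
        if entry is None:
            return False
        if entry[0] == 0:
            # Dynamic self-invalidation: the entry, and the information it held,
            # is discarded so the requester can obtain the region exclusively.
            del self.entries[region]
            return False
        entry[1] = True  # another processor is now using the region
        return True
```

Note that the single shared/unshared flag cannot distinguish sharing within the same SMP node from sharing across SMP nodes, and that the self-invalidation step deletes the entry outright; these correspond to the two limitations noted above.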