1. Technical Field
The present invention generally relates to data processing systems and in particular to clustered shared-memory multiprocessors. More particularly, the present invention relates to ordering cache accesses in clustered shared-memory multiprocessor systems.
2. Description of the Related Art
To reduce global bandwidth requirements within a computer system, many modern shared-memory multiprocessor systems are clustered. The processors are divided into groups called symmetric multiprocessing nodes (SMP nodes), such that processors within the same SMP node may share a physical cabinet, a circuit board, a multi-chip module, or a chip, thereby enabling low-latency, high-bandwidth communication between processors in the same SMP node. Two-level cache coherence protocols exploit this clustering configuration to conserve global bandwidth by first broadcasting a processor's memory request for a line of data to the local SMP node, and sending the memory request to other SMP nodes only if necessary (e.g., if it is determined from the responses to the first broadcast that the requested line is not cached on the local SMP node). While this type of two-level cache coherence protocol reduces the global bandwidth requirements of the computer system, memory requests that must eventually be broadcast to other SMP nodes are delayed by first checking the local SMP node for the requested line, and the unsuccessful local broadcast causes the computer system to consume more SMP node bandwidth and power. It is therefore important for performance, scalability, and power consumption to send each memory request first to the portion of the shared-memory computer system where the cached data is most likely to be found.
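The two-level broadcast order described above can be illustrated with a minimal sketch. The function name `two_level_request` and the use of a set to model the lines cached on the local SMP node are assumptions made for illustration only; they do not represent any particular protocol implementation.

```python
def two_level_request(line, local_cached_lines):
    """Return the broadcasts issued for `line`, in order (illustrative model only).

    `local_cached_lines` is a hypothetical set of line addresses cached
    somewhere on the requester's own SMP node.
    """
    broadcasts = ["local"]           # the local SMP node is always checked first
    if line in local_cached_lines:
        return broadcasts            # satisfied locally; no global traffic
    broadcasts.append("global")      # local check failed; go to other SMP nodes
    return broadcasts
```

In this model, a request for a line that is not cached locally incurs both broadcasts, which corresponds to the added latency and the extra SMP node bandwidth and power noted above.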
Coarse-Grain Coherence Tracking with the aid of Region Coherence Arrays is a technique that can improve the performance, scalability, and power consumption of broadcast-based, shared-memory multiprocessor systems. Region Coherence Arrays track coherence status at a coarse granularity, and use this information to route memory requests in order to minimize request latency, conserve interconnect bandwidth and reduce power consumption.
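As a rough illustration of routing by coarse-grain state, the following sketch maps an address to its region and selects a broadcast target from the region's recorded state. The state names (`"local_only"`, `"remote"`), the dictionary used as the array, the 4 KB region size, and the fallback to a two-level broadcast for untracked regions are all assumptions made for this example and are not taken from the description above.

```python
REGION_BYTES = 4096  # assumed region size (4 KB)


def region_of(address):
    """Map a byte address to its region number."""
    return address // REGION_BYTES


def route_request(address, region_array):
    """Pick a broadcast target from the (hypothetical) region state.

    `region_array` is a dict from region number to an assumed state string.
    """
    state = region_array.get(region_of(address))
    if state == "local_only":
        return "local"               # region known to be cached only on this node
    if state == "remote":
        return "global"              # skip the local check entirely
    return "local_then_global"       # no information: assumed two-level fallback
```

For example, with `rca = {region_of(0x1000): "local_only"}`, a request for address `0x1080` falls in the same 4 KB region and is routed to the local SMP node only.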
There are three implementation considerations with Coarse-Grain Coherence Tracking facilitated by Region Coherence Arrays: area, latency, and power consumption. First, Region Coherence Arrays need to be relatively large to be effective, such that the Region Coherence Arrays map several times the data contained in the processor's cache hierarchy. Empirical results show that Region Coherence Arrays with 4 KB regions need at least one-fourth the number of locations of the processor's cache hierarchy to be effective (assuming a 128-byte cache line). Thus, Region Coherence Arrays consume a significant area in facilitating Coarse-Grain Coherence Tracking. Second, and in part due to their size, Region Coherence Arrays may need to be accessed in parallel with the lowest-level cache to minimize the latency added to external requests. When a cache miss is detected, the region coherence state is used to route the external request. Third, Region Coherence Arrays can be power-hungry. The non-trivial size of Region Coherence Arrays and the need to access them in parallel with the lowest-level cache can lead to considerable power consumption. Moreover, because the region coherence state is needed only when a cache miss occurs, the power spent accessing the Region Coherence Array is wasted whenever the request hits in the lowest-level cache.
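The sizing figures above can be checked with simple arithmetic. The sketch below uses the 4 KB region size, the 128-byte cache line, and the one-fourth ratio stated above; the 1 MB cache capacity and the function name `rca_coverage` are assumptions chosen only to make the numbers concrete.

```python
def rca_coverage(cache_bytes, line_bytes=128, region_bytes=4096):
    """Return (entries, covered_bytes, ratio) for an array with one-fourth
    as many locations as the cache hierarchy has lines."""
    cache_lines = cache_bytes // line_bytes    # locations in the cache hierarchy
    entries = cache_lines // 4                 # one-fourth as many array locations
    covered_bytes = entries * region_bytes     # memory mapped by the array
    return entries, covered_bytes, covered_bytes / cache_bytes


# Hypothetical 1 MB cache hierarchy: 8192 lines, 2048 array locations,
# 8 MB of mapped memory, i.e. eight times the cache capacity.
entries, covered, ratio = rca_coverage(1024 * 1024)
```

Each 4 KB region spans 32 cache lines of 128 bytes, so an array with one-fourth as many locations as the cache maps 32 / 4 = 8 times the data held in the cache, consistent with the statement that the array maps several times the data contained in the cache hierarchy.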