The present invention generally relates to cache coherency in multiprocessor data processing systems, and more particularly to cache coherency in systems that implement weak memory models.
Multi-processor data processing systems harness the collective computation power of a multitude of processors. The memory system is central to a multi-processor system and must be scalable in order to provide sufficient bandwidth to each processor while sharing data between the multiple processors. For certain applications, an efficient means of sharing data is critical to effective collaboration between the multiple processors.
Cache coherence must be addressed in multi-processor systems with shared memory. Cache coherence protocols address the issue of ensuring that no processors in the system are using stale data in the local caches. In general, stale cache entries can be eliminated by either invalidating in the caches all but the most recently updated cache data or updating the caches with the most recent data. In a system using the invalidation protocol, an attempt to access an invalidated memory location from cache will cause the processor to read a copy of the most recent data either from another cache or from main memory. In the update protocol, following a write operation all the caches having a cached version of the data are updated with the most recent data. Thus, the most recent data are available in the caches.
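By way of illustration only, the difference between the invalidation protocol and the update protocol can be sketched with a toy model. The names `ToyCache`, `write`, `read`, and `demo` are hypothetical and do not come from this description. The write-through behavior and the single-level caches are simplifying assumptions.

```python
class ToyCache:
    """A single-level cache: address -> value. A missing key means 'invalid'."""

    def __init__(self):
        self.lines = {}


def write(caches, memory, writer, addr, value, protocol):
    """Store from cache index `writer`, then keep the other caches coherent."""
    caches[writer].lines[addr] = value
    memory[addr] = value  # write-through, assumed only to keep the sketch small
    for i, cache in enumerate(caches):
        if i == writer or addr not in cache.lines:
            continue
        if protocol == "invalidate":
            del cache.lines[addr]       # the stale copy is dropped
        elif protocol == "update":
            cache.lines[addr] = value   # the stale copy is refreshed in place
        else:
            raise ValueError(f"unknown protocol: {protocol}")


def read(caches, memory, reader, addr):
    """Return (value, hit). A miss refetches the most recent data from memory."""
    cache = caches[reader]
    hit = addr in cache.lines
    if not hit:
        cache.lines[addr] = memory[addr]
    return cache.lines[addr], hit


def demo(protocol):
    """Both caches read a line, cache 0 writes it, then cache 1 reads it again."""
    memory = {0x40: 1}
    caches = [ToyCache(), ToyCache()]
    read(caches, memory, 0, 0x40)
    read(caches, memory, 1, 0x40)
    write(caches, memory, 0, 0x40, 2, protocol)
    return read(caches, memory, 1, 0x40)
```

In this sketch, `demo("invalidate")` returns the new value after a miss, whereas `demo("update")` returns the new value as a hit. Under either protocol the reader never observes the stale value.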
The memory model implemented in a multi-processor system also influences system performance and cache coherence design. Generally, there are two types of memory models: strong memory models and weak memory models. The strong memory model is also referred to as the sequential consistency memory model. The term sequential consistency comes from the requirement that all processors in the system must see all memory operations as occurring in the same relative order. Sequential consistency constrains the implementations of both the cache-coherence protocol and the memory system.
Weak memory models do not require the strong guarantee of sequential consistency for all of their memory accesses. Instead, code running on one processor that is producing data for another processor will explicitly indicate to the other processor that data are ready. This indication is done using synchronization operations. The data resulting from store operations by one processor prior to a synchronization operation are not expected to be read by another processor until after the synchronization operation occurs. The relative order of the store operations is immaterial to the other processor. However, by the time a processor sees the synchronization operation, the processor must no longer see any of the old data that have been overwritten by the store operations that preceded the synchronization operation. Weak memory models permit higher-performance implementations.
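The producer and consumer pattern described above can be sketched at the program level as follows. This is an illustration under stated assumptions, not a model of hardware memory ordering: `threading.Event` from the Python standard library stands in for the synchronization operation, and the function name `run_producer_consumer` and the buffer contents are hypothetical.

```python
import threading


def run_producer_consumer():
    """Producer stores data, then synchronizes; consumer reads only afterward."""
    buffer = [0, 0, 0]           # shared data, initially holding "old" values
    ready = threading.Event()    # stands in for the synchronization operation
    seen = []

    def producer():
        # The relative order of these stores is immaterial to the consumer.
        buffer[2] = 30
        buffer[0] = 10
        buffer[1] = 20
        ready.set()              # explicit indication that the data are ready

    def consumer():
        ready.wait()             # data are not read until the synchronization is seen
        seen.extend(buffer)      # by now none of the overwritten values may be visible

    c = threading.Thread(target=consumer)
    p = threading.Thread(target=producer)
    c.start()
    p.start()
    p.join()
    c.join()
    return seen
```

The consumer is started first to emphasize that correctness depends on the synchronization operation, not on the order in which the threads happen to run.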
While most current hardware cache-coherence implementations adopt some form of invalidation protocol, certain data sharing patterns perform very poorly under invalidation protocols. An example pattern is where one or more processors read or write a cache-line during the time another processor is storing to that cache-line. This pattern occurs even for what is called false sharing, where the former processors are using parts of the cache-line that are not being stored to by the latter processor. Update-based protocols with multi-writer support deal well with false sharing. False sharing typically becomes more serious with larger cache-lines. With the present trend toward larger cache-lines, false sharing is expected to become a more serious problem over time.
Many hardware-based shared memory systems implement a version of the invalidation-based cache coherence protocol because update-based systems, as generally implemented, incur substantial overhead. The overhead in an update-based system is caused by broadcasting or multi-casting an update message, in response to each store operation, to all caches that potentially could have copies of the data. In particular, if a processor performs several store operations to the same cache-line, current implementations send an update message for each store operation. This results in a large number of update operations, thereby impairing system performance.
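The scale of this overhead can be illustrated with simple arithmetic. The sketch below compares one update message per store against a hypothetical baseline that sends one message per modified cache-line. The function names, the 64-byte line size, and the sharer count are assumptions made for illustration only.

```python
def update_messages_per_store(store_addrs, sharers):
    """One update message to every sharer for every store operation."""
    return len(store_addrs) * sharers


def update_messages_per_line(store_addrs, sharers, line_size=64):
    """Hypothetical baseline: one message per modified cache-line per sharer."""
    modified_lines = {addr // line_size for addr in store_addrs}
    return len(modified_lines) * sharers


# Four stores that all fall within one 64-byte cache-line, with three sharers.
STORES = [0x100, 0x108, 0x110, 0x118]
PER_STORE = update_messages_per_store(STORES, sharers=3)   # 12 messages
PER_LINE = update_messages_per_line(STORES, sharers=3)     # 3 messages
```

With these assumed numbers, per-store updating sends four times as many messages as the per-line baseline, and the gap grows with the number of stores made to the same cache-line.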
A system and method that address the aforementioned problems, as well as other related problems, are therefore desirable.
The invention provides various arrangements and methods for cache management in a shared memory system. Each of a plurality of intercoupled processing nodes includes a higher-level cache and a lower-level cache having corresponding cache lines. At each node, update-state information is maintained in association with cache lines in the higher-level cache. The update-state information for a cache line tracks whether there is a pending update that needs to be distributed from the node. In response to a write-back operation referencing an address cached at a node, the node generates difference data that specifies differences between data in a cache line for the address in the higher-level cache and data in a corresponding cache line in the lower-level cache. The difference data are then provided to one or more other nodes with cached versions of the cache line for the address.
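One way the difference data might be formed and applied is sketched below. The byte granularity, the (offset, value) encoding, and the names `make_diff` and `apply_diff` are assumptions made for illustration, since the format of the difference data is not specified here. The sketch further assumes that the higher-level cache line holds the node's most recent stores.

```python
def make_diff(higher_line: bytes, lower_line: bytes):
    """Return (offset, new_byte) pairs where the higher-level line differs."""
    if len(higher_line) != len(lower_line):
        raise ValueError("cache lines must have the same length")
    return [
        (offset, new)
        for offset, (new, old) in enumerate(zip(higher_line, lower_line))
        if new != old
    ]


def apply_diff(line: bytes, diff):
    """Apply difference data to another node's cached copy of the same line."""
    updated = bytearray(line)
    for offset, value in diff:
        updated[offset] = value   # only the bytes named in the diff are touched
    return bytes(updated)


# Writing node: the higher-level line has changed at offsets 1 and 3.
LOWER = bytes([1, 2, 3, 4])
HIGHER = bytes([1, 9, 3, 7])
DIFF = make_diff(HIGHER, LOWER)

# Remote node: its copy carries its own change at offset 0, which is preserved.
REMOTE_AFTER = apply_diff(bytes([5, 2, 3, 4]), DIFF)
```

Because only the differing bytes are carried, several stores to one cache line collapse into a single set of difference data at write-back, and in this sketch bytes modified by another node elsewhere in the line are left untouched.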
Various example embodiments are set forth in the Detailed Description and Claims which follow.