Computer systems that depend on compiler-directed coherence require that all remote data be flushed from the caches at the beginning and end of parallel loops. This is done to make sure that all modifications during the loop are made visible to all other processors. With large L3 caches (32 MB or greater) becoming common, brute-force cache flushing at the beginning and end of loops can take a substantial amount of time, thus causing a large performance degradation in the application. For example, a 128 MB L3 that is 30% dirty takes at least 0.8 milliseconds to flush using a 50 GB/sec interconnect to main memory.
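The flush-time figure in the example above can be checked with a short calculation. The sketch below is illustrative only: the function name `min_flush_time_seconds` is hypothetical, and it assumes that 128 MB means 128 × 2^20 bytes, that 50 GB/sec means 50 × 10^9 bytes per second, and that the interconnect is fully utilized with no protocol overhead, so the result is a lower bound.

```python
def min_flush_time_seconds(cache_bytes, dirty_fraction, bandwidth_bytes_per_sec):
    """Lower bound on flush time: dirty bytes divided by interconnect bandwidth."""
    dirty_bytes = cache_bytes * dirty_fraction
    return dirty_bytes / bandwidth_bytes_per_sec


MIB = 2**20   # assumption: cache size quoted in binary megabytes
GB = 10**9    # assumption: bandwidth quoted in decimal gigabytes

flush_time = min_flush_time_seconds(128 * MIB, 0.30, 50 * GB)
print(f"{flush_time * 1e3:.2f} ms")  # roughly 0.8 ms
```

Under these assumptions the bound scales linearly with both the cache size and the dirty fraction, which is why larger L3 caches make brute-force flushing progressively more expensive.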
This problem also arises in another context. In multi-tier clustered systems it is sometimes desirable to maintain replicas of memory across multiple nodes in the cluster. Periodically, the replicas must be put in a consistent state by flushing all cached data out to the checkpoint copies. Schemes that accelerate the checkpoint function in hardware must ensure that all modified data in the hardware caches are propagated to all copies of memory.
The amount of time required to perform the cache flushing depends on the cache write-back policy. These policies can be broken into two basic types. One type is a write-through cache, which ensures that a cache never contains any dirty data. Although this ensures that no cache flushing is ever needed, it introduces a substantial amount of write-through traffic that exceeds the capacity of any cost-effective interconnect at present. Alternatively, a write-back cache allows one or more cache entries (e.g., one or more cache lines) to remain dirty in the cache until they are evicted. While write-through traffic is eliminated, streaming data may cause bursty write-backs (e.g., a large number of cache lines flushed in a short duration), causing bottlenecks on the interconnect. A variant of the write-back cache is the eager write-back cache, which flushes some of the dirty cache lines when it determines that there are idle bus cycles, instead of waiting for each dirty line to be evicted. This lowers the possibility of bursty write-backs causing a traffic bottleneck on the interconnect. However, it does not address the performance cost of flushing a large number of cache lines at the beginning and end of parallel loops or upon executing a hardware checkpoint function.
Accordingly, there is a need for a method and system that reduce the cache flushing time and improve performance.
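The trade-offs among the write-through, write-back, and eager write-back policies described above can be illustrated with a toy model. The sketch below is not taken from any particular hardware design: the `simulate` function, the trace format, the four-line LRU cache, and the rule that an idle bus cycle cleans exactly one dirty line are all assumptions made for illustration. It reports the number of lines written to memory during the trace and the number of dirty lines that would still have to be flushed afterward.

```python
from collections import OrderedDict


def simulate(policy, trace, capacity=4):
    """Return (lines written to memory during the trace, dirty lines left to flush).

    Each trace item is ("write", line_address) or ("idle",) for an idle bus cycle.
    policy is "write-through", "write-back", or "eager".
    """
    cache = OrderedDict()  # line address -> dirty flag, oldest (LRU) first
    traffic = 0
    for op in trace:
        if op[0] == "write":
            addr = op[1]
            if addr in cache:
                cache.move_to_end(addr)  # hit: mark as most recently used
            elif len(cache) >= capacity:
                _, was_dirty = cache.popitem(last=False)  # evict the LRU line
                traffic += int(was_dirty)  # a dirty victim is written back
            if policy == "write-through":
                cache[addr] = False  # memory is updated at once, line stays clean
                traffic += 1
            else:
                cache[addr] = True  # line stays dirty in the cache
        elif policy == "eager":
            # Idle bus cycle: clean the oldest dirty line, if there is one.
            victim = next((a for a, dirty in cache.items() if dirty), None)
            if victim is not None:
                cache[victim] = False
                traffic += 1
    return traffic, sum(cache.values())


# Hypothetical trace: four writes, an idle cycle, one more write, an idle cycle.
TRACE = [("write", 0), ("write", 1), ("write", 2), ("write", 3),
         ("idle",), ("write", 4), ("idle",)]

for name in ("write-through", "write-back", "eager"):
    print(name, simulate(name, TRACE))
```

In this trace the write-through policy leaves nothing to flush but sends every write across the interconnect, the write-back policy sends the least traffic but leaves the whole cache dirty, and the eager policy lands between the two while still leaving dirty lines behind at the flush point.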