Almost all computer systems use caches. A cache is a hardware component that stores data so that future requests for that data can be served faster. A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. Relative to main memory, a cache is a smaller, faster memory that stores copies of data from the most frequently used main memory locations. As long as most memory accesses are to cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory.
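The effect of the hit rate on average latency can be illustrated with the commonly used average memory access time relation, in which every access pays the cache latency and a miss additionally pays the main memory latency. The following is a minimal sketch only; the function name and the latency figures (1 ns for the cache, 100 ns for main memory) are illustrative assumptions and are not taken from any particular system.

```python
def average_access_time(hit_rate, cache_latency_ns, memory_latency_ns):
    """Average latency when a miss costs the cache lookup plus a main memory access."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    return cache_latency_ns + miss_rate * memory_latency_ns


# Assumed latencies: 1 ns cache, 100 ns main memory.
for rate in (0.50, 0.90, 0.99):
    latency = average_access_time(rate, 1.0, 100.0)
    print(f"hit rate {rate:.2f}: {latency:.1f} ns")
```

Under these assumed figures, a 99% hit rate yields an average of about 2 ns, whereas a 50% hit rate yields about 51 ns.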
Typically, data is transferred between memory and cache in blocks of fixed size, referred to as cache lines. When a cache line is copied from memory into a cache, a cache entry is created. The cache entry includes the copied data and the requested memory location (sometimes referred to as a “tag”). When a processor needs to read or write a location in main memory, the processor first checks for a corresponding entry in the cache. The cache checks for the contents of the requested memory location in any cache lines that might contain that address. If the processor finds that the memory location is in the cache, a cache hit has occurred; otherwise, a cache miss occurs. In other words, a “cache miss” refers to a failed attempt to read or write a piece of data in the cache, which results in a main memory access that is associated with much longer latency. In the case of a cache hit, the processor immediately reads or writes the data in the cache line. In the case of a cache miss, the cache may allocate a new entry and copy in data from main memory. Then, the request is fulfilled from the contents of the cache.
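The lookup sequence described above may be sketched with a small software model. The model below assumes a direct-mapped organization (each address maps to exactly one entry), a 64-byte cache line, and four entries; these choices, and all class and method names, are hypothetical and serve only to illustrate tags, hits, and misses.

```python
LINE_SIZE = 64   # bytes per cache line (assumed)
NUM_LINES = 4    # number of cache entries (assumed; tiny for illustration)


class DirectMappedCache:
    def __init__(self, memory):
        self.memory = memory                # backing "main memory" (a bytearray)
        self.entries = [None] * NUM_LINES   # each entry is (tag, line_data) or None
        self.hits = 0
        self.misses = 0

    def read(self, address):
        line_number = address // LINE_SIZE
        index = line_number % NUM_LINES     # which entry might hold this address
        tag = line_number // NUM_LINES      # identifies which line occupies the entry
        offset = address % LINE_SIZE        # byte position within the line
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            self.hits += 1                  # cache hit: serve from the cache line
        else:
            self.misses += 1                # cache miss: copy the line from memory
            base = line_number * LINE_SIZE
            entry = (tag, bytearray(self.memory[base:base + LINE_SIZE]))
            self.entries[index] = entry
        return entry[1][offset]


memory = bytearray(range(256)) * 4          # 1024 bytes of example data
cache = DirectMappedCache(memory)
cache.read(0)      # miss: the line holding address 0 is copied in
cache.read(8)      # hit: same cache line as address 0
cache.read(256)    # miss: maps to the same entry, different tag
cache.read(0)      # miss: the earlier line was replaced
print(cache.hits, cache.misses)
```

In this model the request is always fulfilled from the cache entry, either directly on a hit or after the line has been copied in on a miss.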
Shared memory multiprocessor systems are increasingly common. Each processor (or core) typically includes its own cache to store frequently accessed data items. Each processor has access to and operates on the same (shared) data. An issue that must be addressed in shared memory multiprocessor systems is coherency. Cache coherence is the discipline that ensures that changes in the values of shared data items are propagated throughout the system in a timely fashion. Cache coherency may be implemented in hardware, software, or a combination of hardware and software. As used herein, a “coherent cache system” (or simply “coherent cache”) is one that implements cache coherency primarily through a hardware-oriented approach. As used herein, a “non-coherent cache system” (or simply “non-coherent cache”) is one where software implements coherency among the caches of the system.
Numerous schemes have been proposed, both in academia and in industry, for implementing scalable coherent caches. However, large-scale coherent caches are complicated, expensive, and power intensive. It is also unclear whether coherent caches are scalable, because the hardware must ensure coherence among copies of data items held in multiple cache locations at every moment.
In contrast, non-coherent caches do not provide any hardware support for coherence and may store stale data. In non-coherent caches, software is required to ensure that stale data is not incorrectly accessed. Although this approach greatly reduces the design complexity and power consumption of the cache hardware, it adds certain performance overheads on the software side.
For instance, a typical critical section of a parallel software implementation generically appears as follows:
BEGIN_CRITICAL_SECTION()
   some_loop {
       random_read_of_shared_data
       do_local_computation()
       random_write_of_shared_data()
   }
END_CRITICAL_SECTION()
When such a parallel software implementation is ported to a system with non-coherent caches, some cache operations are added in order to ensure the correctness of the program. Specifically, the software should “invalidate” a cache at the beginning of each critical section. “Cache invalidation” is the process of deleting cache entries. Cache invalidation is performed because a particular cache might be holding (or storing) “stale” data items, or data items that have been updated in other cores but have not yet been updated in the particular cache. Similarly, at the end of each critical section, the software should flush all the “dirty” entries (or entries that contain data that has been modified but not yet reflected in shared memory) in a cache to make sure that the modified data is visible to other cores.
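The invalidate-on-entry and flush-on-exit pattern described above may be sketched as follows. This is a simplified software model rather than an implementation for any particular hardware: the NonCoherentCache class, its methods, and the critical_section helper are hypothetical names, shared memory is modeled as a dictionary, and the two "cores" are run one after the other for clarity.

```python
import threading


class NonCoherentCache:
    """Per-core cache with no hardware coherence; software must manage staleness."""

    def __init__(self, shared_memory):
        self.shared_memory = shared_memory
        self.lines = {}      # address -> locally cached value
        self.dirty = set()   # addresses modified locally but not yet written back

    def read(self, address):
        if address not in self.lines:        # miss: fetch from shared memory
            self.lines[address] = self.shared_memory[address]
        return self.lines[address]           # hit: the value may be stale

    def write(self, address, value):
        self.lines[address] = value
        self.dirty.add(address)

    def invalidate(self):
        # Deletes every entry. Assumes dirty entries were flushed beforehand.
        self.lines.clear()
        self.dirty.clear()

    def flush(self):
        # Makes locally modified data visible to other cores.
        for address in self.dirty:
            self.shared_memory[address] = self.lines[address]
        self.dirty.clear()


lock = threading.Lock()


def critical_section(cache, address):
    with lock:                               # BEGIN_CRITICAL_SECTION
        cache.invalidate()                   # discard potentially stale entries
        value = cache.read(address)
        cache.write(address, value + 1)
        cache.flush()                        # publish dirty entries
                                             # END_CRITICAL_SECTION


shared = {0: 0}
core_a = NonCoherentCache(shared)
core_b = NonCoherentCache(shared)
core_b.read(0)                # core_b now holds a copy that will become stale
critical_section(core_a, 0)
critical_section(core_b, 0)
print(shared[0])
```

If the invalidation at the start of the critical section were omitted in this model, core_b would increment its stale copy and overwrite the update made by core_a.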
However, the cache operations of invalidating and flushing add significant performance overhead to software execution for at least the following two reasons. First, data accesses that follow a cache invalidation induce cache misses. There are three kinds of cache misses: instruction read miss, data read miss, and data write miss. Because invalidation deletes cache entries regardless of how often they are used, cache misses may be introduced even for heavily used data, such as data stored on a stack. Second, a cache flush requires a significant amount of time because every cache entry must be examined and, if it holds dirty data, flushed.
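Both sources of overhead can be made visible by counting events in a small model. The sketch below is illustrative only: the CountingCache class, the 256-entry size, and the eight-address working set (standing in for heavily used stack data) are assumptions, and the counters measure events rather than actual time.

```python
class CountingCache:
    """Direct-mapped model that counts misses and entries examined by flushes."""

    def __init__(self, backing, num_entries):
        self.backing = backing                # list modeling shared memory
        self.entries = [None] * num_entries   # each slot: [address, value, dirty]
        self.misses = 0
        self.entries_examined = 0

    def _load(self, address):
        slot = address % len(self.entries)
        entry = self.entries[slot]
        if entry is None or entry[0] != address:
            self.misses += 1
            if entry is not None and entry[2]:
                self.backing[entry[0]] = entry[1]   # write back a replaced dirty entry
            entry = [address, self.backing[address], False]
            self.entries[slot] = entry
        return entry

    def read(self, address):
        return self._load(address)[1]

    def write(self, address, value):
        entry = self._load(address)
        entry[1] = value
        entry[2] = True

    def invalidate(self):
        # Assumes no dirty entries remain; they would be lost here.
        self.entries = [None] * len(self.entries)

    def flush(self):
        for entry in self.entries:            # every entry is examined
            self.entries_examined += 1
            if entry is not None and entry[2]:
                self.backing[entry[0]] = entry[1]
                entry[2] = False


backing = list(range(1024))
cache = CountingCache(backing, 256)
working_set = range(8)                        # assumed heavily used data

for address in working_set:                   # first pass warms the cache
    cache.read(address)
misses_cold = cache.misses

for address in working_set:                   # second pass hits every time
    cache.read(address)
misses_warm = cache.misses - misses_cold

cache.invalidate()
for address in working_set:                   # the same data misses again
    cache.read(address)
misses_after_invalidate = cache.misses - misses_cold - misses_warm

cache.write(3, 99)                            # a single dirty entry
cache.flush()
print(misses_cold, misses_warm, misses_after_invalidate, cache.entries_examined)
```

In this model, the working set misses again in full after the invalidation, and the flush examines all 256 entries even though only one of them is dirty.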
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.