It is known to provide multi-processing systems in which two or more master devices, for example processor cores, share access to shared memory. Such systems are typically used to gain higher performance by arranging the different processor cores to execute respective data processing operations in parallel. Known data processing systems which provide such multi-processing capabilities include IBM370 systems and SPARC multi-processing systems. These particular multi-processing systems are high performance systems where power efficiency and power consumption are of little concern and the main objective is maximum processing speed.
It should be noted that the various master devices need not be processor cores. For example, a multi-processing system may include one or more processor cores, along with other master devices such as a Direct Memory Access (DMA) controller, a graphics processing unit (GPU), etc.
To further improve speed of access to data within such multi-processing systems, it is known to provide one or more of the master devices with their own local cache in which to store a subset of the data held in the shared memory. Whilst this can improve speed of access to data, it complicates the issue of data coherency. In particular, it will be appreciated that if a particular master device performs a write operation in respect of a data value held in its local cache, that data value will be updated locally within the cache, but may not necessarily also be updated at the same time in the shared memory. In particular, if the data value in question relates to a write back region of memory, then the updated data value in the cache will only be stored back to the shared memory when that data value is subsequently evicted from the cache.
Since the data may be shared with other master devices, it is important to ensure that those master devices will access the up-to-date data when seeking to access the associated address in shared memory. To ensure that this happens, it is known to employ a cache coherency protocol within the multi-processing system to ensure that if a particular master device updates a data value held in its local cache, that up-to-date data will be made available to any other master device subsequently requesting access to that data.
The use of such cache coherency protocols can also give rise to power consumption benefits, since in situations where data required by a master device can be found within one of the caches, that data can be retrieved from the cache without the need for an access to the shared memory.
In accordance with a typical cache coherency protocol, certain accesses performed by a master device will require a coherency operation to be performed. The coherency operation will cause a coherency request to be sent to the other master devices identifying the type of access taking place and the address being accessed. This will cause those other master devices to perform certain coherency actions defined by the cache coherency protocol, and may also in certain instances result in information being fed back from one or more of those master devices to the master device initiating the access requiring the coherency operation. By such a technique, the coherency of the data held in the various local caches is maintained, ensuring that each master device accesses up-to-date data. One such cache coherency protocol is the “Modified, Owned, Exclusive, Shared, Invalid” (MOESI) cache coherency protocol.
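The handling of a snooped coherency request under such a protocol can be illustrated in outline. The following Python sketch (illustrative only; a real MOESI implementation is a hardware state machine with many more transition rules, and the function name is an assumption) models the five MOESI states of a cache line and the action taken when a coherency request signalling a remote write is observed:

```python
from enum import Enum

class MoesiState(Enum):
    MODIFIED = "M"   # dirty; the only valid copy in the system
    OWNED = "O"      # dirty, but copies may exist in other caches
    EXCLUSIVE = "E"  # clean; the only cached copy
    SHARED = "S"     # clean; other caches may also hold copies
    INVALID = "I"    # no valid data held for this line

def snoop_remote_write(state: MoesiState) -> tuple[MoesiState, bool]:
    """React to a coherency request indicating another master is writing
    to this line. Returns the new local state, and whether dirty data
    must first be supplied (forwarded or written back) so that the
    up-to-date value is not lost."""
    must_supply_data = state in (MoesiState.MODIFIED, MoesiState.OWNED)
    # Under a write invalidate policy, a remote write invalidates the
    # local copy whatever state it was previously in.
    return MoesiState.INVALID, must_supply_data
```

A line held dirty (Modified or Owned) must surrender its data before invalidation; a clean line (Exclusive or Shared) can simply be discarded.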
Cache coherency protocols will typically implement either a write update mechanism or a write invalidate mechanism when a master device seeks to perform a write operation in respect of a data value. In accordance with the write update mechanism, any cache that currently stores the data value is arranged to update its local copy to reflect the update performed by the master device. In accordance with a write invalidate mechanism, the cache coherency hardware causes all other cached copies of the data value to be invalidated, with the result that the writing master device's cache is then the only valid copy in the system. At that point, the master device can continue to write to the locally cached version without causing coherency problems.
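The behavioural difference between the two mechanisms can be sketched as follows (a minimal model in which each cache is represented as a dictionary of address-to-value mappings; the function names are illustrative, not part of any protocol specification):

```python
def write_update(caches: dict, writer: str, addr: int, value: int) -> None:
    """Write update: every cache currently holding the line refreshes
    its local copy to reflect the new value."""
    caches[writer][addr] = value
    for master, cache in caches.items():
        if master != writer and addr in cache:
            cache[addr] = value  # update the remote copy in place

def write_invalidate(caches: dict, writer: str, addr: int, value: int) -> None:
    """Write invalidate: all other cached copies are discarded, leaving
    the writer's cache holding the only valid copy, so that subsequent
    local writes need no further coherency traffic."""
    for master, cache in caches.items():
        if master != writer and addr in cache:
            del cache[addr]  # invalidate the remote copy
    caches[writer][addr] = value
```

After `write_invalidate`, the writing master can continue to write the line locally, whereas `write_update` requires every sharer to be updated on each write.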
It is typically much simpler to implement a write invalidate mechanism than it is to implement a write update mechanism, and accordingly the write invalidate mechanism is the most commonly used cache coherency technique.
However, when using the write invalidate mechanism, data can be removed from a master device's local cache before that master device has finished using it. For example, consider the master device producing the updated data value (referred to hereafter as the producer), and at least one other master device that is (or will be) performing data processing operations using that data value (such a master device being referred to hereafter as a consumer). The coherency operation performed on initiation of the write operation by the producer causes the consumer's locally cached copy of the data value to be invalidated, and the write then proceeds in respect of the producer's local cache. When the consumer subsequently issues a request for the data value, a miss will occur in its local cache, thereby introducing a latency whilst that data is retrieved from another cache, or from the shared memory (for example if the producer has subsequently evicted the updated data from its cache back to memory before the consumer requests the data). Hence, the use of the write invalidate mechanism has the potential to impact the performance of the data processing system.
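The latency penalty described above can be made concrete with a toy cost model (the hit and miss costs are illustrative figures only, not measurements of any real system):

```python
def consumer_read(consumer_cache: set, addr: int,
                  hit_cost: int = 1, miss_cost: int = 30) -> int:
    """Return the illustrative cost of a consumer read: a hit in the
    local cache is cheap, whereas a miss forces a fetch from another
    cache or from shared memory before the line is allocated locally."""
    if addr in consumer_cache:
        return hit_cost
    consumer_cache.add(addr)  # the line is allocated on the miss
    return miss_cost

# Before the producer writes, the consumer hits in its local cache.
consumer_cache = {0x100}
cost_before = consumer_read(consumer_cache, 0x100)   # cheap local hit
# The producer's write invalidate removes the consumer's copy...
consumer_cache.discard(0x100)
# ...so the consumer's next read suffers the full miss latency.
cost_after = consumer_read(consumer_cache, 0x100)
```

In this model the invalidation turns a one-cycle hit into a thirty-cycle miss, which is the performance impact the background discussion refers to.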
One known technique for seeking to increase the hit rate in a cache involves the use of prefetch mechanisms within a master device to seek to prefetch, into that master device's local cache, data before it is actually needed during performance of the data processing operations by that master device. Such prefetch mechanisms are described, for example, in the article “Cache-Only Memory Architectures” (COMA), by F Dahlgren et al, Computer, Volume 32, Issue 6, June 1999, Pages 72-79. COMA seeks to reduce the impact of frequent long-latency memory accesses by turning memory modules into large dynamic RAM (DRAM) caches called attraction memory (AM). When a processor requests a block from a remote memory, the block is inserted in both the processor's cache and the node's AM. Because a large AM is more capable of containing the node's current working data set than a cache is, more of the cache misses are satisfied locally within the node. The article also discusses prefetching as a means of retrieving data into a cache before the processor needs it. In software-controlled prefetching, the compiler or programmer inserts additional instructions in the code to perform the prefetching. Alternatively, hardware-controlled prefetching uses a mechanism that detects memory reference patterns and uses the patterns to automatically prefetch data.
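A common form of hardware-controlled prefetching is stride detection: if recent accesses are a constant distance apart, the prefetcher predicts that the pattern will continue. The following is a minimal sketch of that idea (the function names, the lookahead depth, and the choice of requiring three observed addresses are all illustrative assumptions):

```python
def detect_stride(addresses: list[int]):
    """Return the constant stride of the recent access stream, or None
    if no single stride is observed (at least three addresses needed)."""
    if len(addresses) < 3:
        return None
    strides = {b - a for a, b in zip(addresses, addresses[1:])}
    return strides.pop() if len(strides) == 1 else None

def prefetch_candidates(addresses: list[int], depth: int = 2) -> list[int]:
    """Addresses to fetch ahead of demand, or [] if no pattern is seen."""
    stride = detect_stride(addresses)
    if stride is None:
        return []
    last = addresses[-1]
    # Extrapolate the detected stride 'depth' accesses into the future.
    return [last + stride * i for i in range(1, depth + 1)]
```

A stream of accesses at 0x100, 0x140, 0x180 (stride 0x40) would cause 0x1C0 and 0x200 to be prefetched; an irregular stream yields no prefetches.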
Whilst such prefetching techniques could be used in the above-described producer-consumer scenario to improve the chance of the consumer's access requests hitting in its local cache, such prefetch mechanisms can significantly complicate the operation of the master device and impact the performance of that master device due to the processing overhead involved in implementing such prefetch mechanisms.
The above type of multi-processing system, where multiple master devices share data, use caches to improve performance, and use a cache coherency protocol to ensure that all of the master devices have a consistent view of the data, is often referred to as a coherent cache system. In non-coherent cache systems, where data is not shared between the various master devices, it is known to use lockdown mechanisms in individual caches in order to lockdown one or more cache lines of data to avoid that data being evicted whilst it is still needed by the associated master device. However, in coherent cache systems, such an approach is not practical. For example, when employing a write update mechanism, it would be complex to incorporate support for locked down cache lines. Furthermore, when employing the commonly used write invalidate mechanism, it would not be possible to allow cache lines to be locked down, due to the need to be able to invalidate a cache line in any cache as part of the cache coherency mechanism.
Considering the particular producer-consumer issue discussed earlier, an alternative mechanism which could reduce the latency of access to data by a consumer would be to bypass the cache system altogether, and instead to pass the data between producers and consumers using dedicated hardware, such as FIFOs. However, these would have to be sized and positioned appropriately at SoC (System-on-Chip) design time, and hence such an approach does not provide a very flexible mechanism. Further, such an approach would add to the cost and complexity of the design, and accordingly it will generally be considered preferable to continue to use the coherent cache system as a means for sharing data between the producers and consumers.
Another known mechanism for seeking to reduce latency of access to data by a master device is “read snarfing”. In particular, when master devices have a shared data bus, then the response to a read request issued by a particular master device (say M1) is visible to another master device (say M2) in the system, since the same read data pins are provided to both M1 and M2. By such an approach M2 could populate its own cache based on the read operations performed by M1 without needing to issue a read request itself, since it can “snarf” the values that were read by M1. However, such an approach lacks flexibility, since it requires a master device utilising the snarfing approach to have visibility of the read data bus that is provided to other master devices, and requires another of those master devices to perform a read before such snarfing can take place.
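The read snarfing behaviour described above can be sketched as follows (a toy model of a shared data bus; the function name and parameters are illustrative assumptions, not part of any bus specification):

```python
def bus_read(shared_memory: dict, caches: dict, requester: str,
             addr: int, snarfers: list[str]) -> int:
    """Model a read on a shared data bus: the requesting master (M1)
    fetches the value, and any master observing the same read data pins
    may 'snarf' the value into its own cache without ever issuing a
    read request of its own."""
    value = shared_memory[addr]
    caches[requester][addr] = value       # normal allocation on read
    for master in snarfers:
        caches[master][addr] = value      # opportunistic fill from the bus
    return value
```

Note that the snarfing master (M2) is filled only as a side effect of M1's read; if M1 never reads the line, M2 gains nothing, which is the inflexibility referred to above.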
Accordingly, it would be desirable to provide an improved technique for managing data within a cache forming a portion of a coherent cache system.