1. Field of the Invention
The present invention relates to techniques for managing cache coherency in a data processing apparatus, and in particular to techniques for managing snoop operations used to achieve such cache coherency.
2. Description of the Prior Art
It is known to provide multi-processing systems in which two or more processing units, for example processor cores, share access to shared memory. Such systems are typically used to gain higher performance by arranging the different processor cores to execute respective data processing operations in parallel. Known data processing systems which provide such multi-processing capabilities include IBM370 systems and SPARC multi-processing systems. These particular multi-processing systems are high performance systems where power efficiency and power consumption are of little concern and the main objective is maximum processing speed.
To further improve speed of access to data within such a multi-processing system, it is known to provide each of the processing units with its own local cache in which to store a subset of the data held in the shared memory. Whilst this can improve speed of access to data, it complicates the issue of data coherency. In particular, it will be appreciated that if a particular processor performs a write operation with regard to a data value held in its local cache, that data value will be updated locally within the cache, but may not necessarily also be updated at the same time in the shared memory. In particular, if the data value in question relates to a write back region of memory, then the updated data value in the cache will only be stored back to the shared memory when that data value is subsequently evicted from the cache.
Since the data may be shared with other processors, it is important to ensure that those processors will access the up-to-date data when seeking to access the associated address in shared memory. To ensure that this happens, it is known to employ a cache coherency protocol within the multi-processing system to ensure that if a particular processor updates a data value held in its local cache, that up-to-date data will be made available to any other processor subsequently requesting access to that data. Similarly, if that processor reads a data value, the cache coherency protocol will ensure that the processor obtains the most up-to-date data even if that data is held in a cache local to another processor.
In accordance with a typical cache coherency protocol, certain accesses performed by a processor will require a coherency operation to be performed. Often, the coherency mechanism employs a snoop-based scheme, and in such situations the coherency operation takes the form of a snoop process during which snoop operations are performed in the caches of other processors. In particular, given the type of access taking place and the address being accessed, the caches being snooped will perform certain actions defined by the cache coherency protocol, and this may also in certain instances result in information being fed back from one or more of those caches to the processor performing the access that caused the coherency operation to be initiated. By such a technique, the coherency of the data held in the various local caches is maintained, ensuring that each processor accesses up-to-date data. One such cache coherency protocol is the “Modified, Exclusive, Shared, Invalid” (MESI) cache coherency protocol.
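By way of illustration only, the actions taken by a snooped cache under a MESI-style protocol can be sketched as follows. This is a minimal sketch of the commonly described MESI transitions, not a definitive implementation of any particular system; the names `State` and `snoop`, and the reduction of the access type to a single `remote_write` flag, are assumptions made for this example.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def snoop(state, remote_write):
    """Return (new_state, supplies_dirty_data) for a line in a snooped cache.

    remote_write is True when the access that triggered the snoop is a write
    (the requester needs ownership), and False when it is a read.
    """
    if state is State.INVALID:
        # The line is not present: nothing changes, nothing is fed back.
        return State.INVALID, False
    # Only a Modified line holds data newer than the shared memory.
    supplies_dirty_data = state is State.MODIFIED
    # A remote write invalidates the local copy; a remote read demotes it.
    new_state = State.INVALID if remote_write else State.SHARED
    return new_state, supplies_dirty_data
```

For example, `snoop(State.MODIFIED, False)` yields a Shared line together with an indication that the snooped cache must feed its up-to-date data back to the requester.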
If a particular piece of data can be guaranteed to be exclusively used by only one of the processors, then when that processor accesses that data, a coherency operation will not be required. However, in a typical multi-processing system, much of the data will be shared amongst the processors, either because the data is generally classed as shared data, or because the multi-processing system allows for the migration of processes between processors, or indeed for a particular process to be run in parallel on multiple processors, with the result that even data that is specific to a particular process cannot be guaranteed to be exclusively used by a particular processor.
Accordingly, it will be appreciated that coherency operations will be required to be performed frequently, and this will result in significant numbers of snoop operations being performed in the caches to determine whether the data value that is the subject of a particular access request is or is not within those caches. Hence, by way of example, if a cache line in the cache associated with one of the processing units has its data content modified, and that data is shared, this will typically cause a coherency operation to be performed as a result of which snoop operations will be performed in all of the other caches associated with the other processing units. If the same cache line is stored within those caches, that copy of the cache line will either be invalidated or updated dependent on the coherency protocol being applied. However, if in fact a copy of that cache line does not reside in the cache, nothing further is required, but some energy is consumed as a result of performing the snoop operation within the cache due to the lookup performed with respect to the tag entries of the cache. Accordingly, it can be seen that the data processing apparatus consumes considerable energy (also referred to as snoop energy) in performing snoop operations in order to find out whether a copy of a cache line exists in a cache or not. Traditionally this is done for each cache associated with each processing unit, and if the snoop hit rate is very low (i.e. a large proportion of the caches subjected to the snoop operation do not locate a copy of the cache line in question), it is clear that significant snoop energy is wasted.
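The wasted snoop energy described above can be illustrated with a simple counting model. The following is a sketch only, under the assumptions that each cache is modelled as a set of cache line addresses and that each tag lookup costs one unit of energy; the function name `broadcast_snoop` is hypothetical.

```python
def broadcast_snoop(caches, requester, address):
    """Snoop every cache except the requester's; return (lookups, hits).

    caches is a list of sets of line addresses, one set per processing unit.
    """
    lookups = 0
    hits = 0
    for index, lines in enumerate(caches):
        if index == requester:
            continue  # the requester does not snoop its own cache
        lookups += 1  # a tag lookup is performed whether or not the line is present
        if address in lines:
            hits += 1
    return lookups, hits


# Four processing units; only unit 2 shares line 0x100 with the requester.
caches = [{0x100}, {0x200}, {0x100, 0x300}, set()]
lookups, hits = broadcast_snoop(caches, requester=0, address=0x100)
wasted = lookups - hits
```

In this example three tag lookups are performed but only one locates the line, so two of the three lookups consume energy without contributing to coherency.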
Whilst energy consumption in some multi-processing systems may not be a key concern, as use of multi-processing systems becomes more widespread, there are many modern day implementations (e.g. multi-core systems) where energy consumption is very important.
A number of techniques have been developed with the aim of seeking to reduce the energy consumption associated with performing such snoop operations in a data processing apparatus having multiple processing units. For example, the article entitled “JETTY: Filtering Snoops for Reduced Energy Consumption in SMP Servers” by A Moshovos et al, Proceedings of International Symposium on High Performance Computer Architecture (HPCA-7), January 2001, describes a technique for reducing the energy consumed by snoop requests in a symmetric multiprocessor (SMP) system, where a small, cache-like, structure, referred to therein as a JETTY, is introduced in between the bus and the level two cache at each SMP node. Each SMP node has local level one data and instruction caches and a level two cache. Every snoop request issued over the snoop bus to an SMP node first goes to the associated JETTY, with a lookup being performed in that JETTY to determine whether the data value in question is definitely not in the associated level two cache (and accordingly that level two cache does not need to be subjected to a snoop operation), or whether there may be a copy of the data value in question in that cache (and therefore the level two cache does need to be subjected to a snoop operation). In accordance with one embodiment described, a Bloom filter mechanism is used to implement each JETTY.
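The "definitely not present" versus "may be present" distinction made by such a filter can be sketched using a counting Bloom filter. The sketch below is illustrative only and is not taken from the JETTY article: the class name `SnoopFilter`, the number of tables, the index width, and the use of adjacent address bit slices as hash functions are all assumptions made for this example.

```python
class SnoopFilter:
    """Counting Bloom filter placed in front of a cache (illustrative sketch)."""

    def __init__(self, num_tables=3, index_bits=6):
        self.index_bits = index_bits
        self.tables = [[0] * (1 << index_bits) for _ in range(num_tables)]

    def _indices(self, line_addr):
        # Assumption: each table is indexed by a different slice of the address.
        mask = (1 << self.index_bits) - 1
        return [(line_addr >> (i * self.index_bits)) & mask
                for i in range(len(self.tables))]

    def allocate(self, line_addr):
        """Record that a line has been brought into the cache."""
        for table, idx in zip(self.tables, self._indices(line_addr)):
            table[idx] += 1

    def evict(self, line_addr):
        """Record that a previously allocated line has left the cache."""
        for table, idx in zip(self.tables, self._indices(line_addr)):
            table[idx] -= 1

    def may_contain(self, line_addr):
        """False means definitely absent (snoop skipped); True means possibly present."""
        return all(table[idx] > 0
                   for table, idx in zip(self.tables, self._indices(line_addr)))
```

A snoop request is forwarded to the cache only when `may_contain` returns True. False positives cost an unnecessary lookup, but a False result is never wrong provided every allocation and eviction is recorded, so coherency is not compromised.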
Another SMP-based snoop energy reduction technique was proposed by Saldanha et al in the article “Power Efficient Cache Coherence”, Workshop on Memory Performance Issues in Conjunction with ISCA, June 2001. In this article, a similar SMP structure is described to that disclosed in the earlier JETTY article, namely having four processor nodes, each having a level two cache. The approach described in the article to reduce snoop energy is to serialise the snooping process in a hierarchical way. This technique is described in the article as “Serial Snooping”, which is only applied to read misses. In the event of a read miss, the neighbour node of the requesting processor is snooped to get the requested block of data. If that neighbour node does not have the requested block, the next node is snooped. This process is continued until a cache or the memory supplies the requested block. Two drawbacks of such an approach are that it can only reduce the volume of snoop operations resulting from read misses, and also it increases the latency of the load operation.
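The serial snooping of read misses described above can be sketched as follows. This is a simplified model rather than the scheme as implemented in the article: caches are modelled as sets of block addresses, the neighbour order is assumed to be a simple ring, and the name `serial_snoop` is hypothetical.

```python
def serial_snoop(caches, requester, address):
    """Snoop neighbouring nodes one at a time on a read miss.

    Returns (supplier, lookups), where supplier is the index of the node that
    supplied the block, or None if the block had to come from memory.
    """
    node_count = len(caches)
    lookups = 0
    for step in range(1, node_count):
        node = (requester + step) % node_count  # next neighbour in the ring
        lookups += 1
        if address in caches[node]:
            return node, lookups  # stop as soon as a cache supplies the block
    return None, lookups  # every cache missed; memory supplies the block
```

Since the lookups are performed one after another, the lookup count also approximates the added latency of the load. When no cache holds the block, every other cache is still searched, which is the case in which no energy is saved.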
In the article “Evaluation of Snoop-Energy Reduction Techniques for Chip-Multiprocessors” by M Ekman et al, Workshop on Duplicating, Deconstructing and Debunking, in Conjunction with ISCA, May 2002, the above mentioned JETTY and serial snooping techniques are evaluated and it is concluded in that article that serial snooping does not manage to cut much energy because most of the time no caches will be able to respond, which means that all caches will be searched. With regard to the JETTY technique, it is observed that a significant portion of the snoop energy is cut but this saving is outweighed by the energy lost in implementing the JETTY mechanism.
The article “A Power-Aware Prediction-Based Cache Coherence Protocol for Chip Multiprocessors” by E Atoofian et al, IPDPS, page 343, 2007 IEEE International Parallel and Distributed Processing Symposium, 2007, describes a technique where each cache has a single entry predictor identifying which cache had the data most recently accessed by its associated processor, together with the number of consecutive hits from that cache. Initially, snoop operations are issued in parallel to all caches, but if the number of consecutive hits is above a threshold, then an initial snoop is performed only in the cache identified in the single entry predictor, with a fully parallel snoop of the remaining caches then being initiated if the initial snoop misses. The scheme can provide a reduction in snoop energy in systems where there is high correlation between temporal locality and hit probability in a given cache. Such a system would, for example, be one where if a cache miss occurs in the cache of processor A, and the required data is residing in and provided by processor B's local cache, there is a high probability that next time processor A has a cache miss in its local cache, that data will again be provided by processor B's local cache. However, such a scheme lacks generality or flexibility, and will provide little benefit in many systems.
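The behaviour of such a single entry predictor can be sketched as follows. The sketch is illustrative only: the class name `SnoopPredictor`, the default threshold, and the rule used to reset the hit count when a different cache (or no cache) supplies the data are assumptions, as the description above does not specify them.

```python
class SnoopPredictor:
    """Single entry predictor for one requesting cache (illustrative sketch)."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.predicted = None       # cache that most recently supplied data
        self.consecutive_hits = 0   # consecutive times that cache supplied data

    def snoop(self, caches, requester, address):
        """Return (supplier, lookups); supplier is None if no cache has the data."""
        others = [i for i in range(len(caches)) if i != requester]
        lookups = 0
        supplier = None
        if self.predicted is not None and self.consecutive_hits > self.threshold:
            lookups += 1  # initial snoop of the predicted cache only
            if address in caches[self.predicted]:
                supplier = self.predicted
            else:
                others.remove(self.predicted)  # already snooped; skip it below
        if supplier is None:
            lookups += len(others)  # parallel snoop of the remaining caches
            supplier = next((i for i in others if address in caches[i]), None)
        self._update(supplier)
        return supplier, lookups

    def _update(self, supplier):
        if supplier is not None and supplier == self.predicted:
            self.consecutive_hits += 1
        else:
            # Assumed reset rule: start counting afresh for the new supplier.
            self.predicted = supplier
            self.consecutive_hits = 1 if supplier is not None else 0
```

Once the same cache has supplied data more often than the threshold, a correct prediction costs a single lookup; a misprediction costs the same number of lookups as a full broadcast, but with the additional latency of the initial snoop.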
The article “Multicast Snooping: A New Coherence Method Using a Multicast Address Network” by E Bilir et al, in Proceedings of the 26th Annual International Symposium on Computer Architecture, Atlanta, Ga., May 1999, describes a coherence method called “multicast snooping” that dynamically adapts between broadcast snooping and a directory protocol. In accordance with this scheme, each coherence transaction leaving a processor is accompanied by a multicast mask that specifies which processors should snoop the transaction. Masks are generated using prediction and need not be correct. A simplified directory in memory then checks the mask of each transaction, detecting masks that omit necessary processors, and taking corrective action. Such an approach is costly in terms of the need to support both snoop based and directory based coherency schemes, and will impact performance in situations where the original predicted mask is incorrect.
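The mask check performed by the simplified directory can be sketched with bitmasks, one bit per processor. This is an illustrative sketch only; the function name `check_mask` and the choice of widening the mask as the corrective action are assumptions made for this example.

```python
def check_mask(predicted_mask, sharers_mask):
    """Check a predicted multicast mask against the processors that must snoop.

    Bit i of each mask corresponds to processor i. Returns (ok, corrected_mask),
    where ok is False if the prediction omitted a necessary processor.
    """
    missing = sharers_mask & ~predicted_mask  # necessary processors left out
    return missing == 0, predicted_mask | missing


# The prediction covers processors 0 and 1, but processors 1 and 2 hold the data.
ok, corrected = check_mask(0b0011, 0b0110)
```

Here `ok` is False and `corrected` is `0b0111`, so the transaction must be reissued with processor 2 included. A mask that is too wide is still accepted, at the cost of unnecessary snoops.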
Various prior art techniques have been developed in the area of “snoop filtering”, where the techniques attempt to accurately identify any caches which do not need to be subjected to the snoop, such that the number of snoop operations can be reduced. For example, the article “TLB and Snoop Energy-Reduction using Virtual Caches in Low-Power Chip-Multiprocessors” by M Ekman et al, ISLPED'02, August 2002, describes a snoop filtering technique for level one virtual caches in a Chip-Multiprocessor (CMP) system. Further, the article “RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence” by A Moshovos, Proceedings of the 32nd Annual International Symposium on Computer Architecture, 2005, describes another snoop filtering technique that performs filtering at a coarser granularity, i.e. memory regions rather than cache blocks.
In addition, co-pending, commonly owned, U.S. patent application Ser. No. 11/454,834 describes a scheme where masks are maintained in association with each cache identifying, for the process currently being executed by that cache's processor, those other processors that may also have data associated with that process, such that when a snoop operation is required, the number of caches which need to be subjected to that snoop operation can be reduced.
Co-pending, commonly owned, U.S. patent application Ser. No. 11/709,279 describes a scheme where cache coherency circuitry takes advantage of information already held by indication circuitry provided for each cache and used to reduce the number of segments of the cache subjected to cache lookup operations. In particular, the cache coherency circuitry has snoop indication circuitry associated therewith whose content is derived from the segment filtering data of each indication circuitry. When snoop operations are required, the snoop indication circuitry is then referenced to determine which caches need to be subjected to a snoop operation.
Whilst significant development has been undertaken in the area of snoop filtering, it is still likely that unnecessary snoop operations will be performed even if snoop filtering techniques are used. Accordingly, irrespective of whether snoop filtering is used or not, it would still be desirable to provide an improved technique for managing snoop operations, and in particular one which has the potential to further reduce the energy consumed when performing snoop operations.
The energy consumption issue is becoming more significant with recent developments in interconnect technology. In particular, traditional snoop-based cache coherency solutions employed a shared bus interconnecting the snooped caches, which allowed for the cheap broadcast of snoop requests to all participating caches. However, with more modern System-on-Chip (SoC) architectures, shared buses are no longer desirable for connecting physically and logically disparate cache components, and these are instead often replaced with point-to-point request/response channels. A side effect of this is that broadcasting snoop requests to all participating caches incurs significant power consumption. However, the parallel snooping of multiple caches is optimal in terms of performance.
Accordingly, it would be desirable to develop a technique which provided flexibility with respect to the handling of snoop operations, allowing a reduction in energy consumption whilst retaining the ability to achieve high performance.