It is known to provide multi-processing systems in which two or more processing units, for example processor cores, share access to shared memory. Such systems are typically used to gain higher performance by arranging the different processor cores to execute respective data processing operations in parallel. Known data processing systems which provide such multi-processing capabilities include IBM370 systems and SPARC multi-processing systems. These particular multi-processing systems are high performance systems where power efficiency and power consumption are of little concern and the main objective is maximum processing speed.
To further improve speed of access to data within such a multi-processing system, it is known to provide each of the processing units with its own local cache in which to store a subset of the data held in the shared memory. Whilst this can improve speed of access to data, it complicates the issue of data coherency. In particular, it will be appreciated that if a particular processor performs a write operation with regard to a data value held in its local cache, that data value will be updated locally within the cache, but may not necessarily also be updated at the same time in the shared memory. In particular, if the data value in question relates to a write back region of memory, then the updated data value in the cache will only be stored back to the shared memory when that data value is subsequently evicted from the cache.
Since the data may be shared with other processors, it is important to ensure that those processors will access the up-to-date data when seeking to access the associated address in shared memory. To ensure that this happens, it is known to employ a cache coherency protocol within the multi-processing system to ensure that if a particular processor updates a data value held in its local cache, that up-to-date data will be made available to any other processor subsequently requesting access to that data.
In accordance with a typical cache coherency protocol, certain accesses performed by a processor will require a coherency operation to be performed. The coherency operation will cause a notification (also referred to herein as a snoop request) to be sent to the other processors identifying the type of access taking place and the address being accessed. This will cause those other processors to perform certain actions defined by the cache coherency protocol, and may also in certain instances result in certain information being fed back from one or more of those processors to the processor initiating the access requiring the coherency operation. By such a technique, the coherency of the data held in the various local caches is maintained, ensuring that each processor accesses up-to-date data. One such cache coherency protocol is the “Modified, Exclusive, Shared, Invalid” (MESI) cache coherency protocol.
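By way of illustration only, the effect of a snoop request on a cache line under a MESI-style protocol can be sketched as follows. The sketch assumes just two kinds of snooped access ("read" and "write"); the names `State` and `on_snoop` are hypothetical, and real protocols define further transaction types and responses.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def on_snoop(state, access):
    """Return (next_state, must_supply_data) for a line held in `state`
    when a snoop request for `access` ("read" or "write") arrives."""
    if access not in ("read", "write"):
        raise ValueError("access must be 'read' or 'write'")
    if state is State.INVALID:
        # No copy is held, so nothing changes and no data is fed back.
        return State.INVALID, False
    # Only a Modified line holds data newer than the shared memory.
    dirty = state is State.MODIFIED
    if access == "read":
        # Another processor wants a copy: keep the line, but as Shared.
        return State.SHARED, dirty
    # Another processor wants to write: the local copy must be dropped.
    return State.INVALID, dirty
```

In this simplified model, a line in the Modified state is the only case in which information is fed back to the processor initiating the access.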
If a particular piece of data can be guaranteed to be exclusively used by only one of the processors, then when that processor accesses that data, a coherency operation will not be required. However, in a typical multi-processing system, much of the data will be shared amongst the processors, either because the data is generally classed as shared data, or because the multi-processing system allows for the migration of processes between processors, or indeed for a particular process to be run in parallel on multiple processors, with the result that even data that is specific to a particular process cannot be guaranteed to be exclusively used by a particular processor.
Accordingly, it will be appreciated that coherency operations will be required to be performed frequently, and this will result in significant numbers of snoop requests being issued to the caches to cause those caches to perform a snoop operation in order to determine whether the data value that is the subject of a particular access request is or is not within those caches. Hence, by way of example, if a cache line in the cache associated with one of the processing units has its data content modified, and that data is shared, this will typically cause a coherency operation to be performed as a result of which snoop requests will be issued to all of the other caches associated with the other processing units to cause those caches to perform snoop operations. If the same cache line is stored within those caches, that copy of the cache line will either be invalidated or updated dependent on the coherency protocol being applied. However, if in fact a copy of that cache line does not reside in the cache, nothing further is required, but some energy is consumed as a result of performing the snoop operation within the cache due to the lookup performed with respect to the tag entries of the cache. Accordingly, it can be seen that the data processing apparatus consumes some considerable energy (also referred to as snoop energy) in performing snoop operations in order to find out whether a copy of a cache line exists in a cache or not. Traditionally this is done for each cache associated with each processing unit, and if the snoop hit rate is very low (i.e. a large proportion of the caches subjected to the snoop operation do not locate a copy of the cache line in question), it is clear that significant snoop energy is wasted.
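The cost of such broadcast snooping can be illustrated with a minimal sketch, given here under stated assumptions: each cache is modelled as a plain set of line addresses, the protocol is assumed to invalidate rather than update remote copies, and the function name `broadcast_write_snoop` is hypothetical.

```python
def broadcast_write_snoop(caches, writer, line_addr):
    """Snoop every cache except the writer's for `line_addr`.

    Returns (tag_lookups, hits). Each tag lookup consumes energy
    whether or not the line is found.
    """
    tag_lookups = 0
    hits = 0
    for cpu, cache in enumerate(caches):
        if cpu == writer:
            continue
        tag_lookups += 1  # energy is spent here regardless of the outcome
        if line_addr in cache:
            cache.discard(line_addr)  # assumed invalidate-based protocol
            hits += 1
    return tag_lookups, hits


# Four caches; only cache 2 shares line 0x40 with the writer (cache 0).
caches = [{0x40}, {0x80}, {0x40, 0xC0}, set()]
lookups, hits = broadcast_write_snoop(caches, writer=0, line_addr=0x40)
```

Here three tag lookups are performed to find a single copy, so two of the three lookups contribute only wasted snoop energy.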
Whilst energy consumption in some multi-processing systems may not be a key concern, as use of multi-processing systems becomes more widespread, there are many modern day implementations (e.g. multi-core systems) where energy consumption is very important.
A number of techniques have been developed with the aim of seeking to reduce the energy consumption associated with performing such snoop operations in a data processing apparatus having multiple processing units. For example, the article entitled “JETTY: Filtering Snoops for Reduced Energy Consumption in SMP Servers” by A Moshovos et al, Proceedings of International Symposium on High Performance Computer Architecture (HPCA-7), January 2001, describes a technique for reducing the energy consumed by snoop requests in a symmetric multiprocessor (SMP) system, where a small, cache-like, structure, referred to therein as a JETTY, is introduced in between the bus and the level two cache at each SMP node. Each SMP node has local level one data and instruction caches and a level two cache. Every snoop request issued over the snoop bus to an SMP node first goes to the associated JETTY, with a lookup being performed in that JETTY to determine whether the data value in question is definitely not in the associated level two cache (and accordingly that level two cache does not need to be subjected to a snoop operation), or whether there may be a copy of the data value in question in that cache (and therefore the level two cache does need to be subjected to a snoop operation). In accordance with one embodiment described, a Bloom filter mechanism is used to implement each JETTY.
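The "definitely not present" versus "may be present" behaviour of such a filter can be sketched with a counting Bloom filter, as below. This is an illustrative sketch only and is not the structure described in the JETTY article: the class name, the table size, the number of hashes and the use of Python's built-in `hash` as the hash function are all assumptions.

```python
class CountingBloomSnoopFilter:
    """Counting Bloom filter tracking the lines a cache may hold."""

    def __init__(self, num_counters=64, num_hashes=2):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _slots(self, line_addr):
        # Hypothetical hash functions: built-in hash of (index, address).
        return [hash((i, line_addr)) % len(self.counters)
                for i in range(self.num_hashes)]

    def on_fill(self, line_addr):
        # Called when the cache allocates the line.
        for slot in self._slots(line_addr):
            self.counters[slot] += 1

    def on_evict(self, line_addr):
        # Called when the cache evicts or invalidates the line.
        for slot in self._slots(line_addr):
            self.counters[slot] -= 1

    def may_contain(self, line_addr):
        # False means "definitely not cached", so the snoop can be skipped.
        # True means "possibly cached", so the cache must still be snooped.
        return all(self.counters[slot] > 0
                   for slot in self._slots(line_addr))
```

A filter of this kind never reports a cached line as absent, but it may report an absent line as possibly present, in which case the cache lookup is performed unnecessarily.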
Another SMP-based snoop energy reduction technique was proposed by Saldanha et al in the article “Power Efficient Cache Coherence”, Workshop on Memory Performance Issues in Conjunction with ISCA, June 2001. In this article, a similar SMP structure is described to that disclosed in the earlier JETTY article, namely having four processor nodes, each having a level two cache. The approach described in the article to reduce snoop energy is to serialise the snooping process in a hierarchical way. This technique is described in the article as “Serial Snooping”, which is only applied to read misses. In the event of a read miss, the neighbour node of the requesting processor is snooped to get the requested block of data. If that neighbour node does not have the requested block, the next node is snooped. This process is continued until a cache or the memory supplies the requested block. Two drawbacks of such an approach are that it can only reduce the volume of snoop operations resulting from read misses, and also it increases the latency of the load operation.
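The serial snooping of a read miss can be sketched as follows. The sketch assumes that the nodes are visited in ring order starting from the requester's neighbour and models each cache as a set of block addresses; the function name `serial_snoop_read` is hypothetical.

```python
def serial_snoop_read(caches, requester, block_addr):
    """Snoop the other nodes one at a time on a read miss.

    Returns (supplier, lookups), where supplier is the index of the node
    that supplied the block, or None if memory had to supply it.
    """
    n = len(caches)
    lookups = 0
    for step in range(1, n):
        node = (requester + step) % n  # assumed ring ordering of neighbours
        lookups += 1
        if block_addr in caches[node]:
            return node, lookups  # stop as soon as a cache can respond
    return None, lookups  # no cache held the block; memory supplies it


caches = [set(), set(), {0x40}, set()]
found = serial_snoop_read(caches, requester=0, block_addr=0x40)
missed = serial_snoop_read(caches, requester=0, block_addr=0x80)
```

In this example the block held by node 2 is found after two lookups, whereas the block held by no cache still requires all three remote caches to be searched, each lookup adding to the latency of the load.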
In the article “Evaluation of Snoop-Energy Reduction Techniques for Chip-Multiprocessors” by M Ekman et al, Workshop on Duplicating, Deconstructing and Debunking, in Conjunction with ISCA, May 2002, the above mentioned JETTY and serial snooping techniques are evaluated and it is concluded in that article that serial snooping does not manage to cut much energy because most of the time no caches will be able to respond, which means that all caches will be searched. With regard to the JETTY technique, it is observed that a significant portion of the snoop energy is cut but this saving is outweighed by the energy lost in implementing the JETTY mechanism.
The article “TLB and Snoop Energy-Reduction using Virtual Caches in Low-Power Chip-Multiprocessors”, by M. Ekman et al, ISLPED '02, August 2002, describes a snoop filtering technique for level 1 virtual caches in a Chip-Multiprocessor (CMP) system. Each cache keeps track of memory pages rather than single cache blocks shared by other caches using a table called a page sharing table (PST). When a new page is loaded into the cache, its PST sends the physical address of the new page to all other PSTs. These PSTs check whether they share the page and acknowledge to the initiating PST if the page is shared so that the initiating PST knows who shares this page. The main disadvantage of this technique is that when a PST entry is evicted, all cache blocks belonging to the page must be flushed from the cache. In order to avoid this, the authors introduce an additional hardware mechanism, which complicates the design of the snoop filter hardware.
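The page-level handshake between PSTs can be sketched as below. This is a simplified illustration rather than the design of the cited article: the class and function names are hypothetical, and it is assumed (not stated above) that a responding PST also records the initiating node as a sharer.

```python
class PageSharingTable:
    """Per-cache table mapping a physical page to its known sharers."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.sharers = {}  # physical page number -> set of sharing node ids

    def holds(self, page):
        return page in self.sharers


def load_page(psts, requester, page):
    """Register `page` at the requester and collect acknowledgements."""
    me = psts[requester]
    me.sharers[page] = set()
    for other in psts:
        if other is me:
            continue
        if other.holds(page):
            # The acknowledgement tells the initiator who shares the page.
            me.sharers[page].add(other.node_id)
            # Assumption: the responder also notes the new sharer.
            other.sharers[page].add(me.node_id)
    return me.sharers[page]


def snoop_targets(psts, requester, page):
    """Nodes that need to be snooped for an access to `page`."""
    return sorted(psts[requester].sharers.get(page, ()))
```

With such a table, an access to a page that has no recorded sharers need not generate any snoop requests at all.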
The article “RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence”, by A Moshovos, Proceedings of the 32nd Annual International Symposium on Computer Architecture, 2005, describes another snoop filtering technique that performs filtering at a coarser granularity, i.e. memory regions rather than cache blocks. Each memory region consists of a contiguous portion of the memory. For each SMP or CMP node, two tables are provided to provide snoop filtering, namely a Not Shared Region Table (NSRT) which is a small set-associative cache that keeps a record of regions that are not shared, and a Cache Region Hash (CRH) table which is a hash table that records regions that are locally cached. The CRH is accessed by the region address and each entry contains a counter that counts the number of the matching cache blocks in the region and a present bit. In this sense, the CRH works like the earlier-mentioned JETTY scheme. When a cache sends a snoop request to other caches, the other caches first check their own CRH to check whether they have a block in this region. If none of the caches have a block in this region, then the requesting cache allocates an entry in its NSRT meaning that it has the region but it is not shared by anyone. The next time around, a snoop request will not be broadcast to other caches because the local NSRT says the region is not shared. However, later on when another cache wants to share the same region, the cache that has the region has to invalidate its NSRT entry for that region, meaning that the region is now shared. The disadvantage of this scheme is that every snoop request requires performance of a lookup in the local NSRT as well as the other NSRTs. This increases energy consumption because of the NSRT tag lookups.
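The interplay of the two tables can be sketched as follows. This is a minimal illustration under several assumptions: the region size (`REGION_BITS`), the table size, and the names `Node` and `miss` are hypothetical, the NSRT is modelled as an unbounded set rather than a small set-associative cache, and the present bit is folded into the test "counter greater than zero".

```python
REGION_BITS = 12  # hypothetical region size of 4 KiB


class Node:
    def __init__(self, crh_entries=64):
        self.crh = [0] * crh_entries  # cached-block counters per hashed region
        self.nsrt = set()             # regions known not to be shared

    def may_cache_region(self, region):
        return self.crh[region % len(self.crh)] > 0

    def fill(self, addr):
        region = addr >> REGION_BITS
        self.crh[region % len(self.crh)] += 1


def miss(nodes, requester, addr):
    """Handle a cache miss; return True if a snoop broadcast was needed."""
    region = addr >> REGION_BITS
    me = nodes[requester]
    broadcast = region not in me.nsrt  # the local NSRT lookup
    if broadcast:
        shared = False
        for i, node in enumerate(nodes):
            if i == requester:
                continue
            node.nsrt.discard(region)  # the region is no longer private there
            if node.may_cache_region(region):
                shared = True
        if not shared:
            me.nsrt.add(region)  # later misses to this region skip the broadcast
    me.fill(addr)
    return broadcast
```

In this sketch a second miss by the same node to an unshared region avoids the broadcast, and the saving is lost as soon as another node touches that region.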
Accordingly, it would be desirable to provide an improved technique for reducing energy consumption when managing cache coherency in a data processing apparatus.