1. Field of the Invention
The present invention relates to data processing apparatuses which comprise a cache for storing copies of data values. More particularly, this invention relates to such caches which store tag values corresponding to stored data values and are configured to access the tag values and data values in parallel.
2. Description of the Prior Art
Providing at least one level of cache storage in which copies of data values which are used by a data processing apparatus are stored is a known performance enhancing technique in the field of data processing. Accessing a cache fundamentally comprises two actions: performing a lookup to determine whether the desired data value (typically referenced by an address) is currently stored in the cache, and accessing that desired data value if it is indeed found to be currently stored in the cache.
Depending on the role of a given cache, the lookup and the data access may be performed sequentially, that is to say a data access is only made once it is established that the desired data value is stored in the cache. Alternatively, in order to reduce the latency of the cache, a number of possible data values (for example a subset selected with reference to only a part of the corresponding address) may be read out whilst the lookup is being performed, and if the lookup confirms that one of those possible data values is the desired data value it is swiftly output. This latter parallel style of access comes at the cost of the extra energy expended in reading out several incorrect data values which are discarded.
An associative cache is configured such that a given data value may be stored in a number of locations in the cache, those locations being determined with reference to part of the address associated with that given data value (and typically referred to as the “index” portion of the address). This number of locations is often provided by subdividing the cache into a number of “ways”, wherein each way provides a possible storage location. An associative cache is also commonly configured such that an identifying portion of the address (typically referred to as the “tag” portion of the address) is stored in one of several possible locations identified by the index, whilst the data value itself is stored in a corresponding one of several possible locations identified by the index. Equivalently to the arrangement of the data storage, the number of storage locations for the tag portion is often provided as a number of ways.
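The division of an address into index and tag portions described above can be illustrated with a minimal sketch. The line size, number of sets and number of ways used here, and the function names `split_address` and `candidate_locations`, are assumptions chosen purely for illustration and are not taken from any particular cache design.

```python
# Illustrative parameters (assumptions, not taken from the text).
LINE_BYTES = 64   # bytes per cache line
NUM_SETS = 256    # number of sets selected by the index portion
NUM_WAYS = 4      # associativity: possible locations per index

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 6 bits of byte offset
INDEX_BITS = NUM_SETS.bit_length() - 1      # 8 bits of index


def split_address(addr):
    """Return (tag, index, offset) for a byte address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def candidate_locations(addr):
    """List the (way, index) pairs in which the line for addr may be stored."""
    _, index, _ = split_address(addr)
    return [(way, index) for way in range(NUM_WAYS)]
```

Under these assumed parameters a given address maps to exactly one index and therefore to one candidate location per way; the stored tag is what distinguishes which address, if any, currently occupies each of those locations.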
As mentioned above, in some caches, for example in latency-sensitive L1 caches, the tag and data access typically proceed in parallel. Under parallel access, all ways of both the tag and data array are read concurrently, and the tag match signal controls an output multiplexer which steers the correct data from the matching data way to the output. Parallel access minimizes latency at the expense of energy efficiency.
By contrast, in other caches, such as L2 or last level (LLC) caches, the associativity is often high and the sub-arrays tend to be much larger or more numerous. Hence L2 and LLC cache designs typically instead perform tag and data accesses in series. The tag arrays, which are comparatively narrow relative to the data arrays, are read first and compared to the incoming tag. Thereafter only the single data sub-array containing the desired data is accessed, avoiding the wasted energy of accessing all other ways. Decoupling tag and data accesses also facilitates handling of coherence traffic that must access only the cache tags.
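The difference between the parallel and serial access styles described above can be sketched behaviourally by counting how many data ways are read for one access to one set. This is a simplified model only: the function names are hypothetical, a set is represented as two plain lists, and timing is not modelled.

```python
def parallel_access(tags, data, req_tag):
    """Read every data way while the tags are compared, then select the match.

    Returns (value_or_None, data_ways_read).
    """
    read_out = list(data)  # all data ways are read concurrently
    for way, stored in enumerate(tags):
        if stored == req_tag:
            return read_out[way], len(read_out)
    return None, len(read_out)  # miss: every data read was wasted


def serial_access(tags, data, req_tag):
    """Compare the tags first, then read only the matching data way.

    Returns (value_or_None, data_ways_read).
    """
    for way, stored in enumerate(tags):
        if stored == req_tag:
            return data[way], 1
    return None, 0  # miss: no data way is read at all
```

In this model a hit in a 4-way set costs four data reads under parallel access and one under serial access, which reflects the energy side of the trade-off; the latency cost of waiting for the tag comparison before starting the data read is not captured.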
In order to maintain the low energy expenditure of a sequential access procedure, but to improve the access time of a sequential tag-data access cache, it is known to implement “way prediction”, wherein a separate structure is used to predict which way of an associative cache will be accessed before the access is initiated. For example, it is known to speculatively set the multiplexer paths at the output of the cache data arrays to select the most-recently-used (MRU) way before the tag comparison is complete. Other approaches have suggested using sources besides the replacement order, such as register or instruction addresses, to index a prediction table to determine which way to access first in sequential associative caches (which access cache ways in consecutive cycles). However, these techniques for way prediction can struggle to provide accurate predictions under some circumstances. For example, whilst MRU techniques (relying on temporal locality) or PC-indexed prediction tables (relying on instruction correlated locality) can be of benefit in L1 caches, such predictive techniques are typically less effective in higher level caches (i.e. in L2 or higher caches), in particular where the cache is shared between multiple processor cores. In such caches temporal locality is filtered by the L1 caches, accesses from multiple cores interleave, and instruction addresses are typically not available, all of which restricts the applicability of these prediction mechanisms.
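MRU-based way prediction of the kind described above can be sketched as a small per-set table that is consulted before the tag comparison and updated on every hit. The class `MruWayPredictor` and the function `access_with_prediction` are hypothetical names, and the model records only whether the prediction was correct, not the recovery path taken on a misprediction.

```python
class MruWayPredictor:
    """Remembers, for each set, the way that most recently hit."""

    def __init__(self, num_sets):
        self.mru = [0] * num_sets  # assumption: way 0 is predicted initially

    def predict(self, index):
        return self.mru[index]

    def update(self, index, way):
        self.mru[index] = way


def access_with_prediction(predictor, tag_sets, index, req_tag):
    """Return (hit_way_or_None, prediction_was_correct) for one access."""
    guess = predictor.predict(index)  # available before the tag comparison
    tags = tag_sets[index]
    hit_way = next((w for w, t in enumerate(tags) if t == req_tag), None)
    if hit_way is not None:
        predictor.update(index, hit_way)  # train on the way that really hit
    return hit_way, hit_way == guess
```

Repeated accesses to the same line are predicted correctly after the first hit, which is the temporal locality such a predictor relies upon. When accesses to different lines in the same set interleave, as in a shared higher level cache, the single remembered way is overwritten between reuses and the prediction accuracy falls.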
Another approach to improving the access time of a sequential tag-data access cache is partial tag matching and way filtering. According to this technique, when the cache access address is available, the low-order bits of each stored tag can be compared to an incoming tag (“partial tag matching”) to quickly rule out ways that cannot contain the requested cache block. Partial tag matching was first suggested to reduce the number of tags that must be scanned sequentially in early set-associative caches, when tag comparators were expensive. More recently, partial tag matching has been suggested in various forms as a means to reduce the energy requirements of tag comparison in sequential tag-data caches.
In the context of parallel tag-data caches, partial tag matching can be used to implement way filtering by inhibiting access to ways that mismatch. Prior art designs that use partial tag matching in this manner have targeted L1 caches, as parallel access is not typically used in later level caches such as LLCs. A particular challenge in such designs is engineering the partial tag comparison such that it has minimal impact on the data array critical path while saving as much energy as possible.
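Partial tag matching used for way filtering, as described above, can be sketched as follows. The number of low-order tag bits compared (`PARTIAL_BITS`) and the function names are assumptions made for illustration; the sketch models which data ways are enabled, not the circuit-level gating.

```python
PARTIAL_BITS = 4  # low-order tag bits compared early (illustrative assumption)
PARTIAL_MASK = (1 << PARTIAL_BITS) - 1


def filter_ways(stored_tags, req_tag):
    """Return the ways whose low-order tag bits match those of the request."""
    want = req_tag & PARTIAL_MASK
    return [w for w, t in enumerate(stored_tags) if (t & PARTIAL_MASK) == want]


def filtered_parallel_access(stored_tags, data, req_tag):
    """Read only the data ways that survive the partial tag match.

    The full tag comparison then selects the hit among the enabled ways.
    Returns (value_or_None, data_ways_read).
    """
    enabled = filter_ways(stored_tags, req_tag)
    read_out = {w: data[w] for w in enabled}  # mismatching ways are never read
    for w in enabled:
        if stored_tags[w] == req_tag:
            return read_out[w], len(enabled)
    return None, len(enabled)
```

A partial match can wrongly enable a way whose full tag differs, but it can never disable the way that holds the requested line, so filtering does not affect correctness. Comparing more bits rules out more ways at the cost of a wider and slower comparison.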
R. Min, Z. Xu, Y. Hu, and W.-B. Jone in “Partial Tag Comparison: A New Technology for Power-Efficient Set-Associative Cache Designs”, in Proceedings of the 17th International Conference on VLSI Design, 2004 propose using a partial tag match to gate sense amplification and bit line multiplexing for those ways which can be ruled out. This leaves most of the data array access time available for the comparison, but cannot save the energy consumed in driving the word and bit lines. C. Zhang, F. Vahid, J. Yang, and W. Najjar in “A way-halting cache for low-energy high-performance systems”, ACM Transactions on Architecture and Code Optimization, 2(1), 2005 propose conserving nearly all of the data array access energy when accessing a relatively small (8 kB) low associativity (4-way) L1 cache by performing the partial tag comparison in parallel with wordline decode, and then gating wordline activation in those ways which can be ruled out.
However, whilst partial tag matching and way filtering can improve the speed of a cache access, known forms of these techniques also have their drawbacks, since they may struggle in some cache contexts. For example, in higher level caches where associativity may be greater, the sub-arrays may be larger, and the locality is reduced (such as in a server-class LLC), implementing these known partial tag matching and way filtering techniques implies the need for a more complex (e.g. wider or MRU-ordered) partial-tag comparison, which can easily compromise the desired energy efficiency and timeliness.
Accordingly it would be desirable to provide an improved cache which is able to operate in a greater range of operational contexts whilst maintaining the energy efficiency of known way filtering and way prediction techniques, without significantly affecting the latency.