This invention relates to cache logic for generating a cache address, to a cache memory system and to a method for generating a cache address.
Cache memories are widely used as part of a memory hierarchy to increase the performance of data processing systems by reducing the latency associated with main memory accesses and/or the bandwidth consumed by those accesses. As described in Hennessy and Patterson's “Computer Architecture: A Quantitative Approach (Fifth Edition)”, Section 2.1, the principle of locality, both temporal and spatial, means that storing some subset of the previously accessed (e.g. the most recently accessed) data in a relatively small cache memory, which is capable of being accessed with a lower latency and/or higher bandwidth than the main memory (in which the original data is stored), provides a cost-effective means of achieving faster performance of the overall memory system.
In order to ensure that cached data can be readily retrieved from the cache memory, the cached data must be organised according to a defined scheme. Following the nomenclature of Hennessy and Patterson, we define ‘lines’ (AKA ‘blocks’) to be a predefined number of one or more words of memory on aligned boundaries. Hennessy and Patterson note that the most popular scheme is “set associative”, whereby a “set” is a group of physical storage lines in the cache. A line in memory is first mapped onto a set and then that line can be placed in any one of that set's lines. Further, in Hennessy and Patterson, the set is chosen with the trivial (for powers of 2) mapping:
Chosen_set = (Line_Address) MOD (Number_of_sets_in_cache).
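By way of illustration only, the modulo mapping described above may be sketched as follows. The function names chosen_set and chosen_set_pow2 are hypothetical and are not taken from Hennessy and Patterson; the second function shows why the mapping is described as trivial for powers of 2, since the modulo then reduces to keeping the least significant bits of the line address.

```python
def chosen_set(line_address: int, num_sets: int) -> int:
    """Map a line address to a set index with the simple modulo scheme."""
    if num_sets <= 0:
        raise ValueError("num_sets must be positive")
    return line_address % num_sets


def chosen_set_pow2(line_address: int, num_sets: int) -> int:
    """Same mapping, assuming num_sets is a power of two.

    The modulo is then equivalent to masking off the low-order bits,
    which needs no arithmetic hardware at all.
    """
    if num_sets <= 0 or num_sets & (num_sets - 1) != 0:
        raise ValueError("num_sets must be a power of two")
    return line_address & (num_sets - 1)


# Line address 13 in a cache with 8 sets lands in set 5 (13 = 8 + 5).
example_set = chosen_set(13, 8)
```

For a power-of-two number of sets, both functions return the same set index for every non-negative line address.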
When there are n lines in a set, the cache scheme is referred to as ‘n-way set associative’. The logical extremes have particular names: the case where n=1 is referred to as a “direct-mapped” cache. This is the least expensive to implement but has relatively low performance. (Some factors that determine performance will be discussed shortly). The other extreme, in which there is a single set containing all the storage lines of the cache and thus any memory line may be placed anywhere in the cache, is called a “fully associative cache”. This has the best performance but is typically very expensive to implement.
Although some systems do use direct-mapped caches, more typically, n is chosen to be, say, in the range [2,16], providing a balance between performance and cost of implementation.
With regard to the performance of a cache, one influencing factor is the ‘miss rate’, which is, trivially, the number of accesses to the cache for data that are not present in the cache divided by the total number of accesses. Hennessy and Patterson categorise ‘misses’ into three classes: Compulsory, Capacity and Conflict. Compulsory misses are unavoidable and correspond to the first ever access of any line. Capacity misses typically depend on the physical size of the cache, while conflict misses are due to the replacements within a set and depend on the associativity of the sets and, in a sense, the data addresses being accessed.
Typically the data for caching is held at a resource identified by an address, such as a memory address, or a hardware resource address in a computing system. The access of such resource addresses can follow certain fixed patterns (such as accesses of sequential memory blocks or of strided accesses of 2D data structures), and care may therefore be taken to ensure that data representing related (e.g. sequential or strided) resource addresses are well-distributed over the cache memory so as to avoid pathological access cases. A sufficiently long sequence of addresses that maps to only a few cache sets is an example of a pathological access case, resulting in a high conflict miss rate since repeatedly accessing the same few cache sets may lead to a significant number of cache line replacements. Even though the compulsory or capacity miss rates may be low, such poor distribution characteristics severely affect the performance of a cache memory. Although implementing a higher cache associativity can reduce the conflict misses, it comes at a power and area cost.
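By way of illustration only, the pathological case described above may be reproduced with a small model of a set associative cache. The sketch assumes the simple modulo set mapping and an LRU replacement policy within each set; the function name count_misses and the chosen cache dimensions (64 sets, 4 ways) are hypothetical values picked for the example.

```python
from collections import OrderedDict


def count_misses(line_addresses, num_sets: int, ways: int) -> int:
    """Count misses in a model set associative cache.

    Assumptions of this sketch: set index = line address MOD num_sets,
    and LRU replacement within each set.
    """
    # One ordered dict per set; insertion order tracks recency of use.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for line in line_addresses:
        cache_set = sets[line % num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)  # hit: mark as most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[line] = True
    return misses


# Eight distinct lines, each accessed ten times, in a 64-set, 4-way cache.
sequential = list(range(8)) * 10             # lines 0..7: eight different sets
strided = list(range(0, 8 * 64, 64)) * 10    # stride of 64 lines: all in set 0

sequential_misses = count_misses(sequential, num_sets=64, ways=4)
strided_misses = count_misses(strided, num_sets=64, ways=4)
```

In this model the sequential pattern incurs only its 8 compulsory misses, whereas the strided pattern misses on all 80 accesses, even though the 8 lines would fit comfortably in the 256-line cache.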
It should be noted that, since the mapping of memory block addresses to cache sets is a “many to one” relationship, when block data is resident in the cache, it is necessary to store additional data, sometimes referred to as a tag, to identify the original memory location of the block. It is clearly desirable to use as few bits as possible in the tags. With the simple “Chosen Set” mapping scheme described above by Hennessy and Patterson, given a system in which there are 2^S sets in the cache and in which the memory has 2^(T+S) lines of data, the tag requires T bits of data. In this scheme it is thus unnecessary to store the full address of the line in the tag, as the S least significant bits can be inferred from the line's physical location in the cache.
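By way of illustration only, the relationship between the line address, the set index and the tag under the simple modulo scheme may be sketched as follows. The function names and the example widths (S = 6 set bits, T = 10 tag bits) are hypothetical.

```python
def split_line_address(line_address: int, set_bits: int):
    """Split a line address into (tag, set_index) for the modulo scheme.

    The set index is the low set_bits bits; the tag is what remains.
    """
    set_index = line_address & ((1 << set_bits) - 1)
    tag = line_address >> set_bits
    return tag, set_index


def rebuild_line_address(tag: int, set_index: int, set_bits: int) -> int:
    """Recover the full line address from the stored tag and the set index."""
    return (tag << set_bits) | set_index


# Example: S = 6 (64 sets) and T = 10, so a line address has 16 bits.
S, T = 6, 10
line = (0b1011001110 << S) | 0b010101
tag, set_index = split_line_address(line, S)
recovered = rebuild_line_address(tag, set_index, S)
```

Only the T-bit tag needs to be stored, because the set index is implied by where the line resides in the cache and the full line address can be recovered from the two.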
Hennessy and Patterson also discuss methods of increasing the cache bandwidth through the use of multi-banked caches. The proposal there is to split the cache into 2^k banks (k being in the range, say, [1, 3]) and to assign line address j to bank (j mod 2^k). Hennessy and Patterson state that this simple mapping “works well”, but it can easily suffer from pathological access patterns in the same manner as the high conflict miss rate described previously.
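By way of illustration only, the simple bank mapping and its sensitivity to the access pattern may be sketched as follows. The function name bank_histogram and the choice of k = 2 (four banks) are hypothetical.

```python
def bank_histogram(line_addresses, k: int):
    """Count how many accesses fall in each of 2**k banks.

    Assumes the simple mapping: bank = line address mod 2**k.
    """
    num_banks = 1 << k
    counts = [0] * num_banks
    for j in line_addresses:
        counts[j % num_banks] += 1
    return counts


# Sixteen sequential lines spread evenly across four banks ...
sequential_counts = bank_histogram(range(16), k=2)
# ... whereas sixteen lines at a stride of four all land in one bank.
strided_counts = bank_histogram(range(0, 64, 4), k=2)
```

With a stride equal to the number of banks, every access is directed to the same bank, so the additional banks provide no bandwidth benefit for that pattern.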
Several mechanisms have been proposed in the art for improving the distribution of data over a cache and maximising the cache hit rate for a given size of cache. These mechanisms typically use complex hash functions to inject pseudorandom variation into the mapping of data onto cache addresses. For example, M. Schlansker et al. describe the use of a complex hash function to randomise the placement of data in a cache in their paper “Randomization and Associativity in the Design of Placement-Insensitive Caches”, Computer Systems Laboratory, HPL-93-41, June 1993. They propose the mapping:
Chosen_set = ((LineAddress * LineAddress * 174773) / 2^21) MOD (Number_of_sets_in_cache).
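By way of illustration only, the mapping above may be sketched as follows. The sketch takes the divisor to be 2^21 and the division to be an integer (truncating) division, both of which are this sketch's reading of the formula rather than details confirmed here; the function name hashed_set is hypothetical. Python integers are unbounded, whereas a hardware implementation would need a multiplier of fixed width.

```python
def hashed_set(line_address: int, num_sets: int) -> int:
    """Multiplicative hash mapping, as read from the formula above.

    Assumptions of this sketch: the divisor is 2**21 and the division
    truncates, so it is implemented as a right shift by 21 bits.
    """
    if num_sets <= 0:
        raise ValueError("num_sets must be positive")
    return ((line_address * line_address * 174773) >> 21) % num_sets


# Adjacent lines 0 and 1 both map to set 0 under this hash, so the set
# index no longer reveals the low-order bits of the line address.
set_for_line_0 = hashed_set(0, 64)
set_for_line_1 = hashed_set(1, 64)
set_for_line_1024 = hashed_set(1024, 64)
```

The mapping requires two multiplications per lookup, and because the set index does not correspond to any fixed field of the line address, the sketch offers no way to recover address bits from the set index alone.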
Although this does produce a good distribution when mapping addresses to sets, it suffers from two major drawbacks. The first is that such complex hash functions are generally too expensive in terms of silicon area and latency to be implemented in the critical path of a high speed cache memory system. The second drawback is that the cache tag store needs to keep the entire line address (and not just a subset thereof) as there appears to be no practical method to ‘reverse’ the hash mapping. The requirement to store a larger tag thus adds considerably to the cost of the cache.
Other mechanisms for improving cache performance by reducing conflict misses include using a skewed-associative cache architecture. Such architectures are described in “A case for two-way skewed-associative caches”, A. Seznec, Proceedings of the 20th International Symposium on Computer Architecture, San Diego, May 1993 and in “Trade-offs for Skewed-Associative Caches”, H. Vandierendonck and K. De Bosschere (a paper published by Dept. of Electronics and Information Systems, Ghent University). Skewed-associative architectures require multiple cache banks because the mechanism achieves low miss rates through inter-bank dispersion of cache addresses. However, as is observed in the Vandierendonck paper, high performance is achieved by combining a skewed-associative architecture with complex hash functions to inject random character into the mapping of data into the cache. Again, such complex hash functions are generally too expensive to implement in the critical path of a high speed cache memory system.
Replacement policies in cache memories determine which item(s) to discard when a cache line is full. An example is a least recently used (LRU) replacement policy, according to which the least recently used item is discarded from a cache line. A further limitation of a skewed-associative cache is the constrained set of implementable replacement policies. For example, Seznec states “Unfortunately, we have not been able to find concise information to associate with a cache line which would allow a simple hardware implementation of a LRU replacement policy on a skewed-associative cache.”, though he does propose a two-way pseudo-LRU scheme that “may work fine”. Further examples of possible replacement policies are discussed in Hennessy and Patterson.
Complex hash functions may be replaced with simple, low-complexity functions for distributing resource addresses over a cache memory. However, simple functions do not normally provide the desired distribution characteristics which avoid pathological access cases.