1. Field of the Invention
The present invention is directed to microprocessor architectures. More particularly, the invention is directed to TLBs and cache memories for speeding processor access to main memory in microprocessor systems. Even more particularly, the invention is directed to methods and apparatuses for implementing novel refill policies for multi-way set associative caches and TLBs.
2. Background of the Related Art
Caches and Translation Lookaside Buffers (TLBs) are ubiquitous in microprocessor design. For general information on such microprocessor structures, see J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach (1996), Chapter 5.
Generally, the speed at which a microprocessor (e.g. a CPU) operates depends on the rate at which instructions and operands are transferred between memory and the CPU. As shown in FIG. 1, a cache 110 is a relatively small random access memory (RAM) used to store a copy of memory data in anticipation of future use by the CPU 120. Typically, the cache 110 is positioned between the CPU 120 and the main memory 130 as shown in FIG. 1, to intercept calls from the CPU 120 to the main memory 130. When the data is needed, it can quickly be retrieved from the cache 110, rather than obtaining it from the slow main memory 130.
A cache may be implemented by one or more RAM integrated circuits. For very high speed caches, the RAM is usually an integral part of the CPU chip. The data stored in a cache can be transferred to the CPU in substantially less time than data stored in main memory.
A translation look-aside buffer (TLB) 140 is a special form of cache that is used to store portions of a page table (which may or may not be stored in main memory 130). As is known, the page table translates virtual page numbers into physical page numbers. TLB 140 is typically organized to hold only a single entry per tag (each TLB entry comprising, for example, a physical page number, permissions for access, etc.). In contrast, cache 110 is typically organized into a plurality of blocks, wherein each block has a corresponding tag and stores a copy of one or more contiguously addressable bytes of memory data.
In order to access data in the cache 110, the virtual memory address is broken down into a cache address as shown in FIG. 2. The portion of the cache address including the most significant bits of the memory address is called the tag 240, and the portion including the least significant bits is called the cache index 250. The cache index 250 corresponds to the address of the block storing a copy of the referenced data, and additional bits (i.e. offset 260) are usually used to address the bytes within a block, if each block has more than one byte of data. The tag 240 is used to uniquely identify blocks having different memory addresses but the same cache index 250. Therefore, the cache 110 typically includes a data store and a tag store. The data store is used for storing the blocks 270 of data. The tag store, sometimes known as the directory, is used for storing the tags 240 corresponding to each of the blocks 270 of data. Both the data store and the tag store are accessed by the cache index 250. The output of the data store is a block 270 of data, and the output of the tag store is a tag 240.
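The address breakdown described above can be sketched as follows. This is a minimal illustration only; the function name split_address and the field widths (5 offset bits, 7 index bits) are assumptions chosen for the example and do not come from FIG. 2.

```python
def split_address(addr: int, offset_bits: int, index_bits: int):
    """Split a memory address into (tag, cache index, offset).

    The field widths are hypothetical parameters: the least significant
    bits form the offset within a block, the next bits form the cache
    index, and the remaining most significant bits form the tag.
    """
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# Example: 32-byte blocks (5 offset bits) and 128 blocks (7 index bits).
tag, index, offset = split_address(0x12345678, offset_bits=5, index_bits=7)
```

With these assumed widths, the address 0x12345678 yields tag 0x12345, cache index 51, and offset 24.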
There are different types of caches, ranging from direct-mapped caches, where a block can appear in only one place in the cache 110, to fully-associative caches where a block can appear in any place in the cache 110. In between these extremes is another type of cache called a multi-Way set-associative cache wherein two or more concurrently addressable RAMs can cache a plurality of blocks 270 and tags 240 for a single cache index 250. That is, in a conventional N-Way set-associative cache, the single cache index 250 is used to concurrently access a plurality N of blocks 270 and tags 240 in a set of N RAMs. The number of RAMs in the set indicates the Way number of the cache. For example, if the cache index 250 is used to concurrently address data and tags 240 stored in two RAMs, the cache is a two-Way set-associative cache.
As shown in FIG. 2, during the operation of a single-index multi-Way set-associative cache, a memory access by the CPU causes each of the RAMs 1 to N to be examined at the corresponding cache index location. The tag is used to distinguish the cache blocks having the same cache index but different memory addresses. If a tag comparison indicates that the desired data are stored in a cache block of one of the RAMs, that RAM is selected and the desired access is completed. It should be noted that caches are generally indexed with a virtual address and tagged with a physical address.
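A software model of this lookup might look like the sketch below. The class and method names are hypothetical, and the loop examines the ways one after another, whereas the hardware described above examines all N RAMs concurrently.

```python
class SetAssociativeCache:
    """Toy model of a single-index N-Way set-associative cache."""

    def __init__(self, num_ways: int, num_sets: int):
        # One tag store and one data store per way (per RAM).
        self.tags = [[None] * num_sets for _ in range(num_ways)]
        self.blocks = [[None] * num_sets for _ in range(num_ways)]

    def fill(self, way: int, index: int, tag: int, block) -> None:
        """Store a block and its tag in the given way at the given index."""
        self.tags[way][index] = tag
        self.blocks[way][index] = block

    def lookup(self, index: int, tag: int):
        """Return (way, block) on a hit, or None on a miss."""
        for way, way_tags in enumerate(self.tags):
            # Every way is examined at the same cache index; the tag
            # distinguishes blocks that share that index.
            if way_tags[index] == tag:
                return way, self.blocks[way][index]
        return None


cache = SetAssociativeCache(num_ways=2, num_sets=4)
cache.fill(way=0, index=3, tag=0x1A, block=b"first")
cache.fill(way=1, index=3, tag=0x2B, block=b"second")
hit = cache.lookup(index=3, tag=0x2B)    # hit in way 1
miss = cache.lookup(index=3, tag=0x3C)   # no way holds this tag
```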
A multi-Way set-associative cache provides the advantage that there are two or more possible locations for storing data in blocks having the same cache index. This arrangement reduces thrashing due to hot spots in memory and increases the operating speed of the computer system if the hot spots are uniformly distributed over the blocks of RAM.
As further shown in FIG. 2, simultaneously with an access to cache 110, an access to TLB 140 can be made to translate the virtual address into a physical address. It should be noted that, although FIG. 2 shows the virtual page number comprising the same bits as tag 240 and index 250 combined, this is not necessary, and in fact the bit ranges for the different fields may be different. It should be further noted that the page offset and the offset 260 may also comprise different bit ranges.
Although not shown in detail in FIG. 2, TLBs can also be implemented using a range from direct-mapped to fully associative types of caches. In particular, the TLBs that implement the Xtensa MMU from Tensilica, Inc. (see co-pending application Ser. No. 10/213,370; and the Xtensa ISA) are set-associative memories that cache entries from the page table. These caches are implemented with logic synthesis of standard cells and can make use of heterogeneous ways (i.e. different ways may have different sizes). As described in the co-pending application, the Xtensa MMU includes a feature called Variable Page Sizes. Two mechanisms support this feature. First, at configuration time, each way can be configured to support several different page sizes, and hardware is generated to support all of the page sizes configured. Second, at run time, the operating system programs each way with the single page size it is translating at any given time. In one example implementation, a special runtime configuration register is provided that allows each way to be programmed by the operating system to perform translations for a certain page size.
Due to this novel feature, different access patterns arise because the ways have different numbers of indices, the ways are translating different page sizes, or both. For example, assume there is a way that has four entries and can support 4 kB or 4 MB pages. If it is programmed to translate 4 kB pages, then the index would be VirtAddr[13:12]. If it were programmed to translate 4 MB pages, the index would be VirtAddr[23:22]. Now, assume there are four of these ways. At any given time, some of them may be programmed to translate 4 kB pages, and others may be programmed to translate 4 MB pages.
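The index selection in this example can be illustrated with the short sketch below. The function name way_index and the sample address are assumptions made for illustration; the sketch also assumes that page sizes and entry counts are powers of two.

```python
def way_index(virt_addr: int, page_size_bytes: int, num_entries: int) -> int:
    """Index into one way, given the page size it is programmed to translate.

    Assumes page_size_bytes and num_entries are powers of two. The index
    bits sit immediately above the page offset bits.
    """
    page_shift = page_size_bytes.bit_length() - 1   # 12 for 4 kB, 22 for 4 MB
    return (virt_addr >> page_shift) & (num_entries - 1)


KB4 = 4 * 1024
MB4 = 4 * 1024 * 1024

virt_addr = 0x00C01000                      # hypothetical virtual address
idx_4k = way_index(virt_addr, KB4, 4)       # uses VirtAddr[13:12]
idx_4m = way_index(virt_addr, MB4, 4)       # uses VirtAddr[23:22]

# Four identical four-entry ways, two programmed for each page size:
programmed_sizes = [KB4, KB4, MB4, MB4]
indices = [way_index(virt_addr, size, 4) for size in programmed_sizes]
```

For the sample address, the ways programmed for 4 kB pages read entry 1 while the ways programmed for 4 MB pages read entry 3, so a single access no longer touches the same index in every way.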
In case of a miss (in cache 110, TLB 140, or both), a determination is made to select one of the blocks/entries for replacement. Methods of implementing a replacement strategy for data in a cache are known in cache design. Typically, the replacement of cache entries is done in a least recently used (LRU) manner, in which the least recently used block is replaced. A more flexible strategy is not most recently used (NMRU), which chooses a block for replacement from among all those that are not the most recently used. Blocks may also be selected at random for replacement. Other possible strategies include pseudo-LRU (an approximation of true LRU that is more easily implemented in hardware); Least Recently Filled (LRF); and a clock algorithm used by software for managing replacements of pages in a page table.
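The difference between LRU and NMRU victim selection for a single set can be sketched as follows. The function names and the use of per-way timestamps are assumptions for illustration; hardware implementations typically encode this state far more compactly.

```python
import random


def lru_victim(last_used):
    """LRU: pick the way whose last access is oldest.

    last_used[w] is a hypothetical timestamp of the most recent access
    to way w within one set; smaller means older.
    """
    return min(range(len(last_used)), key=lambda way: last_used[way])


def nmru_victim(mru_way, num_ways, rng=random):
    """NMRU: pick any way except the most recently used one."""
    candidates = [way for way in range(num_ways) if way != mru_way]
    return rng.choice(candidates)


timestamps = [5, 2, 9, 7]                  # way 1 was touched longest ago
victim_lru = lru_victim(timestamps)        # selects way 1
victim_nmru = nmru_victim(mru_way=2, num_ways=4)   # any way other than 2
```

LRU needs ordering information for every way in the set, whereas NMRU only needs to remember which way was used most recently.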
Thus, when a set-associative cache or TLB "misses," it needs to be refilled from memory. The data retrieved from memory will be stored in an entry chosen from the ways. The replacement algorithm (e.g. LRU, NMRU, LRF, etc.) is used to decide exactly which way's entry will get replaced. The replacement algorithm can have adverse effects on processor performance by making bad choices for replacement. This affects the cache's "hit rate." For instance, replacing data which will be used soon is worse than replacing data that will not be used again, because the first choice would cause another "miss," whereas the second choice would not. Further, when the TLB is refilled from the page table, the replacement policy should take care to place the page table entry (PTE) in an appropriate way (i.e., inspect the associated configuration register and place the PTE in one of the ways, if any, that has been programmed to translate its page size).
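The page-size check on refill can be sketched as a simple filter over the ways. The list way_page_sizes stands in for the contents of the runtime configuration register; its layout and the function name are assumptions, not the actual register format.

```python
def eligible_ways(way_page_sizes, pte_page_size):
    """Return the ways currently programmed to translate the PTE's page size.

    way_page_sizes[w] is the page size (in bytes) that way w has been
    programmed to translate. An empty result means no way can hold the PTE.
    """
    return [way for way, size in enumerate(way_page_sizes)
            if size == pte_page_size]


KB4 = 4 * 1024
MB4 = 4 * 1024 * 1024

config = [KB4, KB4, MB4, MB4]               # hypothetical programming of 4 ways
ways_for_4m = eligible_ways(config, MB4)    # only ways 2 and 3 qualify
ways_for_16k = eligible_ways(config, 16 * 1024)   # no way qualifies
```

A replacement algorithm would then choose its victim only from the eligible ways, which is one reason the set of candidates can differ from one refill to the next.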
Although set-associative memories with heterogeneous ways can provide value over traditional set-associative memories with homogeneous ways, replacement algorithms that work on a "set" basis either no longer work, are inefficient, or are difficult to implement. The primary cause is the ever-changing nature of access patterns in the set-associative memory, as mentioned above. Consider the TLB presented in the above example vs. a homogeneous 4 way 4 entry TLB. The TLB in the previous example has 8 unique access patterns, whereas the TLB with homogeneous ways does not, since it always reads out the entry from the same index in each way.
In particular, current replacement algorithms are ill suited to, or inefficient at, one or more of the following: (1) handling heterogeneous ways whose indexing can even change at run time (i.e., is the way configured to translate 4 kB or 4 MB pages); (2) handling way replacement criteria (i.e., is the way configured to translate 4 kB or 4 MB pages); and (3) handling associative structures that do not have 2**N ways.
For example, algorithms such as NMRU, LRF, and pseudo-LRU are usually implemented for homogeneous set-associative structures, with only one piece of replacement information being stored per set. This replacement information could be log2(number of ways) bits for an LRF algorithm that just stores a pointer to the last way that was filled. It is difficult to modify this basic premise (of replacement information kept on a set basis) to cover the irregular nature of set-associative structures such as the TLBs presented earlier, which have different numbers of entries per way, different indexing of each way (at run time), and different considerations for replacement. Most LRU implementations have similar issues, since they implement state that tracks LRU on a set basis.
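The per-set state of such an LRF scheme can be sketched as follows. The class name is hypothetical, and the choice to refill the way immediately after the last-filled way (round-robin order) is an assumption; the text above only states that a pointer to the last filled way is stored per set.

```python
class LRFState:
    """Per-set 'last way filled' pointers for a homogeneous structure."""

    def __init__(self, num_ways: int, num_sets: int):
        self.num_ways = num_ways
        # One pointer per set; starting at the last way makes the
        # first fill of every set land in way 0.
        self.last_filled = [num_ways - 1] * num_sets

    def bits_per_set(self) -> int:
        """Storage per set: ceil(log2(number of ways)) bits."""
        return (self.num_ways - 1).bit_length()

    def next_victim(self, index: int) -> int:
        """Choose the way after the last one filled, and record it."""
        way = (self.last_filled[index] + 1) % self.num_ways
        self.last_filled[index] = way
        return way


state = LRFState(num_ways=4, num_sets=4)
fills = [state.next_victim(index=2) for _ in range(5)]   # cycles through the ways
```

The sketch relies on every way sharing the same index, which is exactly the assumption that breaks down when ways have different sizes or are indexed by different address bits.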
A set-associative structure replacement algorithm is particularly beneficial for irregular set-associative structures, which may be affected by different access patterns and by different associativities being available for replacement on any given access. According to certain aspects, the present invention includes methods and apparatuses that implement a novel decay replacement algorithm that is particularly beneficial for irregular set-associative structures. An embodiment of the present invention includes set-associative structures having decay information stored therein, as well as update/replacement logic to implement replacement algorithms for translation lookaside buffers (TLBs) and caches that vary in the number of associativities; have unbalanced associativity sizes, e.g., associativities can have different numbers of indices; and can have varying replacement criteria. The implementation provides good performance, on the level of LRU, random and clock algorithms, and is efficient and scalable.