1. Field
This disclosure relates generally to improving processor performance and, more specifically, to techniques for utilizing translation lookaside buffer entry numbers to improve processor performance.
2. Related Art
Content-addressable memory (CAM) is a special type of memory used in certain high-speed searching applications. Unlike random access memory (RAM), in which a memory address is supplied to a RAM and the RAM returns a data word stored at the address, a CAM is designed such that a data word is supplied to the CAM and the CAM searches its entire memory to see if the supplied data word is stored anywhere in the CAM. If the data word is found, the CAM returns one or more storage addresses at which the data word was found (and, in some designs, data associated with the matching entry). Because a CAM is designed to search its entire memory in a single operation, it is usually faster than RAM in search applications. There are, however, cost disadvantages to CAM.
Unlike RAM, which has simple storage cells, each individual memory bit in a fully parallel CAM has its own associated comparison circuit to detect a match between stored bits and input bits. Additionally, match outputs from each cell in a stored word must be combined to yield a complete word match signal. The additional circuitry increases the physical size of the CAM chip, which increases manufacturing cost. The extra CAM circuitry also increases power dissipation, since every comparison circuit is active on every access. Consequently, CAM is usually only used in specialized applications where the required search speed cannot be achieved using a less costly method.
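By way of illustration only, the difference between a RAM access and a CAM search may be sketched in software as follows. The class names `Ram` and `Cam`, the `width` parameter, and the sequential loop are assumptions made for the sketch; an actual CAM performs all comparisons in parallel in hardware.

```python
from typing import List


class Ram:
    """Address in, data word out."""

    def __init__(self, words: List[int]) -> None:
        self.words = list(words)

    def read(self, address: int) -> int:
        return self.words[address]


class Cam:
    """Data word in, matching storage addresses out (hypothetical model)."""

    def __init__(self, words: List[int], width: int) -> None:
        self.words = list(words)
        self.width = width  # number of bits compared per stored word

    def search(self, key: int) -> List[int]:
        matches = []
        # Hardware compares every stored word at once; the loop only models that.
        for address, stored in enumerate(self.words):
            # One comparison per bit cell, as in a fully parallel CAM.
            bit_matches = [
                ((stored >> bit) & 1) == ((key >> bit) & 1)
                for bit in range(self.width)
            ]
            # The per-bit match outputs are combined into one word match signal.
            if all(bit_matches):
                matches.append(address)
        return matches
```

In this sketch, every stored word costs `width` bit comparisons on every search, which loosely mirrors the area and power cost attributed to CAM above.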
A translation lookaside buffer (TLB) may take the form of a cache memory (cache) that memory management hardware may use to improve virtual address translation speed. In general, processors use a TLB to map virtual addresses to physical addresses. A TLB may be implemented as a CAM. In this case, a TLB search key corresponds to a virtual address and the search result is a physical address. If the requested address is present in the TLB (referred to as a TLB hit), the CAM search yields a physical address that may be used to access memory. If the requested address is not in the TLB (referred to as a TLB miss), a process called a page walk may be initiated to determine the physical address. In a typical situation, after the physical address is determined by the page walk, the new virtual-to-physical address mapping is entered into the TLB. In general, a TLB has a fixed number of slots that contain page table entries, which map virtual addresses to physical addresses.
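The hit, miss, and refill sequence described above may be modeled, purely as a sketch, as follows. The `Tlb` class, the 4 kB page size, the dictionary standing in for the page walk, and the least-recently-used eviction are all assumptions of the sketch; the description above does not prescribe a replacement policy.

```python
from collections import OrderedDict
from typing import Dict

PAGE_SHIFT = 12  # assumed 4 kB pages


class Tlb:
    """Fixed number of slots mapping virtual page numbers to physical frames."""

    def __init__(self, num_slots: int, page_table: Dict[int, int]) -> None:
        self.num_slots = num_slots
        self.page_table = page_table  # stands in for the page walk
        self.entries: "OrderedDict[int, int]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr: int) -> int:
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        if vpn in self.entries:
            # TLB hit: the search yields the physical frame directly.
            self.hits += 1
            self.entries.move_to_end(vpn)
        else:
            # TLB miss: perform the (modeled) page walk, then enter the mapping.
            self.misses += 1
            frame = self.page_table[vpn]
            if len(self.entries) >= self.num_slots:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[vpn] = frame
        return (self.entries[vpn] << PAGE_SHIFT) | offset
```

For example, with a two-slot `Tlb`, a third distinct page forces an eviction, so a later access to the evicted page misses again and repeats the page walk.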
A TLB may reside between a processor and cache, between a cache and main memory, between levels of a multi-level cache, or in another location within a processor (e.g., at an input of a store buffer). The TLB placement determines whether a cache uses physical or virtual addressing. If a cache is virtually addressed, requests are sent directly from the processor to the cache and the TLB is accessed only on a cache miss. If the cache is physically addressed, a TLB look-up is performed for every memory operation and the resulting physical address is sent to the cache. Physical addresses are used in both tag arrays for caches and store buffers.
Sequential consistency of out-of-order execution of load and store instructions in a software program requires physical address checking between load and store instructions for proper ordering of data. Increasing pipeline depth and the number of parallel dispatched instructions usually requires increasing a size of an associated store buffer. As memory space increases, address size increases and the number of comparators in a conventional load/store buffer also increases. Store-to-load data forwarding is often required for high-performance processor designs. In general, as memory space has increased, the time required for address comparison for store-to-load data forwarding has increased and associated power dissipation has increased. In most processor designs, store-to-load data forwarding has been a speed path, as the time for load virtual address translation to a load physical address (which is then compared to store physical addresses in a store buffer) increases with increasing memory space. As is known, when a load address matches a store address in a store buffer, data associated with the matched store address is forwarded from the store buffer to the load instruction.
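As a simplified sketch of the store-to-load forwarding described above, a load's physical address may be compared against every buffered store address. The `StoreBuffer` class is hypothetical, addresses are assumed to be word-aligned with full-width matching, and selecting the youngest matching store is an assumption of the sketch rather than something stated above.

```python
from typing import List, Optional, Tuple


class StoreBuffer:
    """Pending stores, oldest first, as (physical address, data) pairs."""

    def __init__(self) -> None:
        self.entries: List[Tuple[int, int]] = []

    def add_store(self, paddr: int, data: int) -> None:
        self.entries.append((paddr, data))

    def forward(self, load_paddr: int) -> Optional[int]:
        # Hardware uses one comparator per entry; here the entries are scanned
        # from youngest to oldest so the most recent matching store wins.
        for paddr, data in reversed(self.entries):
            if paddr == load_paddr:
                return data
        return None  # no match: the load must read the cache or memory
```

In this model the load's physical address must be available before any comparison can start, which is why address translation sits on the forwarding path discussed above.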
In a Harvard architecture (or a hybrid thereof), a separate virtual address space or memory access hardware may exist for instructions and data. This can lead to distinct TLBs for each access type. A common optimization for physically addressed caches is to perform a TLB look-up in parallel with a cache access. Low-order bits of any virtual address (e.g., the lower 12 bits of the virtual address in a memory system having 4 kilobyte (kB) pages) represent the offset of the desired address within a page and, as such, do not change in the virtual-to-physical address translation. During a cache access, two steps are performed, i.e., an index is used to find a cache entry and the tags for the cache line are compared. If the cache is structured in such a way that the cache can be indexed using only the bits that do not change in translation, the cache can perform its “index” operation while the TLB translates the upper bits of the address. The translated address from the TLB is then passed to the cache. The cache then performs a tag comparison to determine if the access was a hit or miss.
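The parallel look-up relies on the cache index falling entirely within the page offset. A minimal sketch follows, assuming 4 kB pages, 64-byte cache lines, and 64 sets; the constants and function names are illustrative only.

```python
PAGE_OFFSET_BITS = 12  # 4 kB pages
LINE_OFFSET_BITS = 6   # assumed 64-byte cache lines
INDEX_BITS = 6         # assumed 64 sets

# The index must lie within the untranslated page-offset bits.
assert LINE_OFFSET_BITS + INDEX_BITS <= PAGE_OFFSET_BITS


def cache_index(addr: int) -> int:
    """Set index, taken from bits that do not change in translation."""
    return (addr >> LINE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)


def cache_tag(paddr: int) -> int:
    """Tag, taken from the translated (physical) upper bits."""
    return paddr >> (LINE_OFFSET_BITS + INDEX_BITS)


def translate(vaddr: int, frame: int) -> int:
    """Replace the virtual page number with a physical frame number."""
    return (frame << PAGE_OFFSET_BITS) | (vaddr & ((1 << PAGE_OFFSET_BITS) - 1))
```

Because `cache_index` gives the same result for a virtual address and its translation, the set can be selected while the TLB is still producing the frame number needed by `cache_tag`.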
Cache entries usually include a data block, a tag, and flag bits. The data block includes the data fetched from main memory and the tag includes a portion of the address of the data fetched from the main memory. An instruction cache requires only one flag bit (i.e., a valid bit) per cache row entry. The valid bit indicates whether or not a cache block has been loaded with valid data. A data cache typically requires at least two flag bits (i.e., a valid bit and a dirty bit) per cache row entry. The dirty bit indicates whether the cache block is unchanged since it was read from main memory (i.e., whether the cache block is “clean”) or whether the processor has written data to the cache block and the new value has not yet made it all the way to main memory (i.e., whether the cache block is “dirty”).
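The fields of a data cache entry may be sketched as follows. The `CacheLine` class and its method names are hypothetical, the data block is reduced to a single integer for brevity, and a write-back cache is assumed.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    tag: int = 0
    data: int = 0
    valid: bool = False  # set once the block holds data loaded from memory
    dirty: bool = False  # set once the processor writes to the block

    def fill(self, tag: int, data: int) -> None:
        """Load a block from main memory: valid and clean."""
        self.tag, self.data = tag, data
        self.valid, self.dirty = True, False

    def write(self, data: int) -> None:
        """Processor write: main memory no longer holds the current value."""
        self.data = data
        self.dirty = True

    def needs_write_back(self) -> bool:
        """Only a valid, dirty block must be copied back before eviction."""
        return self.valid and self.dirty
```

An instruction cache entry in this model would omit `dirty` and `write`, since only the valid bit is needed.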
An implemented replacement policy dictates where in a cache a copy of a particular entry of main memory is stored. If the replacement policy is free to choose any entry in the cache to hold the copy, the cache is a fully-associative cache. At the other extreme, if each entry in main memory can be stored in just one location in a cache, the cache is a direct-mapped cache. Many caches implement a compromise in which each entry in main memory can go to any one of ‘N’ places in the cache (referred to as an N-way set associative cache). In general, associativity is a trade-off. For example, if there are four places to which a replacement policy could have mapped a memory location, then four cache entries must be searched to check whether that location is in the cache. It should be appreciated that checking more cache locations usually requires more power, chip area, and time. On the other hand, caches with more associativity usually suffer fewer misses, so that a processor wastes less time reading from main memory.
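The associativity trade-off may be illustrated with a small model in which the same total capacity is organized either as a direct-mapped cache or as a two-way set associative cache. The `SetAssociativeCache` class, the least-recently-used eviction within a set, and the `tag_compares` counter are assumptions of the sketch.

```python
from typing import Iterable, List


class SetAssociativeCache:
    """ways=1 models a direct-mapped cache; num_sets=1 models full associativity."""

    def __init__(self, num_sets: int, ways: int) -> None:
        self.num_sets = num_sets
        self.ways = ways
        # Each set lists its stored tags, least recently used first.
        self.sets: List[List[int]] = [[] for _ in range(num_sets)]
        self.tag_compares = 0

    def access(self, block: int) -> bool:
        """Return True on a hit, False on a miss (the block is then filled)."""
        lines = self.sets[block % self.num_sets]
        tag = block // self.num_sets
        for position, stored in enumerate(lines):
            self.tag_compares += 1  # each way searched costs a comparison
            if stored == tag:
                lines.append(lines.pop(position))  # mark most recently used
                return True
        if len(lines) >= self.ways:
            lines.pop(0)  # evict the least recently used way
        lines.append(tag)
        return False


def count_misses(cache: SetAssociativeCache, blocks: Iterable[int]) -> int:
    return sum(0 if cache.access(block) else 1 for block in blocks)
```

With the access pattern 0, 4, 0, 4, a four-line direct-mapped cache (`num_sets=4, ways=1`) maps both blocks to the same line and misses every time, whereas the two-way organization (`num_sets=2, ways=2`) keeps both blocks resident after the first two misses, at the cost of searching up to two ways per access.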