Translation Lookaside Buffers (TLBs)
Many modern microprocessors support the notion of virtual memory. In a virtual memory system, instructions of a program executing on the microprocessor refer to data using virtual addresses in a virtual address space of the microprocessor. Additionally, the instructions themselves are referred to using virtual addresses in the virtual address space. The virtual address space is typically much larger than the physical memory actually present in the system. The virtual addresses generated by the microprocessor are translated into physical addresses that are used to access system memory or other devices, such as I/O devices. Typically, the physical addresses are also used to access instruction and data caches of the processor.
A common virtual memory scheme supported by microprocessors is a paged memory system. A paged memory system employs a paging mechanism for translating, or mapping, virtual addresses to physical addresses. The physical address space is divided up into physical pages of fixed size. A common page size is 4 KB. The virtual addresses comprise a virtual page address portion and a page offset portion. The virtual page address specifies a virtual page in the virtual address space. The virtual page address is translated by the paging mechanism into a physical page address. The page offset specifies a physical offset in the physical page, i.e., a physical offset from the physical page address.
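The address split described above can be sketched as follows. This is a minimal illustration, assuming the 4 KB page size mentioned in the text; the function names are hypothetical and chosen only for this example.

```python
PAGE_SIZE = 4 * 1024                       # 4 KB pages, as in the text
OFFSET_BITS = PAGE_SIZE.bit_length() - 1   # 12 offset bits for a 4 KB page


def split_virtual_address(vaddr):
    """Split a virtual address into (virtual page address, page offset)."""
    virtual_page = vaddr >> OFFSET_BITS
    offset = vaddr & (PAGE_SIZE - 1)
    return virtual_page, offset


def physical_address(physical_page, offset):
    """Combine a translated physical page address with the unchanged offset."""
    return (physical_page << OFFSET_BITS) | offset
```

Only the page address portion is translated; the page offset passes through unchanged, which is why the offset is described as an offset from the physical page address.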
The advantages of memory paging are well known. One example of a benefit of memory paging systems is that they enable programs to execute with a larger virtual memory space than the existing physical memory space. Another benefit is that memory paging facilitates relocation of programs in different physical memory locations during different or multiple executions of the program. Another benefit of memory paging is that it allows multiple processes to execute on the processor simultaneously, each having its own allocated physical memory pages to access without having to be swapped in from disk, and without having to dedicate the full physical memory to one process. Another benefit is that memory paging facilitates memory protection from other processes on a page basis.
Page translation, i.e., translation of the virtual page address to the physical page address, is accomplished by what is commonly referred to as a page table walk. Typically, the operating system maintains page tables that contain information for translating the virtual page address to a physical page address. Typically, the page tables reside in system memory. Hence, it is a relatively costly operation to perform a page table walk, since multiple memory accesses must typically be performed to do the translation. The page table walk may be performed by hardware, software, or a combination thereof.
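To make the cost of a page table walk concrete, the following sketch walks a two-level page table and counts the table accesses it performs. The 32-bit address layout (10-bit directory index, 10-bit table index, 12-bit offset), the dictionary-based tables, and all names are assumptions made for illustration; the text does not specify a page table format.

```python
OFFSET_BITS = 12                      # assumed 4 KB pages
INDEX_BITS = 10                       # assumed 10-bit index per table level
INDEX_MASK = (1 << INDEX_BITS) - 1


class PageFault(Exception):
    """Raised when no translation exists for the virtual address."""


def page_table_walk(page_directory, vaddr):
    """Return (physical address, number of table accesses performed)."""
    dir_index = (vaddr >> (OFFSET_BITS + INDEX_BITS)) & INDEX_MASK
    table_index = (vaddr >> OFFSET_BITS) & INDEX_MASK
    offset = vaddr & ((1 << OFFSET_BITS) - 1)

    accesses = 1                      # first access: page directory entry
    page_table = page_directory.get(dir_index)
    if page_table is None:
        raise PageFault(hex(vaddr))

    accesses += 1                     # second access: page table entry
    physical_page = page_table.get(table_index)
    if physical_page is None:
        raise PageFault(hex(vaddr))

    return (physical_page << OFFSET_BITS) | offset, accesses
```

Even in this simplified model, each translation costs two memory accesses before the requested instruction or data can be accessed, which is the overhead a TLB is intended to avoid.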
To improve performance by reducing the number of page table walks, many microprocessors provide a mechanism for caching page table information, which includes physical page addresses translated from recently used virtual page addresses. The page table information cache is commonly referred to as a translation lookaside buffer (TLB). The virtual page address is provided to the TLB, and the TLB performs a lookup of the virtual page address. If the virtual page address hits in the TLB, then the TLB provides the corresponding translated physical page address, thereby avoiding the need to perform a page table walk to translate the virtual page address to the physical page address.
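The lookup behavior described above can be sketched as a small cache placed in front of the page table walk. This is an illustrative model only: the class name, the `walk` callback, and the unbounded capacity are assumptions, and a real TLB is a fixed-size hardware structure with a replacement policy.

```python
class TLB:
    """Toy TLB: caches virtual page -> physical page translations."""

    def __init__(self, walk):
        self.walk = walk      # page table walk, invoked only on a miss
        self.entries = {}     # virtual page address -> physical page address
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_page):
        if virtual_page in self.entries:
            self.hits += 1                    # hit: no page table walk needed
            return self.entries[virtual_page]
        self.misses += 1
        physical_page = self.walk(virtual_page)   # miss: perform the walk
        self.entries[virtual_page] = physical_page
        return physical_page
```

For example, with `tlb = TLB({7: 0x42}.__getitem__)`, the first call to `tlb.translate(7)` performs the walk, and a second call is served from the cached entry.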
In a processor having an instruction cache that is addressed by a physical address, the virtual address of the cache line containing the next instruction to fetch must be translated into a physical address before the instruction cache line can be fetched. In order to efficiently make use of the execution units of the processor, the execution units must be constantly supplied with instructions to execute, or else pipeline bubbles will occur in which the execution units are sitting idle with no valid instructions to execute. This implies that the instruction fetch portion of the processor must fetch instructions at a high enough rate to keep the execution units supplied with instructions. This further implies that the TLB for the instruction cache must provide a high hit rate to enable the instruction cache to supply instructions at a high rate.
Multithreading
Microprocessor designers employ many techniques to increase processor performance. Most microprocessors operate using a clock signal running at a fixed frequency. Each clock cycle the circuits of the microprocessor perform their respective functions. According to Hennessy and Patterson, the true measure of a microprocessor's performance is the time required to execute a program or collection of programs. From this perspective, the performance of a microprocessor is a function of its clock frequency, the average number of clock cycles required to execute an instruction (or, stated inversely, the average number of instructions executed per clock cycle), and the number of instructions executed in the program or collection of programs. Semiconductor scientists and engineers are continually making it possible for microprocessors to run at faster clock frequencies, chiefly by reducing transistor size, resulting in faster switching times. The number of instructions executed is largely fixed by the task to be performed by the program, although it is also affected by the instruction set architecture of the microprocessor. Large performance increases have been realized by architectural and organizational notions that improve the instructions per clock cycle, in particular by notions of parallelism.
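The relationship among the three factors named above can be written as a one-line calculation. The function and parameter names are hypothetical, and the sample numbers in the usage note are made up for illustration.

```python
def execution_time(instruction_count, cycles_per_instruction, clock_hz):
    """Execution time in seconds: instructions x cycles/instruction / frequency."""
    return instruction_count * cycles_per_instruction / clock_hz


def execution_time_from_ipc(instruction_count, instructions_per_cycle, clock_hz):
    """Same quantity, expressed with instructions per clock cycle (1 / CPI)."""
    return instruction_count / (instructions_per_cycle * clock_hz)
```

For instance, a program of one billion instructions averaging 2 cycles per instruction on a 1 GHz clock takes `execution_time(1_000_000_000, 2.0, 1_000_000_000)` seconds, i.e. 2 seconds; halving the cycles per instruction halves the time.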
One notion of parallelism that has improved the instructions per clock cycle, as well as the clock frequency of microprocessors, is pipelining, which overlaps execution of multiple instructions within pipeline stages of the microprocessor. In an ideal situation, each clock cycle one instruction moves down the pipeline to a new stage, which performs a different function on the instruction. Thus, although each individual instruction takes multiple clock cycles to complete, because the multiple cycles of the individual instructions overlap, the average number of clock cycles per instruction is reduced. The performance improvements of pipelining may be realized to the extent that the instructions in the program permit it, namely to the extent that an instruction does not depend upon its predecessors in order to execute and can therefore execute in parallel with its predecessors, which is commonly referred to as instruction-level parallelism. Another way in which instruction-level parallelism is exploited by contemporary microprocessors is the issuing of multiple instructions for execution per clock cycle. These microprocessors are commonly referred to as superscalar microprocessors.
What has been discussed above pertains to parallelism at the level of individual instructions. However, the performance improvement that may be achieved through exploitation of instruction-level parallelism is limited. Various constraints imposed by limited instruction-level parallelism and other performance-constraining issues have recently renewed an interest in exploiting parallelism at the level of blocks, or sequences, or streams of instructions, commonly referred to as thread-level parallelism. A thread is simply a sequence, or stream, of program instructions. A multithreaded microprocessor concurrently executes multiple threads according to some scheduling policy that dictates the fetching and issuing of instructions of the various threads, such as interleaved, blocked, or simultaneous multithreading. A multithreaded microprocessor typically allows the multiple threads to share the functional units of the microprocessor (e.g., instruction fetch and decode units, caches, branch prediction units, and load/store, integer, floating-point, SIMD, etc. execution units) in a concurrent fashion. However, multithreaded microprocessors include multiple sets of resources, or contexts, for storing the unique state of each thread, such as multiple program counters and general purpose register sets, to facilitate the ability to quickly switch between threads to fetch and issue instructions.
One example of a performance-constraining issue addressed by multithreaded microprocessors is the fact that accesses to memory outside the microprocessor that must be performed due to a cache miss typically have a relatively long latency. It is common for the memory access time of a contemporary microprocessor-based computer system to be between one and two orders of magnitude greater than the cache hit access time. Instructions dependent upon the data missing in the cache are stalled in the pipeline waiting for the data to come from memory. Consequently, some or all of the pipeline stages of a single-threaded microprocessor may be idle, performing no useful work, for many clock cycles. Multithreaded microprocessors may solve this problem by issuing instructions from other threads during the memory fetch latency, thereby enabling the pipeline stages to make forward progress performing useful work, somewhat analogously to, but at a finer level of granularity than, an operating system performing a task switch on a page fault. Other examples of performance-constraining issues addressed by multithreaded microprocessors are pipeline stalls and their accompanying idle cycles due to a branch misprediction and concomitant pipeline flush, or due to a data dependence, or due to a long latency instruction such as a divide instruction, floating-point instruction, or the like. Again, the ability of a multithreaded microprocessor to issue instructions from other threads to pipeline stages that would otherwise be idle may significantly reduce the time required to execute the program or collection of programs comprising the threads.
As may be observed from the foregoing, a processor concurrently executing multiple threads may reduce the time required to execute a program or collection of programs comprising the multiple threads. However, concurrently fetching instructions from multiple threads introduces problems with respect to the instruction TLB that may make it difficult for the instruction fetch portion of the processor to supply the execution units of the processor with instructions of the threads at a high enough rate to keep the execution units busy, thereby diminishing the multithreading performance gains.
TLB Access Times
As discussed above, it is important for the instruction TLB to provide a high hit rate to enable the instruction cache to supply instructions at a high rate to the execution units. TLB hit rate is partly a function of TLB size; the greater the number of pages for which the TLB caches translation information, the higher the hit rate, all other things being equal. However, the larger the TLB, the longer the TLB access time. It is desirable to have a fast TLB that requires only a single processor clock cycle, or fraction of a clock cycle, since the physical address is needed to fetch from the instruction cache. However, as processor clock speeds have increased, it has become increasingly difficult to design a fast TLB large enough to provide the desired hit rates. Consequently, processor designers have employed a two-tier TLB architecture that includes a micro-TLB. The micro-TLB is a fast TLB that caches page translation information for a subset of the pages whose information is cached in the larger TLB; consequently, the micro-TLB has a lower hit rate than the larger TLB. The large TLB backs up the micro-TLB such that if the micro-TLB misses, the larger TLB, since it has a higher hit rate, likely provides the physical page address information missing in the micro-TLB. However, the larger TLB supplies the information more slowly than the micro-TLB, in some cases multiple clock cycles later.
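The trade-off between the two tiers can be sketched with a simple weighted average. This is an assumed model, not one given in the text: it treats the larger TLB as always hitting when the micro-TLB misses, and it takes `large_tlb_cycles` to be the total latency seen by a fetch that misses in the micro-TLB. All names and cycle counts are hypothetical.

```python
def effective_access_cycles(micro_hit_rate, micro_tlb_cycles, large_tlb_cycles):
    """Average translation latency of a two-tier TLB under the assumed model."""
    if not 0.0 <= micro_hit_rate <= 1.0:
        raise ValueError("hit rate must be between 0 and 1")
    miss_rate = 1.0 - micro_hit_rate
    return micro_hit_rate * micro_tlb_cycles + miss_rate * large_tlb_cycles
```

With a 1-cycle micro-TLB and a 3-cycle larger TLB, a micro-TLB hit rate of 1.0 gives an average of 1 cycle, while a hit rate of 0.0 gives 3 cycles, i.e. the access time of the larger TLB.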
Because the micro-TLB stores translation information for a relatively small number of pages, in certain situations the hit rate of the micro-TLB may be quite low. For example, assume a four-entry micro-TLB, and assume a program executing on the processor that fetches instructions from five different virtual pages in rapid succession in a cyclical manner. In this situation, the micro-TLB will be thrashed as follows. Assume the translation information for the first four pages is cached in the micro-TLB. When an instruction from the fifth page is fetched, the virtual address of the fifth page will miss in the micro-TLB, and the micro-TLB entry for the first page will be evicted and replaced with the fifth page information obtained from the larger TLB multiple cycles later. An instruction from the first page will be fetched, and its virtual page address will miss in the micro-TLB because it was just evicted by the fifth page, and the micro-TLB entry for the second page will be evicted and replaced with the first page information obtained from the larger TLB multiple cycles later. An instruction from the second page will be fetched, and its virtual page address will miss in the micro-TLB because it was just evicted by the first page, and the micro-TLB entry for the third page will be evicted and replaced with the second page information obtained from the larger TLB multiple cycles later. This process may go on for a while, which essentially reduces the hit rate of the micro-TLB to zero and increases the effective access time of the two-tiered TLB system to the access time of the larger TLB.
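The thrashing sequence above can be reproduced with a short simulation. The sketch assumes a least-recently-used replacement policy, which matches the eviction order in the example (the first page is evicted first, then the second, and so on); the text does not name the policy, and the function and parameter names are hypothetical.

```python
from collections import OrderedDict


def micro_tlb_hit_rate(capacity, page_sequence, preload=()):
    """Fraction of accesses in page_sequence that hit an LRU micro-TLB."""
    tlb = OrderedDict((page, None) for page in preload)  # oldest entry first
    hits = 0
    for page in page_sequence:
        if page in tlb:
            hits += 1
            tlb.move_to_end(page)        # mark as most recently used
        else:
            if len(tlb) >= capacity:
                tlb.popitem(last=False)  # evict the least recently used entry
            tlb[page] = None             # refill from the larger TLB
    return hits / len(page_sequence)


# Four-entry micro-TLB holding pages 1-4, then fetches cycle through 5, 1, 2, 3, 4.
thrash_rate = micro_tlb_hit_rate(4, [5, 1, 2, 3, 4] * 20, preload=[1, 2, 3, 4])
# Same micro-TLB, but the program cycles through only four pages.
fit_rate = micro_tlb_hit_rate(4, [1, 2, 3, 4] * 20, preload=[1, 2, 3, 4])
```

Under these assumptions every access in the five-page cycle misses, so `thrash_rate` is 0.0, whereas the four-page cycle fits in the micro-TLB and `fit_rate` is 1.0.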
Although the example just given effectively illustrates a program that could thrash a micro-TLB to cause the effective access time of the two-tiered TLB system to approach the access time of the larger TLB, the example is very unlikely to happen, and if it does, at least the program will execute, albeit more slowly than hoped. Nevertheless, the greater the number of disparate pages from which the program fetches instructions, and the closer together in time the disparate pages are accessed, the more the effective access time of the two-tiered TLB system approaches the access time of the larger TLB.
Multithreading Processors and TLBs
In many applications, the various threads being concurrently fetched by a multithreading processor are likely being fetched from disparate pages, and are likely being fetched close together in time. Consequently, in these applications, the TLB thrashing example given above is more likely to be the rule, rather than the exception, in a multithreading processor concurrently fetching more threads than the number of pages for which the micro-TLB is caching translation information. If some of the threads are alternating between two (or more) pages close together in time, the likelihood of thrashing increases even more. As the effective access time of the TLB system approaches the access time of the larger TLB, the instruction fetch pipeline may not be able to fetch enough instructions to keep the execution units supplied with instructions, thereby potentially offsetting the gains in execution pipeline efficiency hoped for by employing multithreading.
Even worse, a pathological case may occur in which one or more of the threads is essentially starved from fetching any instructions, and therefore can make no forward progress. Assume the four-entry micro-TLB above and instructions from eight threads being concurrently fetched from eight distinct virtual memory pages in a cyclical manner. Assume the translation information for the first four threads is cached in the micro-TLB. When an instruction from the fifth thread is fetched, the virtual address of the fifth thread will miss in the micro-TLB, and the micro-TLB entry for the first thread will be evicted and replaced with the fifth thread information obtained from the larger TLB. However, because the processor has other threads for which it could be fetching while the larger TLB is being accessed, it will access the micro-TLB for the sixth thread, and the virtual address of the sixth thread will miss in the micro-TLB, and the micro-TLB entry for the second thread will be evicted and replaced with the sixth thread information obtained from the larger TLB. The processor will next access the micro-TLB for the seventh thread, and the virtual address of the seventh thread will miss in the micro-TLB, and the micro-TLB entry for the third thread will be evicted and replaced with the seventh thread information obtained from the larger TLB. The processor will next access the micro-TLB for the eighth thread, and the virtual address of the eighth thread will miss in the micro-TLB, and the micro-TLB entry for the fourth thread will be evicted and replaced with the eighth thread information obtained from the larger TLB. The processor will next access the micro-TLB for the first thread, and the virtual address of the first thread will miss in the micro-TLB, and the micro-TLB entry for the fifth thread will be evicted and replaced with the first thread information obtained from the larger TLB. 
This process will continue for four more cycles until the processor again accesses the micro-TLB for the fifth thread, at which point the virtual address of the fifth thread will miss in the micro-TLB, even though its translation information was placed into the micro-TLB earlier in response to its previous miss. Consequently, the fifth thread can make no forward progress. In fact, in the scenario just described, no thread will make forward progress.
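The starvation scenario can be sketched with a round-robin simulation. The model is a simplification under stated assumptions: each thread fetches from its own single page, a thread fetches an instruction only when its page hits in the micro-TLB, a miss refills the micro-TLB by evicting the least recently used entry, and the missing thread retries on its next turn. The micro-TLB starts empty rather than preloaded, which reaches the same steady state. All names are hypothetical.

```python
from collections import OrderedDict


def fetch_progress(num_threads, capacity, rounds):
    """Return the number of successful fetches per thread after `rounds` turns each."""
    tlb = OrderedDict()              # micro-TLB contents, oldest entry first
    fetched = [0] * num_threads
    for _ in range(rounds):
        for thread in range(num_threads):
            page = thread            # assumption: one distinct page per thread
            if page in tlb:
                tlb.move_to_end(page)
                fetched[thread] += 1         # hit: the thread fetches an instruction
            else:
                if len(tlb) >= capacity:
                    tlb.popitem(last=False)  # evict another thread's entry
                tlb[page] = None             # refill; the thread retries next turn
    return fetched


starved = fetch_progress(num_threads=8, capacity=4, rounds=10)
healthy = fetch_progress(num_threads=4, capacity=4, rounds=10)
```

Under these assumptions, with eight threads and four entries each refill is evicted before its thread returns, so every count in `starved` is 0; with four threads, each thread misses once and then hits on every later turn, so every count in `healthy` is 9.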
Therefore, what is needed is a TLB architecture with a high hit rate for a multithreading processor without a significantly increased aggregate access time.