To avoid high-latency accesses to operating system (OS) page tables storing virtual-to-physical page translations, processor Memory Management Units (MMUs) store commonly used translations in instruction and data Translation Lookaside Buffers (TLBs). Most contemporary systems place TLBs in parallel with the first-level cache. Because TLB misses carry long penalties, TLB behavior significantly affects processor performance. As a result, numerous techniques have been proposed to improve TLB performance. On the hardware side, TLB characteristics such as size, associativity, and the use of multilevel hierarchies have been explored. On the software side, the concept of superpaging has been examined. Hardware/software prefetching techniques have also been investigated in detail. While effective, the proposed prefetchers are specific to uniprocessors. With the growing dominance of chip multiprocessors (CMPs), it is advantageous to examine TLB performance in the context of parallel workloads. Recent characterizations of emerging parallel workloads on CMPs show that significant similarities exist in TLB miss patterns among multiple cores. These similarities take one of two forms: TLB misses caused by identical virtual pages on multiple cores, or predictable strides between the virtual pages that cause TLB misses on different cores.
Since TLBs are usually placed in parallel with first-level caches, CMPs maintain per-core instruction and data TLBs, which are largely oblivious to the behavior of other TLBs, except for shootdowns used for coherence. These TLBs are either hardware-managed or software-managed. Hardware-managed TLBs use a hardware state machine to walk the page table, locate the appropriate mapping, and insert it into the TLB on every miss. Because the page table walk is initiated by a hardware structure, there is no need for expensive interrupts and the pipeline remains largely unaffected. Moreover, because the miss is handled by a state machine rather than by handler code, it does not pollute the instruction cache. Past studies have shown the performance benefits of hardware-managed TLBs, with typical miss latencies ranging from 10 to 50 cycles. Although hardware-managed TLBs do offer performance benefits, they also imply a fixed page table organization. As such, the OS cannot employ alternative page table designs.
In response, reduced instruction set computing (RISC) architectures such as MIPS and Scalable Processor Architecture (SPARC) often use software-managed TLBs. In such instances, a TLB miss causes an interrupt, and the OS executes a miss handler which walks the page table and refills the TLB. Since the OS controls the page table walk, the data structure design is flexible. This flexibility, however, comes with an associated performance cost. First, precise interrupts prompt pipeline flushes, removing a possibly large number of instructions from the reorder buffer. Second, the miss handler tends to be 10 to 100 instructions long and may itself miss in the instruction cache. Third, the page table walk may pollute the data cache.
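The software-managed refill path described above can be illustrated with a minimal sketch. This is a toy model and not the handler of any particular architecture or OS: the class name SoftwareManagedTLB, the dictionary standing in for the page table, the default capacity, and the LRU replacement policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class PageFault(Exception):
    """Raised when the (hypothetical) page table has no mapping for a page."""


class SoftwareManagedTLB:
    """Toy model: a small TLB refilled by a software miss handler."""

    def __init__(self, page_table, capacity=4):
        # The page table is a plain dict here (virtual page -> physical frame).
        # Under software management the OS could use any structure it likes.
        self.page_table = page_table
        self.capacity = capacity
        self.entries = OrderedDict()  # ordered from least to most recently used
        self.misses = 0

    def translate(self, vpn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)  # hit: mark as most recently used
            return self.entries[vpn]
        # Miss: real hardware would raise an interrupt and flush the pipeline;
        # this model simply counts the miss and calls the handler.
        self.misses += 1
        return self._miss_handler(vpn)

    def _miss_handler(self, vpn):
        # "Walk" the page table, then refill the TLB.
        if vpn not in self.page_table:
            raise PageFault(vpn)
        pfn = self.page_table[vpn]
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # assumed policy: evict the LRU entry
        self.entries[vpn] = pfn
        return pfn
```

Because the walk lives in _miss_handler rather than in fixed hardware, swapping the dictionary for a different page table organization would change only that method, which is the flexibility the text attributes to software management.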
All these factors contribute to TLB miss latencies that can span hundreds of cycles. Numerous studies in the 1990s investigated the performance overheads of TLB management in uniprocessors. These studies placed TLB handling at 5-10% of system runtime, with extreme cases reaching 40%. To tackle TLB management overheads, early work addressed hardware characteristics such as TLB size, associativity, and multilevel hierarchies.
More recently, TLB prefetching schemes have also been explored. For example, Recency-based prefetching has been introduced to exploit the observation that pages referenced around the same time in the past will be referenced around the same time in the future. In this approach, two sets of pointers are added to each page table entry to track virtual pages referenced in temporal proximity to the current virtual page. While effective, this strategy leads to a larger page table.
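The idea behind Recency-based prefetching can be sketched with a deliberately simplified software model. The sketch assumes the temporal ordering of pages is kept as a single recency list, which stands in for the per-entry pointers mentioned above; the class name RecencyPrefetcher and its method are hypothetical, and a real design would store the pointers in the page table rather than search a list.

```python
class RecencyPrefetcher:
    """Toy model: prefetch the pages that sat next to the missing page in recency order."""

    def __init__(self):
        # Pages ordered from least to most recently missed. Each page's
        # neighbours in this list play the role of its two pointers.
        self.stack = []

    def on_miss(self, page):
        """Record a TLB miss on `page`; return the pages to prefetch."""
        neighbours = []
        if page in self.stack:
            i = self.stack.index(page)
            if i > 0:
                neighbours.append(self.stack[i - 1])  # referenced just before
            if i + 1 < len(self.stack):
                neighbours.append(self.stack[i + 1])  # referenced just after
            self.stack.pop(i)
        self.stack.append(page)  # `page` is now the most recent
        return neighbours
```

In this model a page seen for the first time yields no prefetch, since it has no recorded neighbours; predictions only appear once a page recurs in the miss stream.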
In response, cache prefetching techniques such as Sequential, Arbitrary-Stride, and Markov prefetching have been adapted to TLBs. In addition, a Distance-based TLB prefetcher has been proposed that tries to detect repetitive strides, as well as the patterns that Markov and Recency prefetching capture, while employing a modest amount of hardware. Specifically, the Distance-based approach tracks the difference, or distance, between successive TLB miss virtual pages and attempts to capture repetitive distance pairs in the miss stream. On every TLB miss, the goal is to use the distance between the last miss virtual page and the current miss virtual page to predict the next expected distance and, hence, the next miss virtual page. A prefetch is then initiated for the predicted next miss virtual page.
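The distance-pair mechanism can be illustrated with a short sketch. It is a software model under stated assumptions, not the hardware design itself: the class name DistancePrefetcher, the unbounded table keyed by distance, and the limit of two predicted distances per entry are all hypothetical choices made for illustration.

```python
from collections import defaultdict, deque


class DistancePrefetcher:
    """Toy model of Distance-based TLB prefetching."""

    def __init__(self, slots_per_distance=2):
        # Maps a distance to the distances recently seen to follow it.
        # The number of slots per entry is an assumed parameter.
        self.table = defaultdict(lambda: deque(maxlen=slots_per_distance))
        self.last_page = None
        self.last_distance = None

    def on_miss(self, page):
        """Record a TLB miss on virtual page `page`; return pages to prefetch."""
        predictions = []
        if self.last_page is not None:
            distance = page - self.last_page
            # Learn: the previous distance was followed by the current one.
            if self.last_distance is not None:
                followers = self.table[self.last_distance]
                if distance not in followers:
                    followers.append(distance)
            # Predict: apply every distance known to follow the current one.
            predictions = [page + d for d in self.table.get(distance, ())]
            self.last_distance = distance
        self.last_page = page
        return predictions
```

For a miss stream with alternating distances, such as pages 1, 2, 4, 5, 7, the model learns that distance 1 is followed by distance 2 and vice versa, so the miss on page 5 predicts page 7 and the miss on page 7 predicts page 8. Keying the table by distance rather than by page is what lets one entry cover every page that shares the same stride pattern.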
While these prefetchers exhibit performance benefits, they all target uniprocessors. As CMPs become ubiquitous, there exists a need to improve the performance of CMPs by reducing TLB misses across multiple cores.