1. Field of the Invention
This invention is related to translation lookaside buffers (TLBs) and demapping of translations in the TLBs, especially in multithreaded processors.
2. Description of the Related Art
Processors and computer systems that include the processors typically implement a virtual memory system, in which most software executing on the processors and accessing memory does so using virtual (or effective) addresses. These addresses are translated through the virtual memory system to physical addresses, which are used to access memory. A virtual memory system offers several benefits: it allows software to address a larger virtual memory space than the actual physical memory included in the system; it allows multiple independent processes to access memory while being protected from interfering with each other (e.g. by one process modifying data used by another process); etc.
Generally, the virtual memory system (and particularly the mapping of virtual addresses to physical addresses) is under software control. Software builds data structures in memory that describe the virtual to physical mappings. These data structures are usually referred to as “page tables”, since many translations translate a page of the virtual address space to a page of physical memory, aligned to a page boundary in the physical memory space. Page sizes vary, from 4 kilobytes to several megabytes or even larger. A given virtual memory system often supports more than one page size.
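As an illustration of page-granular translation, the following minimal sketch splits a virtual address into a virtual page number and a page offset and looks the page number up in a single-level table. The 4 kilobyte page size, the dictionary-based page table, and the names `split_virtual_address` and `translate` are assumptions made for this example; actual page table formats are defined by the particular virtual memory system.

```python
PAGE_SIZE = 4096  # assumed 4 KB page size, for illustration only


def split_virtual_address(vaddr, page_size=PAGE_SIZE):
    """Return (virtual page number, offset within the page)."""
    return vaddr // page_size, vaddr % page_size


def translate(vaddr, page_table, page_size=PAGE_SIZE):
    """Translate a virtual address using a hypothetical one-level page table.

    page_table maps virtual page numbers to physical page numbers.
    A missing entry raises KeyError, which stands in for a page fault.
    """
    vpn, offset = split_virtual_address(vaddr, page_size)
    ppn = page_table[vpn]
    # The physical page is aligned to a page boundary, so the offset carries over.
    return ppn * page_size + offset


page_table = {0x12: 0x80}          # virtual page 0x12 -> physical page 0x80
paddr = translate(0x12345, page_table)
print(hex(paddr))
```

Because the physical page is aligned to a page boundary, only the page number is translated; the offset within the page is the same in the virtual and physical addresses.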
Performing a translation frequently requires several accesses to the page tables in memory. Even if the page tables are cached, searching the page table entries is lengthy compared to the time needed to execute a given instruction. Performing the translation process on every memory access (both instruction fetches and load/store operations) would add latency and hamper performance. Accordingly, most processors implement a cache for a subset of the translations from the page tables, referred to as a translation lookaside buffer (TLB). The TLB caches the results of the translation process, which include an identification of the virtual address and the corresponding physical address, as well as any protection data that may be included in the virtual memory system (again, generally under the control of software). The data cached in the TLB and used to translate a given range of virtual addresses (e.g. a page) is referred to as a “translation” for the range/page. The translation may include contents derived from multiple page table entries or one page table entry, depending on the definition of the virtual memory system.
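The caching behavior described above can be sketched with a small software model. This is an illustrative assumption rather than a description of any particular hardware: the classes `PageTable` and `TLB`, the `walks` counter, and the unbounded dictionary used for TLB entries are all hypothetical, and protection data and replacement policy are omitted.

```python
class PageTable:
    """Hypothetical page table; counts how often it is searched."""

    def __init__(self, mappings):
        self.mappings = dict(mappings)  # virtual page number -> physical page number
        self.walks = 0

    def walk(self, vpn):
        self.walks += 1                 # each walk models the slow search of page table entries
        return self.mappings.get(vpn)   # None stands in for a page fault


class TLB:
    """Hypothetical TLB that caches the results of page table walks."""

    def __init__(self, page_table):
        self.page_table = page_table
        self.entries = {}

    def lookup(self, vpn):
        if vpn in self.entries:         # TLB hit: no page table access needed
            return self.entries[vpn]
        ppn = self.page_table.walk(vpn)  # TLB miss: perform the translation process
        if ppn is not None:
            self.entries[vpn] = ppn      # cache the translation for later accesses
        return ppn


page_table = PageTable({1: 7})
tlb = TLB(page_table)
first = tlb.lookup(1)    # miss, walks the page table
second = tlb.lookup(1)   # hit, served from the TLB
print(first, second, page_table.walks)
```

In this model the second access to the same page does not touch the page table, which is the latency saving that motivates the TLB.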
At various points in time, the software that controls the virtual memory system (referred to as memory management software) needs to reclaim physical memory that is currently mapped (e.g. to map other virtual addresses to the physical memory). To reclaim physical memory, the memory management software invalidates the current mappings to the physical memory in the page tables and establishes the new mappings. However, the old translations may be cached in the TLBs. Accordingly, the memory management software, after modifying the page tables, must also ensure that any translations derived from the invalidated page table entries are invalidated in the TLBs.
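The need to invalidate the TLB after modifying the page tables can be shown with a minimal sketch. The dictionaries `page_table` and `tlb` and the functions `lookup` and `demap` are hypothetical names chosen for this example, with a KeyError standing in for a page fault.

```python
page_table = {0x10: 0xA0}   # virtual page 0x10 currently maps to physical page 0xA0
tlb = {}                    # cached translations


def lookup(vpn):
    """Translate through the TLB, filling it from the page table on a miss."""
    if vpn not in tlb:
        tlb[vpn] = page_table[vpn]   # KeyError models a page fault
    return tlb[vpn]


def demap(vpn):
    """Remove a cached translation from the TLB."""
    tlb.pop(vpn, None)


lookup(0x10)                # translation is now cached in the TLB
del page_table[0x10]        # memory management software reclaims the page

stale_ppn = lookup(0x10)    # still succeeds: the TLB holds the old translation

demap(0x10)                 # invalidate the cached translation
try:
    lookup(0x10)
    faulted_after_demap = False
except KeyError:
    faulted_after_demap = True

print(hex(stale_ppn), faulted_after_demap)
```

Until the demap is performed, accesses continue to reach the reclaimed physical page through the stale TLB entry, even though the page tables no longer contain the mapping.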
In a single-threaded processor, it is sufficient for the memory management software to issue a TLB translation removal operation (referred to herein as a demap operation) on the processor and the TLB (or TLBs) in the processor will be invalidated. In multiprocessor systems, the TLBs on each processor must be invalidated. In some cases, the demap operation is issued on each processor. In the PowerPC architecture, two instructions are used to invalidate TLBs in a multiprocessor system: a TLB invalidate instruction (tlbie) that is transmitted by the executing processor to all other processors, and a TLB synchronization instruction (tlbsync) that follows the tlbie. The executing processor also transmits the tlbsync to all other processors, which retry the tlbsync until the TLB invalidation is complete. When the tlbsync is successfully transmitted without retry, all TLBs are known to be invalidated. Since the tlbsync is retried until a TLB invalidation is complete, the possibility of livelock exists if two processors are executing the tlbie/tlbsync sequence at the same time. Thus, software must ensure that only one processor executes the code sequence that includes the tlbie/tlbsync. Additional instructions may also be required (e.g. a synchronization instruction (sync) prior to the tlbie).
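The software-enforced serialization mentioned above can be sketched as follows. This is a loose software model, not PowerPC code: the `Processor` class, the `invalidate_lock`, and the function `invalidate_everywhere` are assumptions for illustration, with the loop standing in for the broadcast invalidate and the final check standing in for the synchronization step.

```python
import threading


class Processor:
    """Hypothetical processor with a private TLB."""

    def __init__(self, name):
        self.name = name
        self.tlb = {}


# Assumed software lock: only one processor at a time may run the
# invalidate/synchronize sequence, which avoids the livelock case.
invalidate_lock = threading.Lock()


def invalidate_everywhere(processors, vpn):
    """Invalidate vpn in every processor's TLB, then confirm completion."""
    with invalidate_lock:
        for cpu in processors:          # analogue of broadcasting the invalidate
            cpu.tlb.pop(vpn, None)
        # analogue of the synchronization step: succeeds once all TLBs are clean
        return all(vpn not in cpu.tlb for cpu in processors)


cpus = [Processor(f"cpu{i}") for i in range(4)]
for cpu in cpus:
    cpu.tlb[0x10] = 0xA0

done = invalidate_everywhere(cpus, 0x10)
print(done)
```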
In a multithreaded processor, hardware is provided to support each thread that can be concurrently active in the processor. The hardware that supports a given thread is referred to as a “strand”. Strands may share a TLB. At any given point in time, two or more threads may have a translation cached outside of the TLB. For example, the hardware table walker may have already read the translation, prior to page table modification and in response to a TLB miss for a thread on another strand that shares the TLB, but not yet loaded it into the TLB. Alternatively, the translation may be cached by memory management software executing in response to a page fault from another thread, such as in registers that correspond to the strand or memory locations separate from the page tables. Accordingly, it is possible that, even after executing the demap operation on one strand, other threads executing on other strands that share the TLB may cause the translation to be reloaded in the TLB.
To prevent the reload after the demap operation, memory management software typically issues the demap operation on each strand that may share the TLB. Once the demap operation has completed on each strand, the translation is known to be invalidated in the TLB and will not be reloaded (since the translation is invalidated in the page tables and is no longer cached with respect to other active threads). While functionally correct, this solution is a low-performance mechanism since the thread executing on each strand must be interrupted to perform the demap operation.
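The reload hazard and the conventional remedy can be modeled with a short sketch. All names here (`Strand`, `SharedTLB`, `table_walk`, `complete_fill`, `demap`, `run`) are hypothetical, and the model assumes that performing a demap on a strand also discards any translation that strand has read but not yet loaded into the TLB.

```python
class Strand:
    """Hypothetical strand; 'pending' holds translations read but not yet in the TLB."""

    def __init__(self, name):
        self.name = name
        self.pending = {}


class SharedTLB:
    """Hypothetical TLB shared by several strands."""

    def __init__(self):
        self.entries = {}


def table_walk(strand, page_table, vpn):
    """Table walker reads a translation for the strand after a TLB miss."""
    if vpn in page_table:
        strand.pending[vpn] = page_table[vpn]


def complete_fill(strand, tlb):
    """The strand finishes its miss handling and loads the TLB."""
    tlb.entries.update(strand.pending)
    strand.pending.clear()


def demap(strand, tlb, vpn):
    """Demap on one strand: clears the TLB entry and that strand's pending copy (assumed)."""
    tlb.entries.pop(vpn, None)
    strand.pending.pop(vpn, None)


def run(demap_on_all_strands):
    """Return True if the stale translation ends up back in the shared TLB."""
    page_table = {0x10: 0xA0}
    tlb = SharedTLB()
    s0, s1 = Strand("s0"), Strand("s1")

    table_walk(s1, page_table, 0x10)   # s1 reads the translation before the modification
    del page_table[0x10]               # page table entry is invalidated
    demap(s0, tlb, 0x10)               # demap issued on the initiating strand
    if demap_on_all_strands:
        demap(s1, tlb, 0x10)           # conventional fix: demap on every sharing strand
    complete_fill(s1, tlb)             # s1 now completes its TLB load
    return 0x10 in tlb.entries


reloaded_after_single_demap = run(demap_on_all_strands=False)
reloaded_after_all_demaps = run(demap_on_all_strands=True)
print(reloaded_after_single_demap, reloaded_after_all_demaps)
```

In this model a demap on a single strand leaves the stale translation free to reappear in the shared TLB, while a demap on every sharing strand prevents the reload at the cost of interrupting each strand.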
Even if fewer than all the strands could be interrupted to perform a demap operation, multiple strands would still need to be interrupted if multiple TLBs are included and different sets of strands share different TLBs. For example, a multicore, multithreaded processor may include multiple processor cores, each having multiple strands that share a TLB in the core. In such systems, one or more strands that share each TLB would have to be interrupted to perform a demap operation.
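The cost can be made concrete with a small count. The core and strand numbers below and the function `interrupts_needed` are assumptions chosen only for illustration; the second case uses the lower bound of one interrupted strand per TLB.

```python
def interrupts_needed(cores, all_strands=True):
    """Count strands interrupted for one demap.

    cores is a list of lists; each inner list names the strands that
    share one TLB (one TLB per core in this hypothetical layout).
    """
    if all_strands:
        return sum(len(strands) for strands in cores)   # every strand is interrupted
    return len(cores)                                   # at least one strand per TLB


# Assumed layout: 8 cores, 4 strands per core.
cores = [[f"core{c}.strand{s}" for s in range(4)] for c in range(8)]

every_strand = interrupts_needed(cores)
one_per_tlb = interrupts_needed(cores, all_strands=False)
print(every_strand, one_per_tlb)
```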