Memory Management Units (MMUs) are commonly used in microprocessors to provide virtual memory capability. When virtual memory is enabled, software executing on the processor only sees and uses Virtual Addresses (VAs). The MMU is tasked with converting a VA into a Physical Address (PA) that can then be used inside and outside the processor. Using virtual memory has a number of advantages, including giving the illusion of more memory than is actually available, giving access to a physical memory system that has more address bits than are supported by the software, and protecting the physical memory with varying access rights.
Some modern systems that support virtualization have two levels of address translation between VAs and PAs. The first level is similar to that found on a non-virtualized system, but the address it produces is not the final PA. It may be called an Intermediate Physical Address (IPA) or a Guest Physical Address (GPA). The second level maps that intermediate address into a final PA. In these systems, for any software running on the processor, the first level, the second level, or both may be enabled.
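The two levels of translation described above can be illustrated with a minimal sketch. It assumes 4 KiB pages and models each level as a flat mapping from page number to page number; the names `translate_stage` and `two_stage_translate`, and the convention that a disabled level is passed as `None`, are hypothetical and chosen only for illustration.

```python
PAGE_SHIFT = 12                      # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate_stage(table, addr):
    """Translate one address through one level; a disabled level is the identity."""
    if table is None:                # assumption: None models a disabled level
        return addr
    page = addr >> PAGE_SHIFT
    offset = addr & PAGE_MASK
    return (table[page] << PAGE_SHIFT) | offset


def two_stage_translate(va, stage1, stage2):
    """VA -> IPA through the first level, then IPA -> PA through the second."""
    ipa = translate_stage(stage1, va)
    return translate_stage(stage2, ipa)


if __name__ == "__main__":
    stage1 = {0x10: 0x20}            # VA page 0x10 maps to IPA page 0x20
    stage2 = {0x20: 0x80}            # IPA page 0x20 maps to PA page 0x80
    print(hex(two_stage_translate(0x10ABC, stage1, stage2)))
```

In this sketch the page offset passes through both levels unchanged, and disabling either level simply forwards the address to the next step.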
In general, the virtual address space is divided into pages. Pages are commonly a few kilobytes, though other page sizes can be used. Systems often support multiple page sizes, from a few kilobytes to a few megabytes or even gigabytes, to increase translation efficiency. All addresses within a page are translated in the same fashion and share the same access right information. The translation between VAs and PAs is done through an (often multi-level) page table. The process of going through the page table to translate a VA into a PA is often called walking, as it comprises a sequence of page table lookups.
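A page table walk of the kind described above can be sketched as follows. The sketch assumes 4 KiB pages and a three-level table with 9 index bits per level; these parameters, the nested-dictionary representation, and the names `split_va` and `walk` are illustrative assumptions, not the format of any particular architecture.

```python
PAGE_SHIFT = 12        # assumption: 4 KiB pages
BITS_PER_LEVEL = 9     # assumption: 512 entries per table
LEVELS = 3             # assumption: three-level page table


def split_va(va):
    """Split a VA into one table index per level plus the offset within the page."""
    offset = va & ((1 << PAGE_SHIFT) - 1)
    vpn = va >> PAGE_SHIFT
    indices = []
    for level in range(LEVELS):
        shift = BITS_PER_LEVEL * (LEVELS - 1 - level)
        indices.append((vpn >> shift) & ((1 << BITS_PER_LEVEL) - 1))
    return indices, offset


def walk(root, va):
    """Walk nested dict tables; return (PA, access rights, number of lookups)."""
    indices, offset = split_va(va)
    node = root
    accesses = 0
    for index in indices:
        accesses += 1                      # each level costs one sequential lookup
        node = node.get(index)
        if node is None:
            raise LookupError("translation fault")
    ppn, rights = node                     # leaf: physical page number and rights
    return (ppn << PAGE_SHIFT) | offset, rights, accesses


if __name__ == "__main__":
    root = {1: {2: {3: (0xABCDE, "rw")}}}
    va = (1 << 30) | (2 << 21) | (3 << 12) | 0x123
    print(walk(root, va))
```

Every address in the same page reaches the same leaf entry, which is why the translation and the access rights are identical across the page; only the offset differs.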
The MMU often comprises two parts. The first part is called the Translation Look-aside Buffer (TLB). It caches translations so that they are very quickly accessible to the processor; for translations that are cached, the processor can execute with little delay. The second part is the walker, which walks the page tables when the TLB does not contain a translation. In some systems, there is more caching between the TLB and the walker. For instance, the TLB may have two levels of caching, and the walker itself can contain a cache.
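The division of labor between the TLB and the walker can be sketched as a small cache placed in front of a walk function. The class name `TLB`, the least-recently-used replacement policy, and the 4 KiB page size are assumptions made for illustration; real TLBs vary in organization and replacement policy.

```python
from collections import OrderedDict


class TLB:
    """A small LRU cache of page translations in front of a walker (illustrative)."""

    def __init__(self, capacity, walker, page_shift=12):
        self.capacity = capacity
        self.walker = walker              # callable: virtual page number -> physical page number
        self.page_shift = page_shift      # assumption: 4 KiB pages by default
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, va):
        vpn = va >> self.page_shift
        offset = va & ((1 << self.page_shift) - 1)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)         # mark as most recently used
            ppn = self.entries[vpn]
        else:
            self.misses += 1
            ppn = self.walker(vpn)                # slow path: walk the page table
            self.entries[vpn] = ppn
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used entry
        return (ppn << self.page_shift) | offset


if __name__ == "__main__":
    tlb = TLB(capacity=2, walker=lambda vpn: vpn + 0x100)
    for va in (0x1000, 0x1004, 0x2000, 0x3000, 0x1000):
        print(hex(va), "->", hex(tlb.translate(va)))
    print("hits:", tlb.hits, "misses:", tlb.misses)
```

Only the miss path invokes the walker, so a request whose translation is cached completes without any page table access.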
A System MMU (SMMU) mirrors the functionality of an MMU, but applied to initiators other than microprocessors. Some such initiators are GPUs, DSPs, and DMA engines. With an SMMU, such initiators can take advantage of the benefits of virtual memory and virtualization. Like an MMU, an SMMU operates on pages and uses page tables to calculate translations. In some cases, an SMMU uses the same page table formats as the MMU of a processor to which the SMMU's initiator is connected. In that case, the page table can be shared between the MMU and the SMMU.
Just as an MMU comprises a TLB and a walker, an SMMU comprises a System TLB (STLB) and a walker. The STLB acts as a cache for translations to help maintain the peak performance of initiators. In some cases, multiple STLBs can share a single walker for efficiency reasons. In some cases, multiple initiators can share an STLB.
In most cases, TLBs inside processors are tightly integrated with the processor because physical addresses are needed inside the processor (e.g. for caches that are visible to cache coherency). In contrast, an STLB does not have to be integrated inside an initiator. It may be placed outside the initiator without any negative impact. In many cases, multiple initiators share a single STLB. An STLB just needs to be between the source and the destination of a request to provide translation services. Most chip designs have an interconnect that provides access by initiators to targets. STLBs may be placed between the initiators and the interconnect, or within the interconnect, close to the initiators.
The addition of virtual memory to an initiator can have severe negative effects on the performance of the initiator. In addition to the latency increase due to the extra logic, any time the STLB misses (i.e. the translation necessary to process a request from the initiator is not cached), the request must be stalled until that translation has been resolved, either by another level of cache or by a walk. This will impact the performance of the request and might also impact following requests. Since a page table walk typically takes between 1 and 20 sequential accesses and each access typically takes between 10 and 100 clock cycles, the initiator request might be stalled for a long time and the performance of the initiator might be drastically reduced.
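The figures above imply a walk cost ranging from 10 to 2,000 clock cycles. A minimal sketch of that arithmetic follows; the function names, the one-cycle hit cost, and the example miss rate are assumptions introduced only to show how the miss rate scales the average cost.

```python
def walk_cycles(accesses, cycles_per_access):
    """Cycles for one walk: the accesses are sequential, so their costs add up."""
    return accesses * cycles_per_access


def average_translation_cycles(miss_rate, walk_cost, hit_cost=1):
    """Average cycles per request; hit_cost=1 is an illustrative assumption."""
    return hit_cost + miss_rate * walk_cost


if __name__ == "__main__":
    best = walk_cycles(1, 10)        # shortest walk, fastest access
    worst = walk_cycles(20, 100)     # longest walk, slowest access
    print("walk cost range:", best, "to", worst, "cycles")
    # Assumed 25% miss rate and a 400-cycle walk, for illustration only.
    print("average:", average_translation_cycles(0.25, 400), "cycles per request")
```

Under these assumptions, even a modest miss rate makes the average translation cost far larger than the cost of a hit.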
The number of STLB misses can be reduced by increasing the size of the STLB cache, and the cost of each miss can be reduced by shortening the walk delay. However, this is insufficient. Some misses are compulsory misses (e.g. when a page is seen for the first time and the cache has no chance to already contain the translation), and a larger cache does not avoid them. In some extreme examples, large amounts of data with poor page locality are requested by the initiator, triggering large numbers of compulsory misses.
Therefore, what is needed is a mechanism to prefetch translations into the STLB so that the probability of a request missing in the STLB is reduced or eliminated. This is particularly applicable to System TLBs because initiators tend to have predictable memory access patterns and in many cases can also be enhanced to generate advanced prefetch patterns that predict future memory access patterns.
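The effect of prefetching on compulsory misses can be illustrated with a simple sketch. It is not the mechanism of any particular design: it assumes an initiator that streams linearly through memory, an STLB of unbounded capacity, 4 KiB pages, and a prefetch that completes before the demand request arrives. The names `PrefetchingSTLB` and `stream` are hypothetical.

```python
PAGE_SHIFT = 12   # assumption: 4 KiB pages


class PrefetchingSTLB:
    """Unbounded translation cache with an explicit prefetch port (illustrative)."""

    def __init__(self, walker):
        self.walker = walker          # callable: virtual page number -> physical page number
        self.entries = {}
        self.demand_misses = 0

    def prefetch(self, vpn):
        # Assumption: the walk triggered here finishes before the page is needed.
        if vpn not in self.entries:
            self.entries[vpn] = self.walker(vpn)

    def translate(self, va):
        vpn = va >> PAGE_SHIFT
        if vpn not in self.entries:
            self.demand_misses += 1   # the request would stall for a walk here
            self.entries[vpn] = self.walker(vpn)
        return (self.entries[vpn] << PAGE_SHIFT) | (va & ((1 << PAGE_SHIFT) - 1))


def stream(stlb, start, length, step, prefetch_ahead):
    """Linear access pattern; optionally prefetch the next page's translation."""
    for va in range(start, start + length, step):
        if prefetch_ahead:
            stlb.prefetch((va >> PAGE_SHIFT) + 1)
        stlb.translate(va)
    return stlb.demand_misses


if __name__ == "__main__":
    span = 8 * (1 << PAGE_SHIFT)      # eight pages of data
    plain = stream(PrefetchingSTLB(lambda vpn: vpn), 0, span, 64, False)
    ahead = stream(PrefetchingSTLB(lambda vpn: vpn), 0, span, 64, True)
    print("demand misses without prefetch:", plain)
    print("demand misses with prefetch:", ahead)
```

Without prefetching, the first access to each page misses regardless of cache size. With next-page prefetching on this predictable pattern, only the very first page incurs a demand miss in the sketch.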