Memory Management Units (MMUs) are commonly used in microprocessors to provide virtual memory capability. When virtual memory is enabled, software executing on the processor only sees and uses Virtual Addresses (VA). The MMU converts each VA into a Physical Address (PA) that can then be used inside and outside the processor. Virtual memory has a number of advantages: it can give the illusion of more memory than is actually available, it can give access to a physical memory system that has more address bits than the software supports, and it can protect physical memory with varying access rights.
Some modern systems that support virtualization have two levels of translation between VAs and PAs. The first level is similar to that found on a non-virtualized system, but the address it produces is not the final PA. It may be called an Intermediate Physical Address (IPA) or Guest Physical Address (GPA). The second level maps that intermediate address into a final PA. In these systems, for any software running on the processor, the first level, the second level, or both may be enabled.
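The two-level scheme can be sketched as a composition of two lookups. This is a minimal illustration rather than any particular architecture's mechanism: it assumes 4 KiB translation granules, models each level as a flat dictionary from page number to frame number, and treats a disabled level as an identity mapping. The names `translate_stage` and `translate` are hypothetical.

```python
PAGE_SHIFT = 12                      # assumed 4 KiB translation granule
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate_stage(table, addr):
    """Translate one level; `table` is None when that level is disabled."""
    if table is None:
        return addr                  # disabled level passes the address through
    frame = table[addr >> PAGE_SHIFT]
    return (frame << PAGE_SHIFT) | (addr & PAGE_MASK)


def translate(va, stage1=None, stage2=None):
    """VA -> IPA (first level) -> PA (second level)."""
    ipa = translate_stage(stage1, va)
    return translate_stage(stage2, ipa)
```

With both tables supplied, the output of the first level is only an intermediate address; with one table omitted, the remaining level alone determines the final PA.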
In general, the virtual address space is divided into pages. Pages are commonly a few kilobytes, though other page sizes can be used. Systems often support multiple page sizes, from a few kilobytes to a few megabytes or even gigabytes, to increase translation efficiency. All addresses within a page are translated in the same fashion and share the same access rights. The translation between VAs and PAs is done through an (often multi-level) page table. The process of going through the page table to translate a VA into a PA is often called walking, as it comprises a sequence of table lookups.
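As a rough sketch of such a walk, the following splits a VA into per-level table indices and a page offset, then performs one lookup per level. The parameters (4 KiB pages, 9 index bits per level, two levels) and the names `split_va`, `walk`, and `TranslationFault` are assumptions chosen for illustration; tables are modeled as nested dictionaries rather than memory-resident structures.

```python
PAGE_SHIFT = 12     # assumed 4 KiB pages
INDEX_BITS = 9      # assumed 512 entries per table
LEVELS = 2          # assumed two-level page table


class TranslationFault(Exception):
    """Raised when a table entry needed by the walk is missing."""


def split_va(va):
    """Return ([index per level, top level first], page offset)."""
    offset = va & ((1 << PAGE_SHIFT) - 1)
    vpn = va >> PAGE_SHIFT
    mask = (1 << INDEX_BITS) - 1
    indices = [(vpn >> (lvl * INDEX_BITS)) & mask
               for lvl in reversed(range(LEVELS))]
    return indices, offset


def walk(root, va):
    """Walk the tables for `va`; return (pa, access_rights)."""
    indices, offset = split_va(va)
    node = root
    for idx in indices[:-1]:             # intermediate levels point to tables
        node = node.get(idx)
        if node is None:
            raise TranslationFault(hex(va))
    leaf = node.get(indices[-1])         # last level holds (frame, rights)
    if leaf is None:
        raise TranslationFault(hex(va))
    frame, rights = leaf
    return (frame << PAGE_SHIFT) | offset, rights
```

For example, with `root = {0: {1: (0x80, "rw")}}`, the addresses `0x1234` and `0x1FF0` fall in the same page, so both map to frame `0x80` with the same access rights and differ only in their offset.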
The MMU often comprises two parts. The first part is called the Translation Look-aside Buffer (TLB). It caches translations so that they are very quickly accessible to the processor; for translations that are cached, the processor can execute with little delay. The second part is the walker, which walks the page tables when the TLB does not contain a translation. In some systems, there may be more caching between the TLB and the walker. For instance, the TLB may have two levels of caching, and the walker may itself contain a cache.
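The division of labor between TLB and walker can be sketched as a small cache in front of a slower lookup. The sketch below is illustrative only: it assumes 4 KiB pages, a flat page table, and least-recently-used replacement (real TLBs use various replacement policies). The class names `Walker` and `TLB` and the `walks` counter are hypothetical.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # assumed 4 KiB pages


class Walker:
    """Stands in for the page table walker; counts how often it is invoked."""

    def __init__(self, page_table):
        self.page_table = page_table   # flat dict: page number -> frame number
        self.walks = 0

    def walk(self, vpn):
        self.walks += 1
        return self.page_table[vpn]


class TLB:
    """Small translation cache with LRU replacement (an assumed policy)."""

    def __init__(self, walker, capacity=2):
        self.walker = walker
        self.capacity = capacity
        self.entries = OrderedDict()   # page number -> frame number

    def translate(self, va):
        vpn = va >> PAGE_SHIFT
        offset = va & ((1 << PAGE_SHIFT) - 1)
        if vpn in self.entries:                  # hit: no walk needed
            self.entries.move_to_end(vpn)
        else:                                    # miss: ask the walker
            self.entries[vpn] = self.walker.walk(vpn)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False) # evict least recently used
        return (self.entries[vpn] << PAGE_SHIFT) | offset
```

Repeated accesses to a cached page leave `walker.walks` unchanged, while an access to a page that has been evicted triggers a fresh walk.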
A System MMU (SMMU) mirrors the use of the MMU, but applied to I/O devices instead of microprocessors. With an SMMU, I/O devices can also take advantage of the benefits of virtual memory and virtualization. Like an MMU, an SMMU operates on pages and uses page tables to calculate translations. In some cases, an SMMU may use the same page table formats as the MMU of a processor to which the SMMU's I/O device is connected. In that case, the page tables may be shared between the MMU and the SMMU.
Like an MMU, an SMMU often consists of a walker and a System TLB (STLB), which acts as a cache for translations to help the I/O device sustain peak performance. In some cases, multiple STLBs can share a single walker for efficiency reasons.
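A shared-walker arrangement might be modeled as follows. This is a simplified sketch under stated assumptions: 4 KiB pages, a flat page table, and unbounded per-STLB caches for brevity. The names `SharedWalker`, `STLB`, `dma_stlb`, and `gpu_stlb` are hypothetical and do not refer to any specific design.

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages


class SharedWalker:
    """One walker serving several STLBs; counts the walks it performs."""

    def __init__(self, page_table):
        self.page_table = page_table   # flat dict: page number -> frame number
        self.walks = 0

    def walk(self, vpn):
        self.walks += 1
        return self.page_table[vpn]


class STLB:
    """Per-device translation cache that forwards misses to the walker."""

    def __init__(self, walker):
        self.walker = walker
        self.cache = {}                # unbounded here for brevity

    def translate(self, va):
        vpn = va >> PAGE_SHIFT
        if vpn not in self.cache:
            self.cache[vpn] = self.walker.walk(vpn)
        return (self.cache[vpn] << PAGE_SHIFT) | (va & ((1 << PAGE_SHIFT) - 1))


# Two hypothetical I/O devices, each with its own STLB, sharing one walker.
walker = SharedWalker({0x10: 0x200})
dma_stlb, gpu_stlb = STLB(walker), STLB(walker)
```

In this model, two STLBs that touch the same page each trigger their own walk, because neither can see the entry cached by the other; only repeated accesses through the same STLB avoid the walker.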
In most cases, TLBs inside processors are tightly integrated with the processor because physical addresses are needed inside the processor (e.g. for caches that may be visible to cache coherency). In contrast, an STLB does not have to be integrated inside an I/O device. It may be placed outside the I/O device without any negative impact. In many cases, multiple I/O devices may share a single STLB. An STLB just needs to be between the source and the destination of a request to provide translation services. In designs that have an interconnect, STLBs may be placed between the I/O devices and the interconnect or within the interconnect.
Each STLB has a target side interface that makes memory requests using a protocol. Different I/O devices require different protocols. This makes the design of different STLBs inconsistent and therefore more complex. Address decoding is performed both in the STLB and in the interconnect, which introduces redundant logic that uses silicon area and limits operating speed. The interface protocol used to transport requests from STLBs to their walkers is different from the protocol used to transport requests from initiators to targets within the interconnect. This increases the complexity of verification and system level modeling. Furthermore, when using subsystem interconnects to integrate separately designed logic blocks, there is no way to transfer translation information and translation prefetch requests from initiators to TLBs through an interconnect. Further still, multiple STLBs accessing shared translations gain no benefit from the shared locality of their requests.