Memory Management Units (MMUs) are commonly used in microprocessors to provide virtual memory capability. When virtual memory is enabled, software executing on the processor only sees and uses Virtual Addresses (VA). The MMU converts each VA into a Physical Address (PA) that can then be used inside and outside the processor. Virtual memory has a number of advantages, including giving the illusion of more memory than is actually available, giving access to a physical memory system that has more address bits than are supported by the software, and protecting the physical memory with varying access rights.
Some modern systems that support virtualization have two levels of translation between VAs and PAs. The first level is similar to that found on a non-virtualized system, but the address it produces is not the final PA. It may be called an Intermediate Physical Address (IPA) or a Guest Physical Address (GPA). The second level maps that intermediate address into a final PA. In these systems, for any software running on the processor, the first level, the second level, or both may be enabled.
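The two levels of translation can be sketched as follows. This is a minimal illustration, not a description of any particular architecture: the 4 KiB page size, the flat dictionary tables, and the names `translate_stage` and `translate` are all assumptions made for the example. A table of `None` stands for a disabled level, which passes the address through unchanged.

```python
PAGE_SHIFT = 12                      # assumed 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate_stage(table, addr):
    """Map addr through one translation level.

    table maps page numbers to frame numbers (hypothetical flat format).
    table=None models a disabled level: the address passes through unchanged.
    A missing entry raises KeyError, standing in for a translation fault.
    """
    if table is None:
        return addr
    frame = table[addr >> PAGE_SHIFT]
    return (frame << PAGE_SHIFT) | (addr & PAGE_MASK)


def translate(va, stage1=None, stage2=None):
    """VA -> IPA through the first level, then IPA -> PA through the second."""
    ipa = translate_stage(stage1, va)
    return translate_stage(stage2, ipa)
```

For instance, with `stage1 = {0x1: 0x40}` and `stage2 = {0x40: 0x9A}`, `translate(0x1ABC, stage1, stage2)` passes through the IPA `0x40ABC` and yields the PA `0x9AABC`. The page offset `0xABC` is preserved by both levels.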
In general, the virtual address space is divided into pages. Pages are commonly a few kilobytes, though other page sizes can be used. Systems often support multiple page sizes, from a few kilobytes to a few megabytes or even gigabytes, to increase translation efficiency. All addresses within a page are translated in the same fashion and share the same access right information. The translation between VAs and PAs is done through an (often multi-level) page table. The process of going through the page table to translate a VA into a PA is often called walking, as it comprises a sequence of table lookups.
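A page table walk can be illustrated with a small sketch. The parameters here (4 KiB pages, 9 index bits per level, two levels) and the nested-dictionary table format are assumptions chosen for brevity, and `walk` is a hypothetical name. Real page table formats differ between architectures.

```python
PAGE_SHIFT = 12   # assumed 4 KiB pages
INDEX_BITS = 9    # assumed 9 index bits per table level
LEVELS = 2        # assumed two-level page table


def walk(root, va):
    """Walk a nested-dict page table and return (pa, rights, lookups).

    Inner levels map an index to the next-level table. The last level maps
    an index to a (frame_number, access_rights) pair that applies to the
    whole page.
    """
    offset = va & ((1 << PAGE_SHIFT) - 1)
    vpn = va >> PAGE_SHIFT
    node = root
    lookups = 0
    # Consume the virtual page number from the top-level index downwards.
    for level in reversed(range(LEVELS)):
        index = (vpn >> (level * INDEX_BITS)) & ((1 << INDEX_BITS) - 1)
        lookups += 1
        if index not in node:
            raise LookupError(f"translation fault at VA {va:#x}")
        node = node[index]
    frame, rights = node
    return (frame << PAGE_SHIFT) | offset, rights, lookups
```

With `root = {0: {1: (0x80, "r")}}`, the call `walk(root, 0x1234)` returns `(0x80234, "r", 2)`: two table lookups, one per level, and every other address in the same page receives the same frame and access rights.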
The MMU often comprises two parts. The first part is called the Translation Look-aside Buffer (TLB). It caches translations so that the processor can access them very quickly and execute with little delay whenever a translation is cached. The second part is the walker, which walks the page tables when the TLB does not contain a translation. In some systems, there may be more caching between the TLB and the walker. For instance, the TLB may have two levels of caching, and the walker may itself contain a cache.
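The split between the TLB and the walker can be sketched as a small cache in front of a slow lookup. The class below is an illustration under stated assumptions: a single-level, fully associative TLB with least-recently-used replacement and 4 KiB pages. The `TLB` class and its `walker` callable are hypothetical names, and real TLBs use a variety of organizations and replacement policies.

```python
from collections import OrderedDict

PAGE_SHIFT = 12                      # assumed 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class TLB:
    """Tiny fully associative TLB with LRU replacement (assumed policy)."""

    def __init__(self, capacity, walker):
        self.capacity = capacity
        self.walker = walker          # callable: page number -> frame number
        self.entries = OrderedDict()  # page number -> frame number
        self.hits = 0
        self.misses = 0

    def translate(self, va):
        vpn = va >> PAGE_SHIFT
        if vpn in self.entries:
            # Hit: the cached translation is used and marked most recent.
            self.hits += 1
            self.entries.move_to_end(vpn)
            frame = self.entries[vpn]
        else:
            # Miss: the walker is invoked, then the result is cached.
            self.misses += 1
            frame = self.walker(vpn)
            self.entries[vpn] = frame
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return (frame << PAGE_SHIFT) | (va & PAGE_MASK)
```

A dictionary lookup can stand in for the walker: `TLB(2, {0: 7, 1: 3}.__getitem__)` misses on the first access to each page and hits on later accesses to the same page, as long as the entry has not been evicted.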
A System MMU (SMMU) mirrors the use of the MMU, but applied to I/O devices instead of microprocessors. With an SMMU, I/O devices can also take advantage of the benefits of virtual memory and virtualization. Like an MMU, an SMMU operates on pages and uses page tables to calculate translations. In some cases, an SMMU may use the same page table formats as the MMU of a processor to which the SMMU's I/O device is connected. In that case, the page tables may be shared between the MMU and the SMMU.
Like an MMU, an SMMU often consists of a walker and a System TLB (STLB), which acts as a cache for translations to help maintain the peak performance of the I/O device. In some cases, multiple STLBs can share a single walker for efficiency reasons.
In most cases, TLBs inside processors are tightly integrated with the processor because physical addresses are needed inside the processor (e.g. for caches that may be visible to cache coherency). In contrast, an STLB does not have to be integrated inside an I/O device. It may be placed outside the I/O device without any negative impact. In many cases, multiple I/O devices may share a single STLB. An STLB just needs to be between the source and the destination of a data request to provide translation services. In designs that have an interconnect, STLBs may be placed between the I/O devices and the interconnect, or in the interconnect, close to the I/O devices.
It is common in many data processing fields to access a data set in a way that does not follow its memory organization. In particular, two-dimensional arrays are typically laid out in memory so that accesses along one of the dimensions are sequential in memory. However, accessing that same array along the other dimension requires non-sequential accesses to memory. Fields where this type of access occurs include video and image capture and display, 2D processing, as well as other fields with matrix-based data processing. When an array with two or more dimensions (for example a 2D surface) is represented in a memory organized as a linear address space, severe performance problems arise if the address space is divided into translated pages and the array dimensions are not much smaller than the page size. For an operation such as a rotation of the surface, each successive data element or atomic unit of surface data (for example a pixel) accesses a different page in either the reading or the writing step. At a minimum, this causes a flurry of STLB misses at the beginning of the surface. If the number of rows being accessed exceeds the number of mappings in the STLB cache, then every pixel in the whole surface causes an STLB miss.
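The effect can be reproduced with a small simulation. The sketch below counts misses in a hypothetical STLB for a surface traversed along its rows and along its columns. The LRU replacement policy, the 4 KiB page size, and the surface dimensions are assumptions chosen for illustration, as are the function names.

```python
from collections import OrderedDict


def count_stlb_misses(addresses, stlb_entries, page_size=4096):
    """Count misses in a fully associative LRU STLB (assumed policy)."""
    cache = OrderedDict()
    misses = 0
    for addr in addresses:
        page = addr // page_size
        if page in cache:
            cache.move_to_end(page)
        else:
            misses += 1
            cache[page] = True
            if len(cache) > stlb_entries:
                cache.popitem(last=False)  # evict least recently used page
    return misses


def row_order(width, height, bytes_per_pixel, pitch):
    """Addresses visited when walking the surface row by row."""
    return (y * pitch + x * bytes_per_pixel
            for y in range(height) for x in range(width))


def column_order(width, height, bytes_per_pixel, pitch):
    """Addresses visited when walking the surface column by column."""
    return (y * pitch + x * bytes_per_pixel
            for x in range(width) for y in range(height))
```

Consider a surface of 1024 x 64 pixels at 4 bytes per pixel, so that each row occupies exactly one 4 KiB page. With a 16-entry STLB, `count_stlb_misses(row_order(1024, 64, 4, 4096), 16)` gives 64 misses, one per page, while the column-order traversal gives 65536 misses, one per pixel, because the 64 rows exceed the 16 available mappings. Raising the STLB to 64 entries brings the column-order count back down to 64, which corresponds to the initial flurry of misses at the beginning of the surface.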