A complex operating system, OS, that runs on a central processing unit, CPU, may perform memory allocation from a central memory unit, which may be in the form of dynamic random access memory, DRAM. As time goes on after an OS has booted, the memory may become more and more fragmented, to a point where the OS may have difficulty allocating, or may even be unable to allocate, large contiguous memory buffers, despite the fact that enough memory is available in total.
There have been attempts at correcting this memory fragmentation. For some time, memory management units, MMUs, have been implemented that allow an address translation from a virtual address, VA, space that is contiguous to a physical address, PA, space that is fragmented. The VA seen by the CPU is contiguous in the virtual address space, while the PA in the physical memory is fragmented. The process of translating from one domain, e.g. the VA domain, to another domain, e.g. the PA domain, with a direct translation is called an MMU-Stage.
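By way of illustration only, a single MMU-Stage may be sketched as a per-page lookup. The page size of 4096 bytes, the table contents, and the names `SINGLE_STAGE_TABLE` and `translate` are assumptions chosen for the example and are not taken from any particular MMU.

```python
PAGE_SIZE = 4096  # assumed page size; the text does not specify one

# Hypothetical page table: virtual page number -> physical page number.
# Virtual pages 0..3 are contiguous, while their physical pages are scattered.
SINGLE_STAGE_TABLE = {0: 7, 1: 2, 2: 9, 3: 4}


def translate(va, page_table=SINGLE_STAGE_TABLE):
    """Translate a virtual address to a physical address in one stage."""
    vpn, offset = divmod(va, PAGE_SIZE)
    if vpn not in page_table:
        raise KeyError(f"no mapping for virtual page {vpn}")
    # The page number is replaced; the offset within the page is kept.
    return page_table[vpn] * PAGE_SIZE + offset
```

In this sketch, a buffer spanning virtual pages 0 to 3 appears contiguous to the requester even though its physical pages 7, 2, 9 and 4 are not adjacent.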
To support virtualization, a double address translation, i.e. a two-stage translation, may be used. A first stage translation translates from the VA used by the CPU to a first address space, the intermediate physical address space, IPA, which may be fragmented. A second stage translation may then be performed that translates from the IPA to the final PA, which is also fragmented as discussed above. For example, in a personal computer, PC, the VA may be used by the local subsystems, such as the display, the graphics card, etc., while the IPA may be used by an OS. In case the PC is to run multiple operating systems, each OS might see its own IPA, while the PA may be the address at which the data is truly stored in the physical DRAM chips.
The rationale for using a two-stage address translation is that the OS on the CPU directly controls its own MMU, the first stage, and assumes that the IPA is the final address, while virtualization software controls the second stage of the MMU. An evolved system could contain several, i.e. more than two, consecutive OS virtualization levels, with multiple consecutive stages of MMUs.
FIG. 1 illustrates schematically, in terms of functional blocks, a two-stage MMU. A first stage 104 translates the VA, as seen, used or generated by a CPU or any hardware, HW, subsystem 102, into the IPA. A second stage 106 translates the resulting IPA into the final PA, which is used for the actual lookup in the physical memory 108.
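The chaining of the two stages may be sketched as the composition of two independent lookups. This is a minimal sketch under assumed values: the page size, the table contents, and the names `make_stage`, `stage_1` and `stage_2` are hypothetical and only mirror the roles of stages 104 and 106.

```python
STAGE_PAGE_SIZE = 4096  # assumed page size, identical for both stages


def make_stage(page_table):
    """Build a translation function for one MMU stage from a page map."""
    def stage(addr):
        page, offset = divmod(addr, STAGE_PAGE_SIZE)
        return page_table[page] * STAGE_PAGE_SIZE + offset
    return stage


# First stage (role of 104): VA page -> IPA page, controlled by the OS.
stage_1 = make_stage({0: 5, 1: 3})
# Second stage (role of 106): IPA page -> PA page, controlled by
# the virtualization software.
stage_2 = make_stage({3: 12, 5: 8})


def translate_two_stage(va):
    """Translate VA -> IPA -> PA by applying the two stages in sequence."""
    ipa = stage_1(va)
    return stage_2(ipa)
```

Here the OS only populates the first table and treats the IPA as if it were final, while the second table may be changed without the OS being aware of it.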
In a certain implementation, each of the two stages 104, 106 is divided into three translation levels. At each level, a lookup is performed in a table to determine where a requested address is to be redirected in the next domain (i.e. the next domain being PA for a single-stage system and the next domain being IPA for a double-stage system). The first level translates the most significant bits, MSB, of the input address (i.e. the VA for the first stage and the IPA for the second stage). These bits are used in a lookup operation in a page table that resides in the DRAM, whereby a translation is obtained for the MSB. In the second level, a translation is made of the middle bits of the input address. The result of the previous lookup (i.e. level 1) is used to locate the table in the DRAM in which the middle bits are looked up. In the third level, the least significant bits, LSB, of the input address are translated.
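The three-level lookup may be sketched as a walk through nested tables, where each level consumes one field of the input address and its result selects the table used at the next level. The field widths (2, 9 and 9 index bits above a 12-bit in-page offset that is passed through unchanged) and the names `split_address` and `walk` are assumptions for the example; actual widths depend on the architecture.

```python
OFFSET_BITS = 12          # assumed in-page offset width
LEVEL_BITS = (2, 9, 9)    # assumed index widths for levels 1, 2 and 3


def split_address(addr):
    """Split an address into per-level indices (MSB first) and an offset."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    rest = addr >> OFFSET_BITS
    indices = []
    for bits in reversed(LEVEL_BITS):  # peel fields starting from the LSB side
        indices.append(rest & ((1 << bits) - 1))
        rest >>= bits
    indices.reverse()
    return indices, offset


def walk(root, addr):
    """Walk nested tables; each step stands for one table access in DRAM."""
    indices, offset = split_address(addr)
    node = root
    for index in indices:
        node = node[index]  # result of this level selects the next table
    # After the last level, node is the output page number.
    return (node << OFFSET_BITS) | offset
```

For instance, with the hypothetical table `root = {1: {2: {3: 0x55}}}`, the input address with level indices 1, 2, 3 and offset `0xABC` resolves to `0x55ABC` after three table accesses.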
The page table may be too large to be stored locally in the MMU and may instead be stored in main memory (i.e. DRAM), in which case a local MMU cache is used. Such an MMU cache may be called a translation lookaside buffer, TLB. Accessing the TLB may result in a miss, i.e. the desired address is not present in the TLB, with an associated cost in delaying the address translation. The TLB, being a cache, then has to be refilled with address translation information, a process denoted TLB refill. In a two-stage MMU, the TLB refill for the first stage (i.e. the fetch of the first-stage translation information) is issued as a default IPA request, so that the second stage processes it as default IPA data and translates it to a PA.
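Because each first-stage table access is itself an IPA that must pass through the second stage, the cost of a refill may be estimated as follows. This is a rough sketch that assumes no translation at any level is cached and that every table access costs one memory access; the function name `worst_case_table_accesses` is hypothetical.

```python
def worst_case_table_accesses(stage1_levels, stage2_levels):
    """Count table accesses for one fully uncached two-stage translation."""
    # Each first-stage table read needs a full second-stage walk to
    # locate the table, plus the read of the first-stage table itself.
    per_stage1_level = stage2_levels + 1
    # The IPA produced by the first stage needs one more second-stage walk.
    final_ipa_walk = stage2_levels
    return stage1_levels * per_stage1_level + final_ipa_walk
```

Under these assumptions, two stages of three levels each give 3 x (3 + 1) + 3 = 15 table accesses, compared with 3 for a single three-level stage.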
Present day MMUs that are provided and implemented for complex systems with two or more stages of address translation may be constructed as consecutive concatenations of single-stage MMUs. In such MMUs, each time a requestor (CPU or any other HW subsystem) requests data through the MMU, the MMU may perform a TLB lookup to search for the VA, in order to provide the address translation value. Each time the VA is absent from the TLB (i.e. there is a TLB miss), the MMU may fetch the address translation information from the main memory, i.e. DRAM. Such an address translation fetch takes some time to return the translation data, and during this delay the initial request (by the CPU or other HW) is blocked in the MMU, so that the CPU or other HW subsystem typically remains idle.
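The lookup, miss and refill behaviour described above may be sketched as a small cache in front of a page table. The least-recently-used replacement policy, the capacity, and the class name `SimpleTLB` are assumptions for the example; the text does not specify a replacement policy.

```python
from collections import OrderedDict


class SimpleTLB:
    """Toy TLB caching page translations (assumed LRU replacement)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table  # stands in for the table in DRAM
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn):
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)  # mark as most recently used
            return self.entries[vpn]
        # TLB miss: the request is blocked while the translation is
        # fetched from main memory (the slow path, i.e. TLB refill).
        self.misses += 1
        ppn = self.page_table[vpn]
        self.entries[vpn] = ppn
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return ppn
```

A usage example: with `tlb = SimpleTLB(2, {0: 10, 1: 11, 2: 12})`, looking up pages 0, 0, 1, 2, 0 in that order yields one hit and four misses, since page 0 is evicted before it is requested the third time.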
In order to avoid the CPU or other HW having to remain idle for a long period of time, a present day MMU may try to guess the next requested VA and pre-fetch from memory the translation information for this guessed next requested VA. In some cases, this may be efficient. For instance, multimedia HW systems such as video HW decoders, camera sensors and display subsystems may create VA requests at periodic addresses that are thus highly predictable.
The MMU pre-fetch mechanism can thus fairly easily predict the address of a next (and later) request, and may perform a pre-fetch from DRAM of the data used to translate that predicted request. When the actual request arrives, the translation data may already be present in the TLB, so that the translation can occur immediately, without delay.
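One simple way to predict periodic requests is a stride predictor, sketched below. This is only one possible prediction scheme, chosen for illustration; the page size, the unbounded TLB modelled as a set, and the names `predict_next` and `count_misses` are assumptions.

```python
PREFETCH_PAGE = 4096  # assumed page size


def predict_next(history):
    """Guess the next page if the last two strides are equal, else None."""
    if len(history) < 3:
        return None
    stride = history[-1] - history[-2]
    previous_stride = history[-2] - history[-3]
    return history[-1] + stride if stride == previous_stride else None


def count_misses(virtual_addresses, prefetch):
    """Count TLB misses for a request stream, with or without pre-fetch."""
    tlb = set()      # unbounded TLB, for simplicity
    history = []     # page numbers of past requests
    misses = 0
    for va in virtual_addresses:
        vpn = va // PREFETCH_PAGE
        if vpn not in tlb:
            misses += 1
            tlb.add(vpn)
        history.append(vpn)
        if prefetch:
            guess = predict_next(history)
            if guess is not None:
                tlb.add(guess)  # translation fetched ahead of the request
    return misses
```

For eight requests at consecutive pages, this sketch gives eight misses without pre-fetch and three with pre-fetch, since the stride is only established after the third request.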
However, yet further improvements in MMUs are desired.