Modern operating systems often use virtual memory schemes in order to make efficient use of the physical storage available to processors. Virtual memory is well known in the art. Virtual memory is addressed by virtual addresses. The virtual address space associated with a program is conventionally divided into pages, which are blocks of contiguous virtual memory addresses. While programs may be written with reference to virtual addresses, a translation to physical addresses may be necessary for the execution of program instructions by processors. Page tables may be employed to map virtual addresses to corresponding physical addresses.
Memory management units (MMUs) are commonly used to handle translation of virtual addresses to physical addresses. MMUs may look up page table entries (PTEs), which include the virtual-to-physical address mappings, in order to handle the translation. Physical memory space may be managed by dynamically allocating and freeing blocks of the physical memory or data buffers. In this process of dynamic allocation and freeing, it is common for the free physical memory space to become fragmented, comprising non-contiguous free blocks. Thus, a contiguous range of virtual addresses may become mapped to several non-contiguous blocks of physical memory. Accordingly, the page table lookup process, also known as a "page table walk," may need to be performed frequently, as contiguous virtual addresses may not conveniently map to contiguous physical addresses. These frequent page table lookups may significantly impact performance.
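The virtual-to-physical translation described above can be sketched in software. The following is an illustrative model only, not any particular MMU design: the 4 KB page size, the single-level table, and the table contents are assumptions, and the non-contiguous physical frames mirror the fragmentation discussed above.

```python
PAGE_SHIFT = 12  # assume 4 KB pages

# Hypothetical page table: virtual page number -> physical frame number.
# Contiguous virtual pages map to non-contiguous physical frames.
page_table = {0: 7, 1: 3, 2: 9, 3: 1}

def walk(virtual_address):
    """Translate a virtual address by 'walking' the page table."""
    vpn = virtual_address >> PAGE_SHIFT            # virtual page number
    offset = virtual_address & ((1 << PAGE_SHIFT) - 1)
    pfn = page_table[vpn]                          # the table lookup (the "walk")
    return (pfn << PAGE_SHIFT) | offset            # physical frame + page offset
```

In a real MMU the walk may involve multiple levels of tables and multiple memory accesses, which is why performing it on every access is costly.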
One conventional technique to address the frequent page table lookups includes the use of a translation cache, also known as a translation lookaside buffer (TLB). A TLB may cache translations for frequently accessed pages in a tagged hardware lookup table. Thus, if a virtual address hits in a TLB, the corresponding physical address translation may be reused from the TLB, without having to incur the costs associated with a page table walk.
With reference to FIG. 1, a conventional implementation of a TLB is illustrated. As shown, TLB 100 is configured to translate a virtual address (VA) 102 to a physical address (PA) 104. The virtual addresses A, B, C, and D are stored in content addressable memory (CAM) 112. A portion of the virtual addresses form tags 110, which are compared against the virtual address 102 to determine if a translation for virtual address 102 is present in TLB 100. If there is a hit, the corresponding physical address, P, Q, R, or S, is read out of random access memory (RAM) 106, using de-multiplexor logic 108, in order to form physical address 104. A virtual address, such as A, B, C, or D, along with its corresponding physical address translation, P, Q, R, or S, may collectively be referred to as an entry in TLB 100.
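The tagged lookup of FIG. 1 can be modeled minimally as follows. This is a sketch under stated assumptions, not the hardware of FIG. 1 itself: the TLB is treated as fully associative, the CAM/RAM pair is collapsed into a single mapping, and the class and entry names (A maps to P, and so on, per the figure) are illustrative.

```python
class TLB:
    """Software model of a tagged translation cache (assumed fully associative)."""

    def __init__(self):
        # tag (virtual address / virtual page) -> physical translation,
        # standing in for the CAM 112 tags and RAM 106 payloads of FIG. 1
        self.entries = {}

    def lookup(self, virtual_tag):
        """Return the cached translation on a hit, or None on a miss."""
        return self.entries.get(virtual_tag)

tlb = TLB()
# Populate with the entries shown in FIG. 1
tlb.entries.update({"A": "P", "B": "Q", "C": "R", "D": "S"})
```

On a hit, the translation is reused directly; on a miss (`None`), a page table walk would be required.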
Entries in a TLB, such as TLB 100, may be populated in several ways. The common criteria for evaluating fills, or updates, in a TLB include: when to fill a TLB entry; where in a TLB a new entry may be placed; and how many TLB entries may be filled during each update of the TLB. Two conventional techniques are known in the art for updating or filling TLB entries.
The first technique, referred to as "software fill," meets the above criteria for filling a TLB as follows: fills are performed pursuant to a fill command initiated in software; fills are performed at addresses specified by software; and conventionally, software fill applies to a single entry in the TLB.
The second technique, referred to as "demand fill," is usually employed in MMUs for central processing units (CPUs). Demand fill meets the criteria for filling a TLB as follows: fills are performed pursuant to a miss in the TLB; fills are performed at the address which caused the miss; and demand fill conventionally applies to a single entry in the TLB, corresponding to the address which caused the miss. One advantage of demand fill is that TLB entries are only filled as needed, pursuant to a miss. However, an accompanying disadvantage is that the TLB user is stalled every time the TLB waits for the required TLB entry to be filled.
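The demand-fill behavior described above can be sketched as follows. This is an illustrative model, not a specific MMU implementation: the FIFO replacement policy, the capacity, and the function name are assumptions chosen purely for the sketch.

```python
from collections import OrderedDict

def demand_fill_lookup(tlb, page_table, vpn, capacity=4):
    """Translate vpn; on a miss, walk the page table and fill exactly one entry.

    Returns (translation, hit) so the caller can observe hits vs. misses.
    tlb is an OrderedDict used here as a fully associative cache with
    FIFO replacement (an assumed policy, for illustration only).
    """
    if vpn in tlb:                    # hit: reuse the cached translation
        return tlb[vpn], True
    pfn = page_table[vpn]             # miss: costly page-table walk (the stall)
    if len(tlb) >= capacity:
        tlb.popitem(last=False)       # evict the oldest entry if full
    tlb[vpn] = pfn                    # demand fill: one entry, at the missing address
    return pfn, False
```

Note that the single-entry fill matches the third criterion above: each miss fills only the entry for the address that missed.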
Demand fill may be an adequate solution for CPUs because programs generally include a high degree of temporal locality, i.e. the same addresses are accessed close together in time. However, streaming accesses may sometimes not exhibit temporal locality. Streaming accesses may generally comprise sets of one or more accesses which may be executed in a burst. For example, streaming accesses which may not exhibit a high degree of temporal locality may include accesses from multimedia accelerators (e.g. display engines), and accesses from CPUs during a memory copy. In contrast to reusing the same addresses, such streaming accesses lacking in temporal locality may be directed to a new address every time. Accordingly, demand fills may cause degradation in performance of TLBs. Such degradation may be particularly visible in MMUs designed for devices other than CPUs, such as System MMUs or input/output (I/O) MMUs.
With regard to accesses other than streaming accesses (referred to as "non-streaming accesses"), the size of the TLB may be increased in limited cases, to accomplish a reduction in TLB misses. However, for streaming accesses, increasing the TLB size does not accord the same benefit. In fact, even if, theoretically, the TLB size were increased towards infinity, it can be shown in several cases that the TLB performance may continue to be severely degraded by misses and accompanying stalls.
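The contrast drawn above can be illustrated with a small simulation. This is a sketch under stated assumptions (fully associative TLB, FIFO replacement, demand fill of one entry per miss; the function name and workloads are hypothetical), not a claim about any particular MMU.

```python
from collections import OrderedDict

def miss_rate(accesses, capacity):
    """Fraction of accesses that miss a demand-filled TLB of the given capacity."""
    tlb = OrderedDict()
    misses = 0
    for vpn in accesses:
        if vpn not in tlb:
            misses += 1                      # miss: stall for a page-table walk
            if len(tlb) >= capacity:
                tlb.popitem(last=False)      # FIFO eviction (assumed policy)
            tlb[vpn] = vpn                   # demand fill one entry
    return misses / len(accesses)

# A streaming workload touches a new virtual page on every access,
# so every reference misses regardless of TLB capacity.
streaming = list(range(1000))
```

Running `miss_rate(streaming, 64)` and `miss_rate(streaming, 512)` yields the same miss rate, illustrating that enlarging the TLB does not help a workload with no reuse, whereas a workload that revisits a small set of pages benefits immediately.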
Accordingly, there is a need in the art, particularly with respect to accesses without temporal locality, for TLB filling techniques which are not plagued by performance degradations caused by high miss rates.