1. Field of the Invention
The present invention is directed toward digital computing systems. More particularly, it is directed to memory management units for digital processors and the like.
2. Background of the Related Art
Caches, Translation Lookaside Buffers (TLBs) and memory management units (MMUs) are ubiquitous in microprocessor design. For general information on such microprocessor structures and management schemes, see J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach (1996), Chapter 5. Generally, the speed at which a microprocessor (e.g. a CPU or processor core) operates depends on the rate at which instructions and operands can be transferred between memory and the CPU. A related issue is how efficiently the space in memory can be used. A memory management design including structures such as those listed above should be capable of allowing a system designer to address at least these issues.
Referring to the processor system illustrated in FIG. 1, cache 110 is a relatively small but fast random access memory (RAM) used to store a copy of at least a portion of main memory 130 in anticipation of future use by the CPU 120. Accordingly, the cache 110 is positioned between the CPU 120 and the main memory 130 (which can also be implemented by slower RAM) as shown in FIG. 1, to intercept calls from the CPU 120 to the main memory 130. When data from memory is needed by a program executing on CPU 120, it can quickly be retrieved from the cache 110, rather than from the slower main memory 130.
A memory management unit (MMU 140) is a hardware component that manages a virtual memory system by, for example, translating virtual addresses into physical addresses for accessing data or program memory in accordance with the needs of a program executing on a processor (see the Hennessy and Patterson reference above). Typically, the MMU is part of CPU 120. However, in some designs it is a separate chip. MMU 140 is shown separately from CPU 120 in FIG. 1 for ease of illustration.
The MMU typically includes a small amount of fast memory that holds a translation lookaside buffer (TLB) for caching a portion of the page table (which may or may not be stored in main memory 130). Attempted accesses to data in memory by programs executing on CPU 120 are sent to MMU 140 and cache 110. In return for the virtual address from the CPU, cache 110 will provide the requested data and MMU 140 will provide the corresponding physical address. In case of a cache 110 hit, the CPU 120 will use the data supplied by the cache 110. In case of a cache 110 miss, CPU 120 will use the translated physical address to retrieve the data from memory 130.
In general, caches can be indexed (address portions used to find an entry) and tagged (address portions used to compare an entry) as follows: (1) physical (i.e. translated) index, physical tag; (2) virtual (i.e. un-translated) index, virtual tag; and (3) virtual index, physical tag. A disadvantage of the physical index, physical tag scheme is that the MMU needs to perform translation before cache access can begin. In the virtual index, virtual tag scheme, although the MMU translation is not needed before cache access, the cache must be properly purged upon any change to the page table. In the virtual index, physical tag scheme, the MMU and cache access can begin in parallel, but tag comparison requires the physical (i.e. translated) address from the MMU. Although the invention described in detail herein will refer to a virtual index, physical tag scheme, those skilled in the art will be able to apply the teachings of the invention to other schemes as well.
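By way of illustration only, the virtual index, physical tag scheme can be sketched in software as follows. The sketch assumes a direct-mapped cache with 16-byte lines and 64 entries, chosen so that the index bits fall within a 4K page offset; the class name, the field widths and the `translate` callback are hypothetical and are not taken from any particular design.

```python
LINE_BITS = 4    # assumed 16-byte cache lines
INDEX_BITS = 6   # assumed 64 cache entries (direct-mapped)
PAGE_BITS = 12   # assumed 4K pages


class VIPTCache:
    """Illustrative virtually indexed, physically tagged cache."""

    def __init__(self):
        # Each entry is either None or a (physical_tag, data) pair.
        self.lines = [None] * (1 << INDEX_BITS)

    def _index(self, vaddr):
        # The index uses only untranslated (virtual) address bits.
        return (vaddr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)

    def lookup(self, vaddr, translate):
        # The cache read needs only the virtual address, so in hardware it
        # can proceed in parallel with the MMU translation.
        entry = self.lines[self._index(vaddr)]
        paddr = translate(vaddr)
        # The tag comparison must wait for the translated address.
        tag = paddr >> (LINE_BITS + INDEX_BITS)
        if entry is not None and entry[0] == tag:
            return True, entry[1]
        return False, None

    def fill(self, vaddr, paddr, data):
        tag = paddr >> (LINE_BITS + INDEX_BITS)
        self.lines[self._index(vaddr)] = (tag, data)


# Hypothetical page table: virtual page number -> physical page number.
page_table = {1: 5, 2: 6}


def translate(vaddr):
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    return (page_table[vaddr >> PAGE_BITS] << PAGE_BITS) | offset


cache = VIPTCache()
cache.fill(0x1230, translate(0x1230), "A")
hit = cache.lookup(0x1230, translate)    # same index, matching physical tag
miss = cache.lookup(0x2230, translate)   # same index, different physical tag
```

In this sketch the two virtual addresses select the same cache entry, and only the physical tag supplied by the translation distinguishes the hit from the miss.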
The TLB in MMU 140 is typically organized to hold only a single entry per cache index (e.g. a portion of the virtual address), wherein each TLB entry comprises, for example, a physical page number, permissions for access, etc. In contrast, cache 110 is typically organized into a plurality of blocks, wherein each block has a corresponding tag (e.g. a portion of the virtual address) and stores a copy of one or more contiguously addressable bytes of memory data. It should be noted that there may be separate caches for instructions and data from main memory (i.e. I-Cache and D-Cache), and correspondingly separate TLBs. However, such additional details are not shown in FIG. 1 for ease of illustration.
In addition to translating the virtual address into the corresponding physical address for the desired memory data, MMU 140 will also determine whether the page corresponding to the desired virtual address is in memory 130 or whether the page needs to be fetched from secondary storage (typically, a larger but slower memory such as a hard disk). To accomplish this, each entry in the page table (and TLB) typically includes a valid/invalid bit that distinguishes whether or not the corresponding page is in memory 130. If the program tries to access a page that is not in memory 130 as indicated by its valid/invalid bit being set to invalid, MMU 140 generates a page fault which traps to the operating system also executing on CPU 120. Typically, the operating system then chooses a page frame to replace in memory 130 based on frame usage patterns and writes its contents from memory 130 back to secondary storage. It then fetches the page that was just referenced from secondary storage and inserts it into the freed page frame in memory 130. The valid/invalid bit of the entry in the page table corresponding to the replaced page is cleared (i.e. set to invalid) and that of the newly fetched page is set to valid. It should be noted that entries in the page table (i.e. PTEs) can also include a “dirty bit” that indicates whether the page has been written to by the processor. A page that is not “dirty” can be replaced without writing it back to secondary storage.
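The page fault sequence described above may be sketched, purely as an example, by the following simulation. The class and field names are hypothetical, and a first-in, first-out choice of victim frame is assumed for brevity in place of a replacement policy based on frame usage patterns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PTE:
    valid: bool = False          # valid/invalid bit
    dirty: bool = False          # set when the page has been written to
    frame: Optional[int] = None  # page frame number when resident


class PagedMemory:
    """Illustrative model of demand paging with valid and dirty bits."""

    def __init__(self, num_pages, num_frames):
        self.page_table = [PTE() for _ in range(num_pages)]
        self.free_frames = list(range(num_frames))
        self.resident = []    # resident pages, oldest first (FIFO assumption)
        self.writebacks = []  # pages written back to secondary storage
        self.faults = 0

    def access(self, page, write=False):
        pte = self.page_table[page]
        if not pte.valid:
            self._handle_page_fault(page)  # stands in for the trap to the OS
        if write:
            pte.dirty = True
        return pte.frame

    def _handle_page_fault(self, page):
        self.faults += 1
        if self.free_frames:
            frame = self.free_frames.pop(0)
        else:
            victim = self.resident.pop(0)
            victim_pte = self.page_table[victim]
            if victim_pte.dirty:
                # Only a dirty page needs to be written back.
                self.writebacks.append(victim)
            frame = victim_pte.frame
            victim_pte.valid = False
            victim_pte.dirty = False
            victim_pte.frame = None
        pte = self.page_table[page]
        pte.valid, pte.dirty, pte.frame = True, False, frame
        self.resident.append(page)


mem = PagedMemory(num_pages=4, num_frames=2)
mem.access(0, write=True)  # fault; page 0 becomes resident and dirty
mem.access(1)              # fault; page 1 becomes resident and clean
mem.access(2)              # fault; replaces dirty page 0, which is written back
mem.access(3)              # fault; replaces clean page 1, no write-back needed
mem.access(2)              # page 2 is resident, so no fault occurs
```

In this run only page 0 is written back to secondary storage, because page 1 was never written to before it was replaced.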
FIG. 2 shows an example operation of MMU 140 in more detail. Generally, the virtual address space (the range of 2^(M+N) addresses used by the processor) is divided into pages, each of which has a size of 2^N words (ranging from a few kilobytes to several megabytes). The bottom N bits of the address designate the offset within a page and are left unchanged. The upper M address bits indicate the virtual page number. The MMU 140 contains a TLB 202 which is indexed (possibly associatively) by the virtual page number. As mentioned above, each page table entry (PTE) cached in TLB 202 gives the physical page number corresponding to the virtual one. This is combined with the page offset to give the complete physical address. It should be noted that the physical page number need not comprise the same number of bits as the virtual page number. As shown in FIG. 2, the virtual page number may also be combined with an address space identifier (ASID) as a further way to index TLB 202 in accordance with protection and access schemes that are known in the art, and will be described in more detail hereinbelow.
To further illustrate by example, with a page size of 4K (i.e., 2^12) and 16 bit addresses, the virtual address is split into a 4 bit page number and a 12 bit offset. With 4 bits for the page number, it is possible to represent 16 pages, and with 12 bits for the offset, all 4096 bytes within the page can be accessed. As set forth above, a PTE may also include information about whether the page has been written to, when it was last used (e.g. for a replacement algorithm employed by an operating system to determine which page to replace upon a page fault), what kind of processes may read from and write to it (i.e. user mode, supervisor mode, permissions, access modes, etc.), and/or whether it should be cached and how (e.g. allocate, write-back, write-through, etc.).
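The address arithmetic of this example can be checked with the following short sketch. The function names and the representation of the TLB as a dictionary keyed by an (ASID, virtual page number) pair are assumptions made for illustration only.

```python
PAGE_OFFSET_BITS = 12  # N: 4K pages (2^12)
ADDRESS_BITS = 16      # M + N, leaving M = 4 bits for the page number
OFFSET_MASK = (1 << PAGE_OFFSET_BITS) - 1


def split_virtual_address(vaddr):
    """Return (virtual page number, page offset) for a 16 bit address."""
    if not 0 <= vaddr < (1 << ADDRESS_BITS):
        raise ValueError("address does not fit in 16 bits")
    return vaddr >> PAGE_OFFSET_BITS, vaddr & OFFSET_MASK


def translate(vaddr, asid, tlb):
    """Return the physical address, or None on a TLB miss.

    `tlb` is a hypothetical dict mapping (asid, virtual page number)
    to a physical page number.
    """
    vpn, offset = split_virtual_address(vaddr)
    ppn = tlb.get((asid, vpn))
    if ppn is None:
        return None
    # The offset passes through unchanged; only the page number is replaced.
    return (ppn << PAGE_OFFSET_BITS) | offset


tlb = {(1, 0x3): 0x25}  # the physical page number may be wider than 4 bits
vpn, offset = split_virtual_address(0x3ABC)  # vpn = 0x3, offset = 0xABC
paddr = translate(0x3ABC, asid=1, tlb=tlb)   # 0x25ABC
other = translate(0x3ABC, asid=2, tlb=tlb)   # None: no entry for this ASID
```

Here the same virtual address translates for one address space identifier and misses for another, which is the effect of combining the ASID with the virtual page number.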
As is known, there are different types of caches, ranging from direct-mapped caches, where a block can appear in only one place in the cache, to fully-associative caches where a block can appear in any place in the cache. In between these extremes is another type of cache called a multi-Way set-associative cache wherein two or more concurrently addressable RAMs can cache a plurality of entries for a single cache index. That is, in a conventional N-Way set-associative cache, the single cache index is used to concurrently access a plurality of entries in a set of N RAMs. The number of RAMs in the set indicates the number of Ways for the cache. For example, if the cache index is used to concurrently address entries stored in two RAMs, the cache is referred to as a two-Way set-associative cache. Although not shown in detail in FIG. 2, TLB 202 can also be implemented using a range from direct-mapped to fully associative types of caches.
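As a non-limiting sketch, an N-Way set-associative lookup may be modeled with one list per Way standing in for each RAM. The class name and the use of block addresses (rather than byte addresses) are assumptions made to keep the example short.

```python
class SetAssociativeCache:
    """Illustrative N-Way set-associative cache addressed by block number."""

    def __init__(self, num_ways, num_sets):
        self.num_sets = num_sets
        # One list per Way, standing in for one RAM; entries are
        # None or (tag, data).
        self.ways = [[None] * num_sets for _ in range(num_ways)]

    def _split(self, block_address):
        # Low-order part selects the set (cache index), the rest is the tag.
        return block_address % self.num_sets, block_address // self.num_sets

    def lookup(self, block_address):
        index, tag = self._split(block_address)
        # The same index addresses every Way; hardware reads them concurrently.
        for way, ram in enumerate(self.ways):
            entry = ram[index]
            if entry is not None and entry[0] == tag:
                return way, entry[1]
        return None

    def fill(self, block_address, data, way):
        index, tag = self._split(block_address)
        self.ways[way][index] = (tag, data)


two_way = SetAssociativeCache(num_ways=2, num_sets=4)
two_way.fill(1, "a", way=0)
two_way.fill(5, "b", way=1)   # blocks 1 and 5 share cache index 1

direct = SetAssociativeCache(num_ways=1, num_sets=4)
direct.fill(1, "a", way=0)
direct.fill(5, "b", way=0)    # block 5 displaces block 1
```

With `num_ways=1` the sketch behaves as a direct-mapped cache, and with `num_sets=1` it behaves as a fully-associative cache, which corresponds to the two extremes described above.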
In case of a cache miss (in cache 110 and/or TLB 202), a determination is made to select one of the blocks/entries for replacement. Methods of implementing a replacement strategy for data in a cache are known in cache design. Typically, the replacement of cache entries is done in a least recently used (LRU) manner, in which the least recently used block is replaced. A more flexible strategy is the not most recently used (NMRU) approach, which chooses a block among all those not most recently used for replacement. Blocks may also be selected at random for replacement. Other possible strategies include pseudo-LRU (an approximation of true-LRU that is more easily implemented in hardware), Least Recently Filled, and clock algorithms similar to those used by software for managing the replacement of pages in a page table.
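The difference between the LRU and NMRU strategies can be illustrated with the following sketch, which tracks the recency order of the Ways in a single set. The class and method names are hypothetical, and the random selection among the not most recently used Ways is only one of several possible NMRU choices.

```python
import random


class ReplacementSet:
    """Illustrative recency tracking for the Ways of one cache set."""

    def __init__(self, num_ways):
        # Way numbers ordered from least recently used (front)
        # to most recently used (back).
        self.order = list(range(num_ways))

    def touch(self, way):
        """Record an access to the given Way."""
        self.order.remove(way)
        self.order.append(way)

    def lru_victim(self):
        """LRU: always replace the least recently used Way."""
        return self.order[0]

    def nmru_victim(self, rng=random):
        """NMRU: replace any Way other than the most recently used one."""
        if len(self.order) == 1:
            return self.order[0]
        return rng.choice(self.order[:-1])


ways = ReplacementSet(num_ways=4)
ways.touch(0)
ways.touch(2)                     # recency order is now [1, 3, 0, 2]
lru = ways.lru_victim()           # 1
nmru = ways.nmru_victim(random.Random(0))  # any of 1, 3 or 0, never 2
```

NMRU needs to remember only the most recently used Way, whereas true LRU must maintain the full ordering, which is one reason approximations such as pseudo-LRU are attractive in hardware.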
Although the above described features of MMU 140 are valuable in microprocessor design, they face some limits in terms of functionality. That is, being a hardware component, once an MMU is designed, its functionality is fixed and associated circuitry implements that design. It would be desirable if the same basic MMU design were configurable so that in a configurable processor system, for example, the MMU could be configured along with the rest of the processor circuitry. For example, it would be desirable to configure an MMU according to such design parameters as page size, associativity, number of ways in the TLB, the number and types of bits for protection schemes and access modes, and independent design of ITLBs and DTLBs based on a common set of parameters.
U.S. patent application Ser. No. 09/246,047 (TEN-001), commonly owned by the present assignee, the contents of which are incorporated herein by reference, dramatically advanced the state of the art of configurable processors. The system described in that application includes a user interface and a build system for defining a configurable and extensible processor based on user selections, complete with software development tools for creating and debugging software for executing on the defined processor.
Although the above invention allows many aspects of the processor to be configured to the user's specifications, the MMU of the processor cannot be directly configured. Such configurability could allow the MMU to provide its services at a cost more directly proportional to the needs of the system. For example, many embedded processor systems use a static memory map known at system design time. Using a run-time programmable MMU in these systems is wasteful in gates and power. Conversely, more general-purpose processor systems require run-time programmability because a diverse set of applications run on these systems and no one static choice could satisfy all of their requirements. It would be desirable to allow the system designer to configure MMUs having run-time programmability that spans the range from completely static (i.e. more suitable for typical embedded processor systems) to completely dynamic (i.e. more suitable for typical general-purpose processor systems). For example, MIPS and x86 are general purpose processors that have MMUs with fixed numbers of TLB entries and fixed features such as demand paging, which features could be wasteful for embedded applications. In addition, it would be desirable if the processor's MMU could support such options as variable page sizes, multiple protection and sharing rings, demand paging, and hardware TLB refill.
One way to provide a configurable MMU in a processor such as that made possible by the above-identified application would be to separately generate it as a Verilog or VHDL module. With such a module, the processor would hand the MMU a virtual address, and the MMU would hand back a physical address and the access modes. This would require, however, designing an efficient way to do the translation, i.e., what circuits to use, and also understanding all the implications of fitting this module into the processor pipeline, how to raise exceptions, and so forth.
One possible way around this would be to allow system designers to specify their translation in the TIE language developed by Tensilica, Inc. of Santa Clara, Calif., and then a TIE compiler-like tool could integrate it with the base processor design. It would be more desirable, however, to provide a more purely configurable approach. It would be further desirable to identify a set of configuration parameters that could specify almost everything one might want to do with an MMU. Because configurability is simpler than extensibility, it should be used whenever it does the job. The parameters used in the MMU should be portable to a wide variety of implementations. Thus, software using a particular MMU configuration could run on a variety of processor implementations or generations.