1. Field of the Invention
The present invention generally relates to computer systems, specifically to memory subsystems for computers, and more particularly to a method of providing efficient mappings between virtual memory and physical memory.
2. Description of the Related Art
The basic structure of a conventional computer system 10 is shown in FIG. 1. Computer system 10 may have one or more processing units, two of which 12a and 12b are depicted, which are connected to various peripheral devices, including input/output (I/O) devices 14 (such as a display monitor, keyboard, and permanent storage device), memory device 16 (such as random access memory or RAM) that is used by the processing units to carry out program instructions, and firmware 18 whose primary purpose is to seek out and load an operating system from one of the peripherals (usually the permanent storage device) whenever the computer is first turned on. Processing units 12a and 12b communicate with the peripheral devices by various means, including a generalized interconnect or bus 20. Computer system 10 may have many additional components which are not shown, such as serial, parallel and universal bus ports for connection to, e.g., modems or printers. Those skilled in the art will further appreciate that there are other components that might be used in conjunction with those shown in the block diagram of FIG. 1; a display adapter might be used to control a video display monitor, a memory controller can be used to access memory 16, etc. Also, instead of connecting I/O devices 14 directly to bus 20, they may be connected to a secondary (I/O) bus which is further connected to an I/O bridge to bus 20. The computer can have more than two processing units.
In a symmetric multi-processor (SMP) computer, all of the processing units are generally identical, that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture. A typical architecture is shown in FIG. 1. A processing unit includes a processor core 22 having a plurality of registers and execution units, which carry out program instructions in order to operate the computer. An exemplary processing unit includes the PowerPC™ processor marketed by International Business Machines Corp. The processing unit can also have one or more caches, such as an instruction cache 24 and a data cache 26, which are implemented using high speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from memory 16. These caches are referred to as “on-board” when they are integrally packaged with the processor core on a single integrated chip 28. Each cache is associated with a cache controller (not shown) that manages the transfer of data between the processor core and the cache memory.
A processing unit 12 can include additional caches, such as cache 30, which is referred to as a level 2 (L2) cache since it supports the on-board (level 1) caches 24 and 26. In other words, cache 30 acts as an intermediary between memory 16 and the on-board caches, and can store a much larger amount of information (instructions and data) than the on-board caches can, but at a longer access penalty. For example, cache 30 may be a chip having a storage capacity of 256 or 512 kilobytes, while the processor may be an IBM PowerPC™ 604-series processor having on-board caches with 64 kilobytes of total storage. Cache 30 is connected to bus 20, and all loading of information from memory 16 into processor core 22 usually comes through cache 30. Although FIG. 1 depicts only a two-level cache hierarchy, multi-level cache hierarchies can be provided where there are many levels of interconnected caches. The main memory can further be distributed among many processor clusters in a non-uniform memory access (NUMA) architecture.
A cache has many blocks which individually store the various program instruction and operand data values. The blocks in any cache are divided into groups of blocks called sets or congruence classes. A set is the collection of cache blocks that a given memory block can reside in. For any given memory block, there is a unique set in the cache that the block can be mapped into, according to preset mapping functions. The number of blocks in a set is referred to as the associativity of the cache, e.g., 2-way set associative means that for any given memory block there are two blocks in the cache that the memory block can be mapped into; however, several different blocks in main memory can be mapped to any given set. A 1-way set associative cache is direct mapped, that is, there is only one cache block that can contain a particular memory block. A cache is said to be fully associative if a memory block can occupy any cache block, i.e., there is one congruence class, and the address tag is the full address of the memory block.
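The relationship between associativity, congruence classes, and address tags can be illustrated with a minimal sketch. It assumes a simple modulo mapping function and hypothetical cache dimensions (a 32 KB cache with 64-byte blocks); the text above says only that the mapping functions are preset, so the function names and the mapping itself are illustrative assumptions.

```python
def num_congruence_classes(cache_bytes, block_bytes, ways):
    """Number of sets: total blocks divided by the associativity."""
    return (cache_bytes // block_bytes) // ways


def locate(address, block_bytes, num_sets):
    """Split an address into (tag, set_index, block_offset).

    Assumes a modulo mapping from memory block number to set; real
    caches may use other preset mapping functions.
    """
    block_number, offset = divmod(address, block_bytes)
    tag, set_index = divmod(block_number, num_sets)
    return tag, set_index, offset


# Hypothetical 32 KB cache with 64-byte blocks (512 blocks in total).
two_way = num_congruence_classes(32 * 1024, 64, ways=2)          # 256 sets
direct_mapped = num_congruence_classes(32 * 1024, 64, ways=1)    # 512 sets
fully_assoc = num_congruence_classes(32 * 1024, 64, ways=512)    # 1 set
```

With a single congruence class the set index is always zero and the tag equals the whole block number, which corresponds to the fully associative case described above.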
An exemplary cache line (block) includes an address tag field, a state bit field, an inclusivity bit field, and a value field for storing the program instruction or operand data. The state bit and inclusivity bit fields are used to maintain cache coherency in a multiprocessor computer system (to indicate the validity of the value stored in the cache). For example, a coherency state can be used to indicate that a cache line is valid but not necessarily consistent with main memory, i.e., when a process has written a value to that cache line but the value has not yet migrated down the memory hierarchy to “system” memory (such a cache line is referred to as “dirty”). The address tag is a subset of the full address of the corresponding memory block. A compare match of an incoming address with one of the tags within the address tag field indicates a cache “hit.” The collection of all of the address tags in a cache (and sometimes the state bit and inclusivity bit fields) is referred to as a directory, and the collection of all of the value fields is the cache entry array.
When all of the blocks in a congruence class for a given cache are full and that cache receives a request, whether a read or write, to a memory location that maps into the full congruence class, the cache must evict one of the blocks currently in the class. The cache chooses the block to be evicted by one of a number of means known to those skilled in the art (least recently used (LRU), random, pseudo-LRU, etc.). If the data in the chosen block has been modified, that data is written to the next lower level in the memory hierarchy, which may be another cache (in the case of the L1 or on-board cache) or main memory (in the case of an L2 cache, as depicted in the two-level architecture of FIG. 1). By the principle of inclusion, the lower level of the hierarchy will already have a block available to hold the written modified data. However, if the data in the chosen block has not been modified, the block is simply abandoned and not written to the next lower level in the hierarchy. This process of removing a block from one level of the hierarchy is known as an eviction. At the end of this process, the cache no longer holds a copy of the evicted block.
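The eviction behavior described above can be sketched for a single congruence class. The sketch assumes an LRU replacement policy (one of the several policies mentioned) and models only tags and a modified flag; the class and method names are hypothetical and do not correspond to any particular hardware.

```python
from collections import OrderedDict


class CongruenceClass:
    """One set of a cache, tracking tags and a modified (dirty) flag."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> dirty flag, least recent first

    def access(self, tag, is_write):
        """Touch a block; return (evicted_tag, written_back) or None."""
        victim = None
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # hit: mark most recently used
        else:
            if len(self.blocks) == self.ways:
                # Set is full: evict the least recently used block. A dirty
                # victim must be written to the next lower level; a clean
                # victim is simply abandoned.
                victim = self.blocks.popitem(last=False)
            self.blocks[tag] = False          # newly loaded block is clean
        if is_write:
            self.blocks[tag] = True           # a write makes the block dirty
        return victim
```

For a 2-way set, writing tag 1, reading tag 2, and then reading tag 3 evicts tag 1 with a write-back, because tag 1 is both the least recently used block and modified.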
Memory is utilized by program applications as illustrated in FIG. 2. A program application is compiled using relative memory addresses, referred to as virtual memory, that correspond to locations in physical memory. For example, two processes 32a, 32b might each utilize virtual memory addresses from zero to four gigabytes (GB), but these virtual addresses map to different physical addresses in memory 16, which may provide a total physical memory of 64 GB. The memory for a process is divided into multiple “pages” to make more efficient use of physical memory, since a process usually does not need access to all virtual memory space at any given time. Virtual memory mapping allows multiple processes to share a given physical page of memory, as indicated at physical page 34. The virtual-to-physical memory mapping is handled in processor core 22 by providing a translation lookaside buffer (TLB) 29 whose entries keep track of current virtual-to-physical address assignments.
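A minimal sketch of this translation follows. It assumes a 4 KB page size and models the page table and the TLB as plain dictionaries keyed by virtual page number; these structures, the frame numbers, and the function name are illustrative assumptions rather than the organization of TLB 29.

```python
PAGE_SIZE = 4 * 1024  # assumed page size for illustration


def translate(vaddr, page_table, tlb):
    """Return (physical address, tlb_hit) for a virtual address."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    hit = vpn in tlb
    if not hit:
        tlb[vpn] = page_table[vpn]  # TLB miss: refill from the page table
    return tlb[vpn] * PAGE_SIZE + offset, hit


# Two hypothetical processes: the same virtual page 0 maps to different
# physical frames, while virtual page 1 is shared (frame 5 in both tables).
table_a = {0: 7, 1: 5}
table_b = {0: 9, 1: 5}
```

Translating the same virtual address through `table_a` and `table_b` yields different physical addresses for page 0 and the same physical address for page 1, mirroring the private and shared pages described above.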
Many computer architectures support multiple virtual page sizes. Large pages (larger than the smallest virtual page size) are sometimes referred to as “superpages,” and these virtual superpages map to similarly sized “super” physical pages. Thus, in a system where the base page size is 4 kilobytes (KB), a virtual superpage of 64 KB maps to 16 contiguous 4-KB physical pages making up 64 KB. Superpages typically have alignment requirements in both the virtual and physical address spaces. Thus a 64 KB virtual superpage would typically have to be aligned on a 64 KB boundary. Similarly, the 64 KB physical superpage to which it maps would also have to be aligned on a 64 KB boundary. However, the use of superpages gives rise to a tradeoff. While superpages can improve the TLB hit rate by reducing the number of entries that need to be concurrently maintained in the TLB, they can also lead to underutilization of the physical memory if the application does not use the entire superpage.
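The arithmetic behind the 64 KB example and the tradeoff can be sketched as follows. The helper names are hypothetical, and the underutilization measure (base pages of a superpage that the application never touches) is an illustrative assumption.

```python
BASE_PAGE = 4 * 1024     # 4 KB base page, as in the example above
SUPERPAGE = 64 * 1024    # 64 KB superpage, as in the example above


def base_pages_per_superpage(superpage=SUPERPAGE, base=BASE_PAGE):
    """Number of contiguous base pages backing one superpage."""
    return superpage // base


def can_map_superpage(vaddr, paddr, size=SUPERPAGE):
    """Both the virtual and the physical address must be size-aligned."""
    return vaddr % size == 0 and paddr % size == 0


def unused_bytes(touched_base_pages, superpage=SUPERPAGE, base=BASE_PAGE):
    """Physical memory left idle when only some base pages are used."""
    return (base_pages_per_superpage(superpage, base) - touched_base_pages) * base
```

One superpage entry stands in for 16 base-page entries in the TLB, but an application that touches only 3 of those base pages leaves 52 KB of the physical superpage unused.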
This tradeoff can be resolved by providing the ability to dynamically vary the superpage size based on an application's needs. The notion of dynamically varying the superpage size based on application execution characteristics is known in the art; see, e.g., the article entitled “Reducing TLB and Memory Overhead Using Online Superpage Promotion,” by Romer et al. (Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995). That solution resorts to software-directed memory copying in order to make a contiguous set of physical pages hold the application data. This approach, however, still has drawbacks. When the operating system (OS) determines that two or more (super)pages have to be coalesced into a larger superpage, it first sets aside a sufficiently large contiguous set of physical pages to map the larger superpage, and flushes any dirty lines from the caches. It next uses the processor to copy data from the original set of physical pages to the physical pages forming the superpage. Only after the copying is completed can the superpage be formed by coalescing page table entries. During this time, the application continues to suffer from poor TLB behavior.
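The ordering of the software-directed steps can be sketched with a toy model in which physical memory is a dictionary from frame number to page contents. This is an illustrative assumption, not the implementation from the cited article: the function name and data structures are hypothetical, and the cache flush is not modeled.

```python
def promote_by_copying(page_table, memory, free_frames, first_vpn, count):
    """Coalesce `count` pages starting at `first_vpn` by copying them."""
    # Step 1: set aside an aligned, contiguous run of free physical frames.
    base = next((f for f in sorted(free_frames)
                 if f % count == 0
                 and all(f + i in free_frames for i in range(count))), None)
    if base is None:
        return None  # no suitable run exists; promotion cannot proceed
    for i in range(count):
        free_frames.discard(base + i)

    # (A real system would flush dirty cache lines here; not modeled.)

    # Step 2: the processor copies each page into the new run.
    old_frames = [page_table[first_vpn + i] for i in range(count)]
    for i, old in enumerate(old_frames):
        memory[base + i] = memory[old]

    # Step 3: only after all copying completes are the entries repointed.
    for i in range(count):
        page_table[first_vpn + i] = base + i
    for old in old_frames:
        del memory[old]
        free_frames.add(old)
    return base
```

Because the page table is updated only in the final step, every access made while the copy loop runs still goes through the original small-page mappings, which is the delay in TLB improvement noted above.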
An alternative method of handling this tradeoff uses a hardware approach, as discussed in the article entitled “Reevaluating Online Superpage Promotion with Hardware Support,” by Fang et al. (Proceedings of the Seventh International Symposium on High Performance Computer Architecture, pp. 63-72, January 2001). The approach described there, referred to as Impulse, does not actually copy the pages over into new physical locations, but instead provides an extra level of address remapping at the main memory controller. The remapped addresses are inserted into the TLB as mappings for the virtual addresses. The memory controller maintains its own page tables for these “shadow” memory mappings. Impulse also has its drawbacks. First, the physical superpage is not contiguous. Second, the size of the memory controller page tables limits the availability of remapped superpages, and the mapping table can quickly grow. In a typical 4 GB application with 4 KB pages, the mapping table could easily require more than one million entries. Finally, this lookup procedure is on the critical path to memory access. These limitations are exacerbated in NUMA systems having multiple memory controllers.
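The entry count quoted above follows from dividing the application footprint by the page size, with one mapping entry assumed per remapped page. A short check, using a hypothetical helper name:

```python
def mapping_entries(app_bytes, page_bytes):
    """Entries needed to remap every page of the application (rounded up)."""
    return -(-app_bytes // page_bytes)  # ceiling division


GB = 2 ** 30
KB = 2 ** 10

entries_4k = mapping_entries(4 * GB, 4 * KB)    # 1,048,576 entries (2**20)
entries_64k = mapping_entries(4 * GB, 64 * KB)  # 65,536 entries (2**16)
```

At 4 KB granularity the table needs 2^20 entries, slightly more than one million, which illustrates why the size of the controller's tables constrains how many superpages can be remapped.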
In light of the foregoing, it would be desirable to devise an improved method of superpage coalescing which is not limited by the size of hardware tables, but still allows the application TLB behavior to improve immediately, i.e., without waiting on software-directed copying. It would be further advantageous if the method were easily adapted for use with NUMA systems.