Within a computer system, the power and speed of computing are often limited by the actual physical memory (such as random access memory) that a central processing unit (CPU) can address. Generally speaking, a CPU can address many orders of magnitude more memory locations than actually exist in RAM. Thus, the virtual address space of a CPU is usually much greater than the actual memory space of a computer system. To remedy this deficiency, it is well known to use a virtual storage (often called virtual memory) technique for utilizing the full virtual address space of the CPU.
Virtual storage is the notion of storage space that may be regarded as addressable main storage by the user of a computer system in which virtual addresses are mapped into real addresses in physical memory. The size of virtual storage is limited by the addressing scheme of the computer system (such as of the CPU) and by the amount of auxiliary storage available and not by the actual number of real storage locations. Simply put, virtual addresses generated by a CPU may actually refer to storage locations outside of the internal physical memory of the computer, such as on a hard disk.
FIG. 1 illustrates symbolically a virtual address translation technique 10 within a computer system. Process address space 12 is a range of virtual addresses assigned to a particular process running on a CPU while physical address space 14 is the range of actual physical memory (such as RAM) available within the computer system. In the example shown, all addresses generated by a CPU for the process are virtual addresses and need to be translated into a physical address before a data element at that address can be accessed. Process address space 12 ranges from a virtual address of 0 to a virtual address of 2^N - 1. Physical address space 14 ranges from a physical address of 0 up to 2^M - 1, where the range of virtual addresses is typically much greater than the range of physical addresses. In a computer system where a CPU has 64-bit addressing capability, N=64, while the value of M may be only 24.
In this example, the virtual address space of the process includes a section of code 16, data 18 and a stack 20. A page 22 of code 16 is mapped to a page 24 in the physical address space. Also, page 26 of data 18 is mapped to a page 28 of the physical address space. Thus, a page in the virtual address space may be mapped to any page within the physical address space, and a translation mechanism is needed to provide such mapping.
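The page mapping of FIG. 1 may be illustrated with a minimal sketch. The 8 KB page size, the page numbers, and the names PAGE_SHIFT, page_map and translate are assumptions chosen only for illustration; none of them is taken from the figure.

```python
PAGE_SHIFT = 13                  # assumed page size of 8 KB (2**13 bytes)
PAGE_SIZE = 1 << PAGE_SHIFT
OFFSET_MASK = PAGE_SIZE - 1

# Hypothetical mapping: virtual page number -> physical page number.
page_map = {
    0x0004: 0x01A,   # stands in for a page of code
    0x0310: 0x007,   # stands in for a page of data
}


def translate(vaddr):
    """Return the physical address for vaddr, or None if its page is unmapped."""
    vpn = vaddr >> PAGE_SHIFT        # virtual page number
    offset = vaddr & OFFSET_MASK     # byte offset within the page is preserved
    ppn = page_map.get(vpn)
    if ppn is None:
        return None
    return (ppn << PAGE_SHIFT) | offset
```

Only the page number is translated; the offset within the page is carried over unchanged, which is why any virtual page can be placed in any physical page.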
An increasing number of architectures provide such virtual storage support through software-managed translation look-aside buffers (TLBs). This trend is mainly due to the advent of 64-bit address spaces, which require huge amounts of physical memory to hold the address translation page tables. A TLB is a small hardware cache of the most recent virtual to physical address translations in the system and is managed by the operating system. During a TLB miss (i.e., when a mapping is not found in the TLB), the hardware traps into the kernel, which loads a new virtual to physical translation. This dynamic approach is very flexible but depends on a high cache hit ratio and on fast TLB-miss trap handling.
FIG. 2 illustrates in detail an example 50 of an entry in such a TLB. Shown is a TLB 52 within a CPU 54 that has an entry 56. TLB 52 may have any number of entries, although 64 entries are common. Also, there may be more than one TLB per CPU, for example, one TLB for instructions and one TLB for data. Entry 56, a translation table entry (TTE), provides a mapping from a virtual address to a physical address and includes a tag portion 58 and a data portion 60. The tag portion includes a process identifier 62 used to identify the process and a virtual address 64. The data portion includes the physical address 66 to which the virtual address is mapped and associated attributes 68. Attributes include useful parameters such as "protection", "supervisor/user state", "modified", "valid", etc. A valid bit 70 is used to indicate if the entry is valid. Thus, by accessing the TLB hardware cache, a CPU is able to quickly translate a virtual address into a physical address.
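The layout of entry 56 may be modeled in software as follows. This is a sketch only: a real TLB is hardware, and the class names TTE and TLB, the field names, the linear search and the round-robin replacement policy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TTE:
    # Tag portion
    context: int          # process identifier
    vpn: int              # virtual address (here, a virtual page number)
    # Data portion
    ppn: int              # physical address (here, a physical page number)
    attrs: tuple = ()     # e.g. ("protection", "modified")
    valid: bool = True    # valid bit


class TLB:
    def __init__(self, size=64):
        self.entries = [None] * size
        self.next_slot = 0           # assumed round-robin replacement

    def lookup(self, context, vpn) -> Optional[TTE]:
        """Return the matching valid entry, or None on a TLB miss."""
        for entry in self.entries:
            if (entry is not None and entry.valid
                    and entry.context == context and entry.vpn == vpn):
                return entry
        return None

    def insert(self, tte):
        self.entries[self.next_slot] = tte
        self.next_slot = (self.next_slot + 1) % len(self.entries)
```

A lookup hits only when both the process identifier and the virtual address in the tag match and the valid bit is set, so two processes using the same virtual address do not see each other's translations.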
However, a TLB is not able to hold all of the virtual addresses available to a CPU; thus, frequently the CPU cannot find a particular virtual address for a particular process in the TLB. This situation is called a "TLB miss". Typically, a TLB miss is handled in one of two ways. One technique uses hardware trap handling in order to map a virtual address when a TLB miss occurs. In hardware trap handling, the hardware uses tables in physical memory in order to look up the physical address for the given virtual address, termed a "page table walk". A number of nested segment tables and page tables are used to determine the physical address. Although this technique can be fast for 32-bit machines, it is generally not viable for 64-bit machines. With a 64-bit address space, the tables needed to implement this technique are extremely large and would take up far too much memory in the computer. Even if smaller tables are used, at least six levels of nesting of the tables may be necessary, resulting in a time-consuming look-up process. Thus, although hardware trap handling is useful for 32-bit machines, it is of limited use for machines with a larger address space (such as 64-bit machines).
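The cost of a nested table walk may be sketched as follows. The 8 KB page size, the 10 index bits per level, the use of nested dictionaries, and the function names levels_needed and walk are assumptions for illustration; actual hardware table formats differ.

```python
def levels_needed(address_bits, page_shift=13, bits_per_level=10):
    """Number of nested tables needed to cover the virtual page number."""
    vpn_bits = address_bits - page_shift
    return -(-vpn_bits // bits_per_level)      # ceiling division


def walk(root, vaddr, levels, page_shift=13, bits_per_level=10):
    """Walk nested tables; return (physical address or None, table reads)."""
    vpn = vaddr >> page_shift
    offset = vaddr & ((1 << page_shift) - 1)
    mask = (1 << bits_per_level) - 1
    node = root
    reads = 0
    for level in range(levels - 1, -1, -1):
        index = (vpn >> (level * bits_per_level)) & mask
        reads += 1                             # one memory reference per level
        node = node.get(index)
        if node is None:
            return None, reads                 # no mapping at this level
    return (node << page_shift) | offset, reads
```

Under these assumed parameters, levels_needed(32) is 2 and levels_needed(64) is 6, and each level costs one additional memory reference per walk.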
Another technique for dealing with a TLB miss is to let the operating system deal with the miss through software trap handling. Software trap handling may be implemented in many ways depending upon the system. Generally, it is desirable on a TLB miss to limit the number of memory accesses to one reference. One technique for such software trap handling uses a "translation storage buffer" (TSB) to support the TLB and is useful on larger address space machines such as 64-bit machines. A translation storage buffer is used on UltraSPARC workstations manufactured by Sun Microsystems, Inc. of Mountain View, California, running a Solaris version 2.6 operating system. A more detailed explanation of a translation storage buffer (TSB) may be found in the above-referenced patent application.
Such a TSB cache is backed by a table of all valid address mappings in the system. The TSB is a contiguous buffer in memory that is usually organized as a direct-mapped cache. The operating system uses the TSB for fast TLB-miss trap handling. The TLB miss trap handler searches the TSB for the missing address translation. If the search operation generates a hit, the address mapping is added to the TLB.
FIG. 3 illustrates symbolically one embodiment of a software trap handling technique 80. Shown is a translation look-aside buffer (TLB) 52, a translation storage buffer (TSB) 82 and a hash table 84. On a TLB miss, the operating system would next check TSB 82 to locate an address mapping. TSB 82 is an operating system data structure that caches the most recent address translations. The operating system uses the TSB for fast TLB-miss trap handling. On a TLB miss, the TLB miss trap handler searches TSB 82 for the missing address translation. If the search operation generates a hit, the address mapping is added to TLB 52. If an address mapping is not found in TSB 82, the operating system then checks hash table 84 which is a table of all valid address mappings in the system. Once the correct mapping is found, it is inserted in the TLB and into the TSB so that it may be found when needed next. Hash table 84 need not be a hash table, but may be any suitable data structure in memory. Thus, a TSB is a cache of recent address translations useful when a TLB miss occurs.
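The lookup order of FIG. 3 may be sketched as follows. The dictionary and list representations, the TSB size of 8 entries, the index function, and the "fault" result for an address that has no valid mapping are assumptions for illustration and are not taken from the figure.

```python
TSB_SIZE = 8                     # assumed; kept small for readability

tlb = {}                         # stands in for TLB 52: (context, vpn) -> ppn
tsb = [None] * TSB_SIZE          # stands in for TSB 82: (context, vpn, ppn)
hash_table = {}                  # stands in for hash table 84: all valid mappings


def tsb_index(context, vpn):
    # Assumed direct-mapped index; real systems choose their own function.
    return (context ^ vpn) % TSB_SIZE


def handle_tlb_miss(context, vpn):
    """Return (ppn, source), where source is "tsb", "hash" or "fault"."""
    slot = tsb_index(context, vpn)
    entry = tsb[slot]
    if entry is not None and entry[0] == context and entry[1] == vpn:
        tlb[(context, vpn)] = entry[2]         # TSB hit: load the TLB
        return entry[2], "tsb"
    ppn = hash_table.get((context, vpn))       # TSB miss: consult backing table
    if ppn is None:
        return None, "fault"                   # assumed handling, outside FIG. 3
    tsb[slot] = (context, vpn, ppn)            # insert into the TSB ...
    tlb[(context, vpn)] = ppn                  # ... and into the TLB
    return ppn, "hash"
```

The first miss on a given address is served from the backing table and populates the TSB, so a later miss on the same address requires only the single TSB reference.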
FIG. 4 shows in detail a TSB entry 90. As in the translation table entry (TTE) of FIG. 2, TSB entry 90 includes a process identifier 92, a virtual address 94, a physical address 96 and attributes 98. Also included is a valid bit 100 to indicate if the entry is valid.
Although a TSB is used in 64-bit machines (and is possible in 32-bit machines), there are disadvantages associated with the ways in which a TSB is used. Some systems use one global TSB for all CPUs available within a computer system. Thus, on a TLB miss for any of the CPUs within the computer system, only one TSB is used as a cache for recently used address translations. Use of one global TSB has certain drawbacks. For one, a global TSB may not have the capacity to adequately serve all processors within a computer system. That is, it may be difficult to make the global TSB large enough in memory to efficiently serve all of the processors. This decision is somewhat implementation dependent. A TSB might be given more memory at the expense of memory for users. Even with a large global TSB, context invalidation can be very costly in terms of time because the complete TSB must be walked in order to determine that an address translation is not present.
Also, with all processes on all CPUs using one TSB for caching translations, there is the problem of thrashing. Thrashing happens when different processes generate the same index into the TSB and "take one another out" in the global TSB, even though there are free slots in the TSB. Thrashing is a common problem where many processes vie for the use of one TSB. In particular, if an XOR function is used to generate an offset from a process identifier and a virtual address, then thrashing is likely to occur because the XOR function is not particularly suited for good distribution. Also, many systems are limited to a process identifier of 13 bits, which further limits the distribution of addresses in the TSB and thus leads to thrashing.
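The thrashing described above may be illustrated with a small sketch of a direct-mapped TSB indexed by an XOR of the process identifier and the virtual page number. The TSB size of 512 entries, the specific values, and the names xor_index and tsb_insert are assumptions for illustration.

```python
TSB_ENTRIES = 512                # assumed TSB size
CTX_BITS = 13                    # process identifier limited to 13 bits

tsb = [None] * TSB_ENTRIES


def xor_index(context, vpn):
    """Direct-mapped slot computed by XOR of context and virtual page number."""
    return ((context & ((1 << CTX_BITS) - 1)) ^ vpn) % TSB_ENTRIES


def tsb_insert(context, vpn):
    """Install a translation; return the evicted (context, vpn), if any."""
    slot = xor_index(context, vpn)
    evicted = tsb[slot]
    tsb[slot] = (context, vpn)
    if evicted == (context, vpn):
        return None                  # same translation, nothing displaced
    return evicted
```

In this sketch, process 1 touching page 2 and process 2 touching page 1 both index slot 3, so alternating accesses evict each other on every insertion even though the remaining 511 slots are free.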
A further disadvantage of having only one global TSB is that this TSB is fixed in size, whereas each process has different characteristics and needs. In other words, a simple process such as a shell or mail program may only require a very small TSB, whereas a large database application may require an extremely large TSB to be able to map to many more addresses. The global TSB, being of only one size, is inflexible and leads to inefficiency in address translation for these different types of processes. Inefficiency also results during a TSB flush operation when the complete TSB must be walked to invalidate entries for a process no longer in use.
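The cost of such a flush may be sketched as follows. The list representation of the TSB, the entry layout, and the function name invalidate_context are assumptions for illustration.

```python
def invalidate_context(tsb, context):
    """Clear all entries owned by context; return (visited, cleared)."""
    visited = 0
    cleared = 0
    for slot, entry in enumerate(tsb):
        visited += 1                           # every slot must be examined
        if entry is not None and entry[0] == context:
            tsb[slot] = None
            cleared += 1
    return visited, cleared
```

Because entries of a process may reside in any slot, the number of slots visited equals the size of the TSB regardless of how few entries the process actually owns.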
Aside from the global TSB approach, other systems assign a single TSB to each CPU within a multi-processor computer system. The TSB capacity problem is alleviated because many more TSBs are available for use. However, this "TSB per CPU" approach has other disadvantages. Thrashing is reduced between processes on different CPUs, but processes on the same CPU will still encounter thrashing on the TSB to which that CPU is assigned. Thus, thrashing is reduced but not eliminated. Furthermore, as in the global TSB technique, each of these TSBs is of a fixed size and is not tailorable to the needs of a particular process.
Another drawback of this technique is excessive memory usage since each CPU is allocated a private TSB. For larger systems, often the worst case must be assumed for each TSB per CPU, resulting in large amounts of space being assigned to each TSB. If a CPU is unused, or if only a small portion of the TSB is used, then space is wasted.
Also, because there is only one TSB per CPU, there may be duplicate translations for a kernel variable in different TSBs. For example, if a first CPU has a TLB miss for a kernel variable A, it places a translation in its TSB. However, if a second CPU also has a TLB miss for the same kernel variable A, it will place a duplicate translation for that kernel variable in its own TSB, resulting in duplication of effort. Another disadvantage of this technique is that a process migrating from one CPU to another must repopulate its new TSB with its address translations after each migration, resulting in more inefficiency. This is known as cold cache syndrome.
Another problem with each of the above approaches is the cost of TSB context invalidation. This cost is critical in multi-user environments where a large number of small processes are created and destroyed. Especially with a large global TSB, it can be very time consuming to search the complete TSB to invalidate one context. Likewise, with the TSB per CPU approach, each TSB may have to be searched to invalidate a context. Another drawback common to both of the above techniques is that there is no separate TSB for system uses. The global TSB, or each per-processor TSB, must be used for both user and system processes.
In addition, the embodiments described above store a TSB in physical memory. For example, a global TSB may be of a size that it can be stored in the nucleus of the kernel. In the TSB per processor embodiment, each TSB is typically of a larger size and is not stored in the kernel but rather outside of the kernel in physical memory. A greater shortage of physical memory within the computer system results from these techniques.
Therefore, a TSB implementation technique is desirable for large address space computer systems that would minimize or eliminate TSB thrashing between different processes and would reduce the cost of TSB context invalidation.