In computer memory there is generally a trade-off between capacity and speed of access. Thus for a given cost it is possible to make a memory that either allows quick retrieval of data, or that can store a large quantity of data, but not both (naturally some intermediate compromise solution can also be achieved).
As a reflection of the above limitations, modern computer systems have usually adopted a hierarchical memory system, typically involving at least two or three layers. An example of such a configuration is shown schematically in FIG. 1. This depicts a microprocessor 10 attached to a cache 11. Note that in many systems there may actually be a two-level cache (denoted L1 and L2). The L1 cache is typically fabricated on the same semiconductor device as the microprocessor 10 itself, and so provides very fast data access for microprocessor 10, whilst the L2 cache is a separate semiconductor storage device. The capacity of the L1 cache may be limited to say 64 Kbytes, whilst the L2 cache is larger, for example a few Mbytes.
(It will be understood that all memory sizes quoted herein are illustrative of current technology, but experience has shown that available memory sizes for both semiconductor and disk memory have increased steadily over recent years. This trend seems likely to continue in the future, but it does not impact the applicability of the present invention).
The next device in the memory hierarchy of FIG. 1 is random access memory (RAM) 12. This is also referred to by a variety of other names, such as physical memory, system memory, main memory, or internal storage. RAM 12 is implemented as a semiconductor device, but is significantly slower than cache 11 (RAM 12 and processor 10 are generally connected by at least one communications bus). However RAM 12 has a correspondingly greater capacity than cache 11, for example a few hundred Mbytes.
The final component in FIG. 1 is a hard disk drive 13. This is sometimes referred to as external storage, and provides non-volatile storage for a large volume of data, frequently tens or even hundreds of gigabytes. However, data access from a disk drive is a relatively slow process. Consider for example typical figures of a disk drive access time of 1 millisecond, compared to the 1 GHz clock rate of a modern processor. Thus, if the processor 10 has to wait for data to arrive from external storage 13, this can represent a delay of one million processor cycles.
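The arithmetic behind the figure of one million cycles can be checked with a short sketch. The function name `stall_cycles` is an illustrative assumption; the access time and clock rate are the values quoted above.

```python
def stall_cycles(access_time_s: float, clock_hz: float) -> int:
    """Return the number of processor cycles that elapse during one access."""
    return round(access_time_s * clock_hz)

DISK_ACCESS_TIME_S = 1e-3  # 1 millisecond disk access time, as quoted above
CLOCK_HZ = 1e9             # 1 GHz processor clock, as quoted above

cycles = stall_cycles(DISK_ACCESS_TIME_S, CLOCK_HZ)
print(cycles)
```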
In some prior art systems, processor 10 directly addressed data in RAM 12 (hence its name as internal storage). Modern processors however operate primarily at either 32 bits or 64 bits, which gives them a potential address space that is frequently much greater than the amount of physical or system memory that is actually installed. For example on a 32-bit system the processor can nominally address up to 4 Gbytes of memory, whereas the amount of RAM may be limited to some subset of this. The discrepancy between potential address space and actual internal memory is clearly much greater for a 64-bit system.
In order to accommodate this disparity, and also to provide increased flexibility, it is common for systems to adopt what is known as a virtual memory (VM) scheme. According to such a scheme, the addresses utilised by a processor do not directly represent hardware addresses. Rather they are virtual addresses which must be translated into real (physical) addresses in order to access the desired memory location. It is generally the responsibility of the operating system to manage this translation.
As an example of the improved flexibility provided by the use of virtual addresses, consider the storage of a large array. If physical memory is already partially occupied, there may be no single region of physical memory that can accommodate the whole array; rather it must be split across two or more different portions of physical memory. The use of virtual addressing allows this segmentation to be hidden from user programs. In other words, a contiguous set of virtual addresses can be assigned to the array, even though this corresponds to a non-contiguous set of physical addresses.
In combination with virtual addressing, it is also common to use a paging structure or mechanism. This divides the entire virtual address space into portions of predefined size, known as pages. All memory operations are then performed on whole pages, which therefore maintain their integrity throughout the system, both in terms of real and virtual address space.
In order to consider this further, assume that we have virtual addresses A and B, where A is the first address on a page, and B is offset from A by an amount C, less than the page size (in other words, A and B are on the same page as each other). Let us now denote A′ as representing the real (physical) address corresponding to virtual address A, and B′ as representing the real address corresponding to virtual address B. The fact that individual pages are always manipulated en bloc implies that the physical offset of B′ from A′ is again C (i.e. the same as in virtual address space). In other words, the structure within an individual page is the same for both real and virtual memory.
This property is frequently utilised in the system addressing structure, in that the high order bits of a virtual address can represent the virtual page number (VPN), whilst the low order bits represent an offset into this page. Address translation therefore simply becomes a question of looking up the physical start location of the VPN, and using this to form the high order bits of the physical address. The offset from the virtual address can then be passed through unchanged to provide the low order bits of the physical address.
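The translation just described can be sketched as follows. This is a minimal illustration only: the 4 Kbyte page size, the dictionary `page_table`, and the function `translate` are assumptions made for the example, and a real system would perform this step in hardware and operating system structures rather than in a dictionary.

```python
PAGE_SIZE = 4096                          # assumed 4 Kbyte pages
OFFSET_BITS = PAGE_SIZE.bit_length() - 1  # 12 low order bits hold the offset
OFFSET_MASK = PAGE_SIZE - 1

# Hypothetical page table: virtual page number (VPN) -> physical page number.
# Contiguous virtual pages 0, 1, 2 map to non-contiguous physical pages.
page_table = {0: 7, 1: 3, 2: 9}

def translate(vaddr: int, table: dict) -> int:
    """Translate a virtual address into a physical address."""
    vpn = vaddr >> OFFSET_BITS       # high order bits: virtual page number
    offset = vaddr & OFFSET_MASK     # low order bits: offset into the page
    physical_page = table[vpn]       # raises KeyError if the page is unmapped
    # The offset passes through unchanged as the low order bits.
    return (physical_page << OFFSET_BITS) | offset

print(hex(translate(0x1ABC, page_table)))
```

In this sketch the virtual address 0x1ABC lies on virtual page 1 at offset 0xABC, and the offset is the same in the resulting physical address.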
As previously indicated, the virtual address space usually exceeds the amount of available physical memory, sometimes by a very large factor. Accordingly, only a certain proportion of pages of virtual addresses can actually be accommodated in RAM 12 at any one time. The remaining pages then have to be stored externally on disk drive 13. This is illustrated in FIG. 2, which shows processor 10 and also depicts schematically the range of virtual addresses 16 that are accessible to it. The virtual address space 16 is split into pages A, B, C, etc. (Note that cache 11 is omitted from FIG. 2, since it is not relevant at present to an understanding of the paging mechanism).
In order to convert any given virtual address into its corresponding real address, the operating system first identifies the page containing the virtual address. It then utilises a hardware device known as the memory management unit (MMU) 15 to identify the current real location of the corresponding physical page in RAM 12. Note that only the physical location of the start of the relevant page needs to be determined, since as explained above, the offset of any given address into a page is the same in both virtual and real address space.
Of course, at any given time, the desired (virtual) page may be out on disk 13 rather than in RAM 12. How this information is stored will depend on the details of the machine architecture. For example, the MMU 15 may simply have a flag to recognise if a page is not currently in RAM 12, with further details about the location of the page data on disk being stored on disk 13 itself. Another possibility is that page location information may be stored somewhere in RAM 12 (in other words MMU 15 is not specifically notified of page operations between RAM 12 and disk 13).
Processor 10 is unable to directly access data in external storage 13. Rather, any attempt to access data not currently in RAM 12 generates what is known as a page fault for the page containing the desired virtual address. This prompts the transfer of the relevant page from disk storage 13 back to internal memory (RAM 12) whereupon processor 10 can access it. Note that if RAM 12 is full when a new page must be added into it, then the system must first remove an existing page from RAM, writing it back to external storage if it has been modified. This then provides space for the new page to be stored within RAM 12. There are known algorithms for deciding which particular page(s) should be removed from RAM if it becomes necessary to free up space for new pages (a typical such algorithm is based on removing the least recently used (LRU) page or pages). The transfer of pages in and out of system memory according to the needs of the processor is controlled by the paging mechanism of the operating system. As used herein, the term “pageable” refers to data which is capable of being transferred in units of pages from primary storage (e.g., RAM) to secondary storage (e.g., swap space on a disk) and vice versa, or to memory which stores such data.
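The page fault and LRU replacement behaviour described above can be sketched as follows. The class `PagedMemory` and its counters are hypothetical names introduced purely for illustration; the sketch models RAM 12 as a fixed number of page slots and does not represent any particular operating system.

```python
from collections import OrderedDict

class PagedMemory:
    """Toy model of RAM holding a fixed number of pages with LRU replacement."""

    def __init__(self, num_frames: int):
        self.num_frames = num_frames
        self.resident = OrderedDict()  # page -> modified flag, least recent first
        self.page_faults = 0
        self.write_backs = 0

    def access(self, page, write: bool = False) -> None:
        if page in self.resident:
            self.resident.move_to_end(page)       # mark as most recently used
        else:
            self.page_faults += 1                 # page must come in from disk
            if len(self.resident) >= self.num_frames:
                _victim, modified = self.resident.popitem(last=False)
                if modified:
                    self.write_backs += 1         # modified page goes back to disk
            self.resident[page] = False
        if write:
            self.resident[page] = True            # page now differs from disk copy

ram = PagedMemory(num_frames=2)
ram.access("A", write=True)
ram.access("B")
ram.access("A")   # hit; B becomes the least recently used page
ram.access("C")   # fault; B is removed without a write-back
ram.access("D")   # fault; modified page A is written back and removed
print(ram.page_faults, ram.write_backs, list(ram.resident))
```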
Although FIG. 2 illustrates just a single virtual address space 16, it is common in modern systems to have a multiplicity of address spaces. This arises because in multitasking (multiprocessing) systems, each process typically has its own virtual address space. Consequently, whilst page A for process 1 may be located at system address X, page A for process 2 may be located at system address Y (where X and Y may be in either internal or external storage).
Although in general it is desirable to separate the address space of each different process to avoid one process potentially corrupting data from another process, nevertheless, there can be occasions when different processes need to share or exchange data. One way of accommodating this is to provide shared memory, in which the address spaces of two (or more) processes each refer to the same (physical) region of memory.
A further known manner in which memory is divided is termed segmentation. This generally represents a higher organisational level than pages, and tends to reflect the type of information that is to be stored in any memory segment. Thus the simplest form of division is between a code segment and a data segment. Segmentation is useful for setting properties of memory; for example a code segment may be read only (once the code has been initially loaded into it), whereas a data segment may support read-write access. Whilst such access properties can be set at the page level, it is often more convenient to manipulate them en bloc according to the type of segment that a page belongs to.
The use of segments is shown schematically in FIG. 3, which illustrates the virtual address space 300 for a process. This contains two segments, one segment for process code 301, and one segment for process data 302. The remaining portion 303 of the virtual address space 300 is currently unmapped. Each of the two segments is effectively treated as a separate block, with code segment 301 corresponding to segment 1 (341) and data segment 302 corresponding to segment 2 (342). Each segment block then comprises one or more pages 350.
The hardware enforces different memory protection policies according to the nature of the segment. For example, any attempt to execute an “instruction” within data segment 302 would lead to an error, since this is not appropriate for this type of segment (but clearly would be permissible for code segment 301).
Returning now to the operation of cache 11, this is used to store data for processor 10. In particular, selected portions of data from RAM 12 are mirrored into cache 11 in accordance with known caching strategies (again this typically involves discarding LRU data from the cache). Note that the cache does not necessarily mirror whole pages of data, but rather just selected lines of a page. However, data within the cache is aligned so as to maintain page offsets. Thus if the page size is 4 Kbytes (page size is discussed in more detail below), then the first line of a page will be stored in the cache on a 4 Kbytes boundary; likewise if a virtual address is offset by 3 Kbytes into a 4 Kbytes page, then the cache line for that virtual address will be offset by 3 Kbytes from a 4 Kbytes boundary within the cache. This approach helps to enable the consistent use throughout the system of the page offset component of a virtual address (as described above).
It will be appreciated that since processor 10 utilises virtual addresses, the translation from a virtual address to the corresponding real (physical) address must be performed as quickly as possible so as to provide the processor with fast access to the desired data. Many systems therefore provide a translation lookaside buffer (TLB), which effectively acts as a special purpose hardware cache for translation information from MMU 15. However, as with all memory systems there is a trade-off between speed and capacity, and although the TLB is fast, the number of (most recent) translations that it can hold is limited. Thus if the processor tries to access a virtual address which cannot be resolved (mapped) by an entry in the TLB, then it must go to MMU 15. If this occurs too frequently, it will have a negative impact on performance.
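The effect of limited TLB capacity can be illustrated with a small simulation. This is a sketch under stated assumptions: the class `TLB`, the LRU replacement policy, and the dictionary standing in for the slower translation path via MMU 15 are all hypothetical, and real TLBs differ in organisation and replacement policy.

```python
from collections import OrderedDict

class TLB:
    """Toy TLB caching the most recently used VPN -> physical page translations."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # vpn -> physical page, least recent first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn: int, page_table: dict) -> int:
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)
            return self.entries[vpn]
        self.misses += 1
        physical_page = page_table[vpn]        # slow path (stands in for MMU 15)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # assumed LRU replacement
        self.entries[vpn] = physical_page
        return physical_page

page_table = {vpn: vpn + 100 for vpn in range(8)}  # hypothetical mappings

fits = TLB(capacity=4)
for _ in range(2):
    for vpn in range(4):       # working set of 4 pages fits in the TLB
        fits.lookup(vpn, page_table)

overflows = TLB(capacity=4)
for _ in range(2):
    for vpn in range(8):       # working set of 8 pages overflows the TLB
        overflows.lookup(vpn, page_table)

print(fits.hits, fits.misses, overflows.hits, overflows.misses)
```

In this model, once the set of pages being cycled through exceeds the TLB capacity, every lookup misses, even though each page is revisited.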
Note that the systems described so far are exemplary only, and many variations and modifications of them are known to the skilled person, according to the particular machine architecture and operating environment being utilised. Some of the relevant design considerations are discussed in: “Computer Architecture: a designer's text based on a generic RISC” by James Feldman and Charles Retter, McGraw-Hill, 1994 (ISBN 0-07-113318-6).
Returning now to the question of TLB misses occurring because a processor has been unable to locate a desired virtual address in the TLB, it has been found that such a problem is particularly prone to arise in relation to certain very large application programs (such as an enterprise database). Thus such a program can occupy some 20 Mbytes of storage. Assuming a page size of 8 Kbytes, such a program will then extend over 2560 pages, which can easily be more than the number of lines in the TLB (a typical system may for example provide 512 lines in the TLB). There is likewise a tendency for such programs to access very large data structures, which will extend over a significant number of pages, and will again tend to overflow the TLB.
Consequently, such large applications can suffer a degradation in performance due to the delays caused by having to access portions of the page table not currently stored in the TLB (this can be regarded as poor scalability to larger applications).
One way in principle to mitigate this is to use a larger page size. This has the effect of allowing a greater region of memory to be represented by the fixed set of pages that can be stored within the TLB. In other words, the overall number of pages that need to be referenced should be reduced, and consequently, a greater proportion of their address translations can be retained in the TLB. This should in turn reduce the number of TLB misses, and improve system performance in this respect.
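The relationship between page size and TLB coverage can be worked through numerically. The sketch below uses the 20 Mbyte program, 8 Kbyte page size, and 512-line TLB mentioned above; the 4 Mbyte large page size and the helper names `pages_needed` and `tlb_reach` are assumptions made for illustration.

```python
KB = 1024
MB = 1024 * KB

def pages_needed(size_bytes: int, page_size: int) -> int:
    """Number of pages required to hold size_bytes (rounded up)."""
    return -(-size_bytes // page_size)  # ceiling division

def tlb_reach(entries: int, page_size: int) -> int:
    """Amount of memory whose translations fit in the TLB at one time."""
    return entries * page_size

PROGRAM_SIZE = 20 * MB
TLB_ENTRIES = 512

pages_small = pages_needed(PROGRAM_SIZE, 8 * KB)  # far more pages than TLB lines
pages_large = pages_needed(PROGRAM_SIZE, 4 * MB)  # comfortably within the TLB
reach_small = tlb_reach(TLB_ENTRIES, 8 * KB)      # memory covered with small pages

print(pages_small, pages_large, reach_small // MB)
```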
However, simply increasing the page size for the whole system has its own drawbacks. For example, having a large page size effectively increases the granularity of memory, which can lead to wastage and inefficiency. Thus if we have a page size as large as 4 Mbytes, then this is the minimum that can be allocated to a process, even if the process only needs a small fraction of this. There can also be delays in that any paging operation, such as writing to or reading from disk, now involves the manipulation of 4 Mbytes, compared to say 4 or 8 Kbytes if a smaller page size were used.
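The wastage caused by coarse granularity can be quantified with a short calculation. The 10 Kbyte request and the function name `wasted_bytes` are hypothetical values chosen for illustration; the page sizes are those discussed above.

```python
KB = 1024
MB = 1024 * KB

def wasted_bytes(request: int, page_size: int) -> int:
    """Bytes allocated but unused when a request is rounded up to whole pages."""
    pages = -(-request // page_size)  # ceiling division
    return pages * page_size - request

REQUEST = 10 * KB  # assumed small allocation made by a process

waste_small = wasted_bytes(REQUEST, 4 * KB)  # three 4 Kbyte pages are allocated
waste_large = wasted_bytes(REQUEST, 4 * MB)  # one 4 Mbyte page is allocated

print(waste_small, waste_large)
```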
A more subtle problem can also be caused by the use of a large page size. To understand this, let us assume that the page size matches the size of the cache, which in turn implies that an address having page offset K must always be stored at offset K within the cache. Now let us imagine a database program manipulating multiple records, each of which is aligned on a page boundary. In these circumstances, there may be one particular field of the records which is of interest to the program, and let us say that this field is at offset K from the start of the record (and hence the start of the page). Consequently, each time this field is accessed in one record and stored in the cache (at offset K), it will need to overwrite any cache entry for the same field from a different record (which would also have to be stored at offset K). This is known as a cache collision.
Note that in these circumstances the effective capacity of the cache has been greatly reduced, since although the cache can in theory store multiple lines, each access to the field at offset K in a new record will produce a cache miss. The corresponding delay to access the relevant data will lead to a degradation of processor throughput.
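The cache collision effect can be reproduced with a small simulation. This is a sketch under stated assumptions: it models a direct-mapped cache whose size equals the page size, and the sizes used (4 Kbyte cache and page, 64 byte lines, offset K of 256 bytes, 8 records) are scaled-down hypothetical values chosen only to keep the example short.

```python
def count_misses(addresses, cache_size: int, line_size: int) -> int:
    """Count misses in a direct-mapped cache for a sequence of addresses."""
    num_lines = cache_size // line_size
    lines = [None] * num_lines       # memory block currently held in each line
    misses = 0
    for addr in addresses:
        block = addr // line_size    # which line-sized block of memory
        index = block % num_lines    # the only cache line that can hold it
        if lines[index] != block:
            misses += 1
            lines[index] = block     # overwrite whatever was there before
    return misses

PAGE_SIZE = CACHE_SIZE = 4096  # assumed: page size matches cache size
LINE_SIZE = 64                 # assumed cache line size
K = 256                        # assumed offset of the field within a record
RECORDS = 8
PASSES = 4

# Records aligned on page boundaries: the field always maps to the same line.
aligned = [r * PAGE_SIZE + K for _ in range(PASSES) for r in range(RECORDS)]
# Records staggered by one extra cache line: the fields map to different lines.
staggered = [r * (PAGE_SIZE + LINE_SIZE) + K
             for _ in range(PASSES) for r in range(RECORDS)]

aligned_misses = count_misses(aligned, CACHE_SIZE, LINE_SIZE)
staggered_misses = count_misses(staggered, CACHE_SIZE, LINE_SIZE)
print(aligned_misses, staggered_misses)
```

With page-aligned records every access in this model misses, whereas with staggered records only the first pass misses, although the cache and the number of records are the same in both cases.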
It will therefore be seen that simply increasing the page size (if possible) for the whole system is an unattractive approach to the problem of TLB misses for large application programs. Rather some other strategy must be adopted to maintain the desired level of performance.