1. Technical Field
This invention generally relates to the management of virtual memory in a computer system. More particularly, the present invention relates to a method and an apparatus enabling an application executing in the user mode to efficiently communicate information about its future memory access behavior to the operating system.
2. Description of Prior Art
Many modern computer systems rely on some sort of virtual memory mechanism (VMM) which is controlled by the operating system (OS). A VMM is an effective way for a system to abstract away the details and the complexity of memory management, present to each process a separate virtual address space, and create an illusion of large and fast memory through the use of a hierarchy of memories of different sizes and operating at different speeds. A VMM and the OS are responsible for mapping physical pages to corresponding virtual pages (this mapping information is used by translation hardware) and processing page faults which occur as a result of accesses to virtual addresses that do not have valid translation to corresponding physical addresses.
Normally, a processor can only manipulate information (or simply data) located in the levels of memory hierarchy that are close to it (such as registers, caches, and main memory). A VMM is responsible for bringing in the data (from a disk or a network) closer to a processor (to main memory) upon request (on demand) and finding an appropriate place to store that data.
If no space is available at the level of memory hierarchy (such as main memory) where the information should be stored, some of the existing information must be evicted. A VMM and a paging daemon work together to find what information should be evicted. If the information, or simply data, to be replaced is an exact copy of that data stored elsewhere in a deeper level of memory hierarchy (such as a disk), it can be safely discarded and its space can be allocated to the data requested by a processor. On the other hand, if the data selected for replacement is a modified version of the data that was originally pulled from a deeper level of a hierarchy (such as a disk) or some new data generated by an application, then this modified or new data is copied into a deeper level of a memory hierarchy first before giving away its space.
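The eviction rule described above can be illustrated with a minimal sketch. The names `Page`, `BackingStore`, and `evict` are hypothetical and introduced only for illustration; they are not taken from any particular operating system. The sketch shows only the clean-versus-modified decision, not how a victim page is chosen.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    number: int
    dirty: bool  # True if modified since it was fetched or created


@dataclass
class BackingStore:
    """Hypothetical stand-in for a deeper level of the hierarchy (e.g. a disk)."""
    writes: list = field(default_factory=list)

    def write(self, page: Page) -> None:
        # Record the write-back; a real system would issue disk I/O here.
        self.writes.append(page.number)


def evict(page: Page, store: BackingStore) -> int:
    """Free the frame held by `page`, writing it back first only if modified."""
    if page.dirty:
        store.write(page)   # modified or new data must be saved before reuse
        page.dirty = False
    # A clean page is an exact copy of data stored elsewhere: just discard it.
    return page.number
```

Under these assumptions, evicting a clean page costs no I/O, while evicting a modified page always costs one write, whether or not the owning program will ever read that data again.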
Unfortunately, a VMM is typically demand-driven (reactive) and, because of that, suffers from a number of problems.
a) Pages that were previously saved to the external storage (such as a swap space on a disk) are brought into memory when they are referenced—on demand. Since the cost of finding the swapped out page on a disk, reading it, and bringing it to memory is high, a program will need to yield a CPU to another process and wait (for, potentially, thousands of processor cycles) without making any progress until the requested page arrives. Similarly, if a program faults while referencing a page corresponding to a memory-mapped-file, it will have to stall until the page becomes available, at which point the program can continue its execution. Hence, a system where programs often wait for a requested page to arrive is likely to have poor performance.
b) In many systems, pages are not evicted, swapped out to a disk, or freed until the number of free page frames in the free list, which holds all memory pages not currently in use, falls below a certain threshold. When that happens, a system uses various heuristics to find pages that can be freed and added to the free list. When a system is low on memory, inactive pages (such as those that were not referenced recently) are usually evicted first and then written to a disk if they are in a modified state. Unmodified pages that were originally fetched from some file do not have to be written to a disk and can be freed immediately (the contents of those pages can be refetched from that file at a later time if it becomes necessary). An access to a page that has been evicted will result in a page fault.
c) There are situations when a program generates and modifies some data which at some point in time becomes absolutely “useless” for that program. At this point, if a request to allocate memory comes in and little or no free space is available, some of the data may have to be written back to a disk regardless of the usefulness of that data. Indeed, when a modified page is selected for eviction, normally, it is blindly written to a disk even if the program that owns that page will never need to read the data stored on that page again. Even worse, an attempt by a program to write to that page (which has been evicted) will cause a page fault and force the system to bring in a page that has no useful content for that program. Therefore, some of the data copied back to a disk and, possibly, fetched from a disk at a later time, may be “useless” data. Such data will occupy space on a swap disk. Writing this data to a disk and reading it later consumes valuable system resources and needlessly stalls a program.
There exist a number of solutions that allow a program to communicate to the OS various information about its future memory behavior.
Some work has been done on the use of hints for prefetching data from disks to memory to reduce the I/O stall time. The system described in “Automatic Compiler-Inserted I/O Prefetching for Out-of-Core Applications,” Mowry, et al., lets a compiler (as well as a programmer) supply “prefetch” and “release” hints for managing file I/O (in regular array-based applications).
Hint-bits are kept in a special designated communication area (DCA)—a set of shared memory pages allocated when an application is started. A bit in a “shared” page corresponds to one page (or a set of contiguous pages which may not always be convenient) in the virtual address space of an application.
This DCA may have to be pinned in memory by the OS to ensure that when the OS needs to access the DCA, the DCA is present in main memory. A large 64-bit virtual address (VA) space may require a large number of pinned pages to be allocated for such a bit-map which decreases the amount of memory available for other purposes.
(For example, if one bit is used to encode a 4 KB page, then one “shared” page can cover 128 MB (2^27 bytes) of virtual address space. Therefore, 32 (2^(32-27)) “shared” pages are necessary to cover the entire 32-bit VA space, and many more pages (2^(64-27)) are needed for a 64-bit address space.)
Addressed_space = page_size * bits_per_page = page_size * (page_size_in_bytes * bits_per_byte) = 2^12 * (2^12 * 2^3) = 2^27
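The arithmetic above can be checked with a short sketch. The function names are hypothetical and chosen for illustration; the 4 KB page size and the one-bit-per-page encoding are the assumptions stated in the example.

```python
PAGE_SIZE = 4096      # bytes per page (4 KB, as assumed in the example above)
BITS_PER_BYTE = 8


def bytes_covered_per_bitmap_page(page_size: int = PAGE_SIZE) -> int:
    """Virtual address space covered by one bitmap page, at one bit per page."""
    bits_per_page = page_size * BITS_PER_BYTE
    return page_size * bits_per_page


def bitmap_pages_needed(va_bits: int, page_size: int = PAGE_SIZE) -> int:
    """Number of bitmap pages needed to cover a 2**va_bits byte address space."""
    covered = bytes_covered_per_bitmap_page(page_size)
    return -(-(1 << va_bits) // covered)  # ceiling division
```

With these assumptions, `bitmap_pages_needed(32)` gives 32 pages, while `bitmap_pages_needed(64)` gives 2^37 pages, which at 4 KB each would amount to 2^49 bytes (512 TiB) of bitmap.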
An alternative solution is to have a bit-map for only a subset of pages that is of interest to an application. While it decreases the size of the DCA, it also increases the run-time overhead associated with translating virtual addresses to corresponding bits in the DCA.
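The trade-off can be sketched as follows. This is an illustrative model only, not the data structure used in Mowry, et al.; `dense_bit_index`, `SparseHintMap`, and the range layout are hypothetical. A full bit-map needs a single shift to locate a bit, whereas a bit-map covering only selected ranges needs a search first.

```python
import bisect

PAGE_SHIFT = 12  # 4 KB pages (assumption)


def dense_bit_index(vaddr: int) -> int:
    """Full bit-map: the bit position is simply the virtual page number."""
    return vaddr >> PAGE_SHIFT


class SparseHintMap:
    """Bit-map covering only selected, page-aligned, non-overlapping ranges."""

    def __init__(self, ranges):
        # ranges: iterable of (start_address, length_in_bytes)
        self.ranges = sorted(ranges)
        self.starts = [start for start, _ in self.ranges]
        self.bit_base = []          # first bit index assigned to each range
        total = 0
        for _, length in self.ranges:
            self.bit_base.append(total)
            total += length >> PAGE_SHIFT
        self.total_bits = total

    def bit_index(self, vaddr: int):
        """Return the bit index for vaddr, or None if it is not covered."""
        i = bisect.bisect_right(self.starts, vaddr) - 1  # extra lookup step
        if i < 0:
            return None
        start, length = self.ranges[i]
        if vaddr >= start + length:
            return None
        return self.bit_base[i] + ((vaddr - start) >> PAGE_SHIFT)
```

In this sketch the sparse map stores one bit per page of interest, but every translation pays for a binary search and a bounds check, which is the run-time overhead referred to above.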
The release hints, as described in “Automatic Compiler-Inserted I/O Prefetching for Out-of-Core Applications,” Mowry, et al., in OSDI '96, instruct the OS (by making a system call, which is fairly expensive) to free unmodified (clean) pages and write out to disk all modified (dirty) pages, including those pages that contain scratch data that will never be used by an application. In order to let a compiler be aggressive in generating hints and, at the same time, limit the overhead associated with making a system call to the OS, the authors of Mowry, et al. use a run-time library to filter duplicate hints. Unfortunately, not all duplicate hints can be filtered, and a system call is issued for every hint that could not be eliminated by the run-time library. The hints in Mowry, et al. can only be specified for pages corresponding to memory-mapped files as opposed to arbitrary pages (such as those containing program stack and heap). There is no support in Mowry, et al. for canceling hints.
Patterson, et al., in “Informed Prefetching and Caching,” in SOSP '95, propose proactive mechanisms (informed prefetching and informed caching) to be used by I/O-intensive applications (read-intensive applications in particular) to improve file buffer management. The hints supplied by an application (through the ioctl( ) system call) disclose some information about the future accesses to the file system. The hints proposed by Patterson, et al. are used by the OS for aggressive file prefetching (to reduce the I/O stalls) and intelligent cache replacement (to reduce the chance of evicting data that will be reused). Two hints are described for the following access patterns: a) for reading files sequentially from beginning to end and b) for reading a file according to an ordered list of intervals. The authors demonstrate the benefits of their mechanisms in a multiprogrammed environment.
However, the system described in Patterson, et al. does not apply the hints for virtual memory management but uses them for file system resource management only. In addition, there is no support in Patterson, et al. for specifying to the OS what data will never be accessed and can be evicted and invalidated.
The importance of reducing disk I/O to improve performance in a system with demand-driven virtual memory is the main motivation for the work described in “The Case for Compressed Caching in Virtual Memory Systems,” Wilson, et al., in USENIX Technical Conference '99. Wilson, et al. re-evaluate the idea of compressed caching for virtual memory in the context of modern systems and the current technology trends: fast processors, moderately fast memory, and slow disks. They show that the OS can substantially reduce paging and improve throughput by keeping some pages in a compressed form in main memory. In the system they propose, a part of physical memory is allocated for a compressed cache, which can be thought of as an extra level of hierarchy between main memory and a disk storing (compressed) swapped out pages. Hence, the compressed cache acts as a backing store for uncompressed regions of memory.
The system proposed in Wilson, et al. demonstrates the benefit of reducing disk I/O. In fact, the real benefit is likely to be even higher than that reported in Wilson, et al. because the authors chose not to compress pages containing executable code.
Their system is reactive—it passively observes the behavior of an application and makes some predictions about its future behavior based exclusively on the past behavior. The system in Wilson, et al. does not have any support for an application to point out to the OS what pages (such as those containing useless data) do not have to be backed up. As a result, more pages than necessary will have to be compressed and more compressed pages will have to be written back to a disk. The former will increase cache pollution and increase the amount of data copied by a compression procedure, while the latter will increase the amount of data transferred between memory and external storage.
U.S. Pat. No. 5,559,978 to Spilo describes a system resembling closely the one discussed in Wilson, et al. Similarly, there is no support for supplying hints to a secondary virtual memory control program to instruct it to place a page directly on the free list and bypass the compression phase.
U.S. Pat. No. 5,544,349 to Berry et al. describes a system where some memory pages are compressed to increase the amount of code and data that can be present in main memory and to reduce paging activity. The system proposed by Berry, et al. does not have any support for communicating hints to the OS and does not take into account the fact that some pages do not have to be compressed.
Although the DIAGNOSE X‘214’ operation (available on S/390 servers) can be used to cancel a pending page release, a separate explicit request must be issued directly to the OS. DIAGNOSE X‘214’ (which resembles a software interrupt) is a relatively expensive call, and it costs at least a few hundred instructions to invoke it. In fact, the cost of making a call to DIAGNOSE X‘214’ is close to the cost of processing a page fault (~600 instructions). Also, DIAGNOSE X‘214’ can only be used in the supervisor (privileged) mode. Finally, DIAGNOSE X‘214’ is normally used by the OS itself to release large chunks of memory when a process terminates.
There are several subroutines in AIX* operating systems (as well as in a number of other UNIX*-like OSs) that allow an application to communicate some information about its memory requirements to the OS. (UNIX is a trademark of UNIX System Laboratories, Inc. AIX is a trademark of International Business Machines, Inc.) Specifically, an application can indicate that it no longer needs as much memory as it currently has.
Subroutine shmctl( ), which controls shared-memory operations, can reduce the size of a shared-memory segment but only from the end.
Similarly, truncate( ) and ftruncate( ) can change the length of a regular file.
Subroutines brk( ) and sbrk( ) can be used to control the amount of space allocated for the data segment of the calling process.
Again, the size of the data segment can only be adjusted from the end.
The msync( ) subroutine transfers (MS_SYNC and MS_ASYNC) modified pages of mapped file regions to a disk. It can also invalidate (MS_INVALIDATE) all cached copies of the pages that belong to a mapped file. However, pages occupied by a program's stack, heap, and static data cannot be mapped to a file. Hence, msync( ) cannot be used to specify that certain modified pages containing stack or heap data should be discarded.
The madvise( ) subroutine can be used by an application to advise the OS of expected paging behavior. One of its options, MADV_DONTNEED, is used to instruct the OS that certain pages (corresponding to a mapped file region) will not be needed.
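For illustration, the msync( ) and madvise( ) calls described above can be exercised through Python's standard `mmap` module, which on Unix-like systems wraps them as `flush()` and `madvise()`. This is a minimal sketch: the helper name `sync_then_release` is hypothetical, the availability of `mmap.MADV_DONTNEED` depends on the platform and Python version (3.8 or later), and the exact effect of the advice may differ between AIX and other systems.

```python
import mmap
import tempfile


def sync_then_release(payload: bytes):
    """Modify a mapped file page, write it back, then advise it is not needed.

    Returns (bytes read back from the file, whether the advice was issued).
    """
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * mmap.PAGESIZE)   # back the mapping with one page of file
        f.flush()
        with mmap.mmap(f.fileno(), mmap.PAGESIZE) as mm:
            mm[:len(payload)] = payload  # dirty the mapped page
            mm.flush()                   # write modified pages back (msync on Unix)
            advice = getattr(mmap, "MADV_DONTNEED", None)
            advised = advice is not None and hasattr(mm, "madvise")
            if advised:
                mm.madvise(advice)       # hint: these pages will not be needed
        f.seek(0)
        return f.read(len(payload)), advised
```

Note that both steps are explicit calls into the OS and both operate on a mapped file region; neither offers a way to mark modified stack or heap pages as discardable.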
The instruction sets of several modern microprocessors include a number of instructions for managing on-chip instruction and data caches as well as instruction and data translation look-aside buffers (TLBs). TLBs are used to speed up the translation of virtual memory addresses to physical memory addresses. Among other things, TLBs keep the translation information for a few recently accessed pages.
Some of the instructions for managing caches and TLBs are privileged (such as those that invalidate entries in data caches and TLBs) and, due to their destructive effect, are available to the supervisor-mode (privileged) processes only, while others (such as hints for prefetching a cache line), can be used by user-level processes.
Some cache management instructions are used by the OS to ensure that the DMA (Direct Memory Access) transfers are performed correctly. Usually, a DMA controller cannot fetch the data from a processor cache or place the data directly into the cache. Hence, the data must be flushed (copied) from the cache to system memory and the cache lines must be invalidated before performing a DMA transfer. In addition, the contents of the DMA buffer must be flushed (copied) to memory once the transfer is complete. These steps are necessary to ensure that the main memory always contains the most recent data (as well as the only copy of the data) immediately before and after a DMA transfer.
In processors that have separate instruction and data caches, modifications done to a cache block in the data cache are not usually propagated directly to the instruction cache. In this case, the data must be flushed from the data cache to memory. Then, the corresponding instruction cache blocks and the instruction prefetch buffer must be invalidated.
Translation look-aside buffers (TLBs) are not usually kept coherent with respect to page tables stored in main memory. Therefore, in any uniprocessor as well as a multiprocessor system, once a page table entry is modified, a corresponding TLB entry (or an entire TLB) must be invalidated (with a special privileged instruction) to ensure correctness by forcing a subsequent reload of a TLB entry as a result of a TLB miss.
Therefore, what is needed is a communication channel that will allow user-level programs to specify to the OS which pages they will and will not be referencing soon; the degree of importance for a given page to be in memory; which pages have content that can be safely discarded (invalidated); and how invalidated pages should be initialized on a page fault. (Those skilled in the art will understand that other useful information may be exchanged between a program and the OS.) A page can be any arbitrary page including those containing program stack and data.
It is important that this communication channel be very efficient so that it can be used aggressively when it is necessary. Therefore, this channel should avoid using system calls as a way to communicate hints to the OS.
It is desirable to have advisory (non-binding) hints. The OS must be able to simply ignore the advisory hints (for example, when enough memory is available). If the OS chooses to disregard program-supplied hints, the overhead and the loss of performance should be minimal. A program should be able to batch the hints and the OS should not have to act upon the information in the hints immediately. If possible, the OS should not be involved in processing the advisory hints until, possibly, a context switch or a page fault.
A program should have an opportunity to “change” its mind and “cancel” certain hints for a page for which it is still an owner. Canceling those hints should not require making an explicit call to the OS. It should be inexpensive and transparent for both an application and the OS. The communication channel should scale to large address spaces.
The prior art has no references to a communication channel with the characteristics listed above.
Since none of the communication channels described in the prior art have the desired characteristics, new methods for exchanging a variety of application specific information with the OS are needed for efficient management of virtual memory.