1. Field of the Invention
This invention relates to the field of memory management in computers, in particular in the context of address mapping in order to improve I/O speed.
2. Description of the Related Art
Many computer systems depend for their speed and efficiency on the ability to rapidly transfer data between devices and system memory. In many cases, however, addressing conventions and restrictions make it necessary to perform intermediate copies of data to be transferred before the final transfer can actually take place. Such copying can severely slow down the transfer rate.
One widely used method for increasing the input/output ("I/O", meaning either or both) speed between certain devices (or other processes) and memory is known as "direct memory access" (DMA). DMA is a capability provided by some computer bus architectures that allows data to be sent directly from an attached device (such as a disk drive) to system memory, without intermediate action by the processor. In order to implement DMA, a portion of system memory is usually designated as an area to be used specifically for DMA operations. Obviously, time is lost whenever a block of data (such as a "page" that is not already in the designated memory portion) must be copied to or from the designated memory portion to perform a DMA transfer.
As a concrete example, modern Intel x86 processors support a physical address extension (PAE) mode that allows the hardware to address up to 64 GB of memory using 36-bit addresses. Unfortunately, many devices that directly access memory to perform I/O operations can address only a subset of this memory. For example, network interface cards with the common 32-bit PCI (Peripheral Component Interconnect) interface can address memory residing in only the lowest 4 GB of memory, even on systems that support up to 64 GB of memory. Other 32-bit PCI devices can access memory above 4 GB using a technique known as DAC (Dual Address Cycle), but this technique requires two address transfers: one for the low 32 bits and another for the high 32 bits.
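The address-width arithmetic above can be checked with a short sketch. The function names `addressable_bytes` and `dac_split` are hypothetical and serve only as illustration; the split shown is simply the low and high 32-bit halves of an address, not a model of the PCI bus protocol itself.

```python
GB = 2**30

def addressable_bytes(address_bits):
    """Number of bytes reachable with addresses of the given width."""
    return 1 << address_bits

def dac_split(address):
    """Split an address into the (low 32 bits, high 32 bits) halves
    that would be sent in two separate address transfers."""
    return address & 0xFFFFFFFF, address >> 32

# 36-bit PAE addresses reach 64 GB; plain 32-bit addresses reach 4 GB.
pae_limit = addressable_bytes(36)
pci32_limit = addressable_bytes(32)

# An address just above 4 GB needs a non-zero high half.
low_half, high_half = dac_split(4 * GB + 0x1000)
```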
One known way to support output to "high" memory (that is, memory above 4 GB) is to copy the data from high memory to a temporary buffer in "low" memory for the DMA operation. For input operations, a portion of low memory in the temporary buffer is allocated for storage of the input data, which can then be copied to high memory. This technique is employed, for example, by the Linux 2.4 kernel, which uses the term "bounce buffer" to describe the temporary buffering and copying process. Unfortunately, copying can impose significant overhead, which results in turn in increased latency, reduced throughput, and/or increased CPU load when performing I/O.
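The copying overhead of the bounce-buffer approach can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the Linux implementation: `Memory`, `dma_output`, and the list standing in for the device are all hypothetical, and each buffer is assumed to lie entirely on one side of the 4 GB boundary.

```python
LOW_LIMIT = 4 * 2**30  # assumed: the device addresses only memory below 4 GB

class Memory:
    """Sparse toy model of physical memory: start address -> buffer contents."""
    def __init__(self):
        self.buffers = {}
        self.bytes_copied = 0  # running total of bounce-copy traffic

    def copy(self, src, dst):
        data = self.buffers[src]
        self.buffers[dst] = bytes(data)
        self.bytes_copied += len(data)

def dma_output(mem, addr, bounce_addr, device):
    """Output the buffer at addr, bouncing it through low memory if needed."""
    if addr >= LOW_LIMIT:
        mem.copy(addr, bounce_addr)   # the extra copy the text describes
        addr = bounce_addr
    device.append(mem.buffers[addr])  # stands in for the actual DMA transfer

mem = Memory()
device = []
high_addr = 5 * 2**30
mem.buffers[high_addr] = b"x" * 4096
mem.buffers[0x200000] = b"y" * 4096

dma_output(mem, high_addr, 0x100000, device)  # high buffer: one 4096-byte copy
dma_output(mem, 0x200000, 0x100000, device)   # low buffer: no copy needed
```

Every output from high memory pays for one full copy of the data, which is the cost that the remapping approach aims to avoid.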
Another known technique is the remapping of memory regions (in particular, pages) as described in U.S. Pat. No. 6,075,938, Bugnion, et al., "Virtual Machine Monitors for Scalable Multiprocessors," issued Jun. 13, 2000 ("Bugnion '938"). The basic idea of this system, which operates in the context of a NUMA (non-uniform memory access) multi-processor, is that memory pages associated with hardware memory modules that are farther away (defined in terms of access latency) are migrated or replicated by making copies in hardware memory modules closer to a process that is accessing them. The process page mappings are modified transparently to use the local page copy instead of the original remote page. In other words, the Bugnion '938 system attempts to improve access speed by improving memory locality. When it comes to I/O, however, in particular in the context of DMA, the problem is often not whether a certain memory space is sufficiently local, but rather whether it can be accessed at all.
Still other existing systems enable I/O to "high" memory by including special hardware components that provide support for memory remapping. For example, a separate I/O memory management unit (I/O MMU) may be included for I/O operations. The obvious disadvantage of this solution is its requirement for the extra hardware.
A related problem is the dynamic management of the "low" memory, which may be a scarce resource that needs to be allocated among various competing uses. In other words, if several devices or processes must compete for use of a common memory region (here, "low") designated for high-speed I/O (such as DMA), then some mechanism must be provided to efficiently allocate its use. Such memory management is typically carried out by a component of the operating system.
What is needed is therefore a system that eliminates or at least reduces the need for copying in I/O operations to or from at least one limited memory space, especially in high-speed I/O contexts such as DMA. The system should preferably be usable not only in a conventional computer system, in particular, in its operating system, but also in computer systems that include at least one virtualized computer. Moreover, the system should preferably also be able to manage the limited memory space dynamically, and it should not require specific hardware support. This invention provides such a system and method of operation whose various aspects meet these different goals.
The invention provides a method and corresponding system implementation for performing an input/output (I/O) operation in a computer between an I/O-initiating subsystem and a device through a memory, where the memory is arranged into portions such as pages that are separately addressable using first identifiers, such as page numbers. It is assumed that, for the I/O operation, the device accesses a device-accessible space of the memory, whereas the subsystem addresses I/O requests using second (or, in the preferred virtualized embodiment, third) identifiers to some other memory space, in particular to a space of the memory that is inaccessible to the device. In other words, the subsystem does not normally address I/O requests to the region of the memory that the device accesses for I/O operations. One example of this would be DMA where the device addresses only a lower address region of the memory but the I/O-initiating subsystem addresses its requests to an upper address region.
According to the invention, a manager, in particular, a memory map within the manager, initially maps the second identifiers to respective first identifiers that identify portions of the memory in the device-inaccessible memory space. For any I/O request that meets a remapping criterion, a remapping module in the manager remaps the corresponding second identifier to one of the first identifiers that identifies a portion of the memory in the device-accessible space of the memory.
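The mapping and remapping just described can be sketched as follows. This is an illustrative model only: the class name `Manager`, the method `handle_io`, and the representation of identifiers as integers are assumptions, and the copying of page contents that would accompany a real remapping is omitted.

```python
class Manager:
    """Toy memory map from second identifiers to first identifiers."""

    def __init__(self, low_pages, initial_map):
        self.low_set = set(low_pages)      # first identifiers the device can reach
        self.free_low = list(low_pages)    # device-accessible pages not yet allocated
        self.page_map = dict(initial_map)  # second identifier -> first identifier

    def is_device_accessible(self, second_id):
        return self.page_map[second_id] in self.low_set

    def handle_io(self, second_id, meets_criterion):
        """Return the first identifier to use for this request, remapping the
        second identifier into device-accessible space when the criterion is
        met and a free device-accessible page exists."""
        if (meets_criterion
                and not self.is_device_accessible(second_id)
                and self.free_low):
            # Page contents would be copied here once; omitted in this sketch.
            self.page_map[second_id] = self.free_low.pop()
        return self.page_map[second_id]

# Initially all second identifiers map to device-inaccessible pages.
manager = Manager(low_pages=[1, 2], initial_map={50: 9000, 51: 9001, 52: 9002})
first_try = manager.handle_io(50, meets_criterion=False)   # stays at 9000
second_try = manager.handle_io(50, meets_criterion=True)   # remapped to a low page
```

The subsystem keeps using second identifier 50 throughout; only the manager's map changes.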
In cases where the I/O operation is output of a data set from the subsystem to the device, that is, a "write," then for any I/O request that meets the remapping criterion, and for as long as the I/O request meets the remapping criterion, the manager creates and maintains a single copy of the data set in a buffer in the device-accessible space of the memory and remaps the I/O request to the single copy. For any I/O request that fails to meet the remapping criterion, a new copy of the data set is preferably created in the buffer upon each instance of the I/O request.
In the cases where the I/O operation is input of a data set from the device to the subsystem, that is, a "read," then, for any I/O request that meets the remapping criterion, the data set from the device is preferably stored in the device-accessible space of the memory at a location identified by the first identifier to which the second identifier has been remapped.
One way according to the invention to decide which second identifiers are to be remapped to the device-accessible space of the memory is to calculate an activity score for at least a subset of the second identifiers used by the subsystem in an I/O request during a current measurement period. The second identifier is then remapped if its activity score exceeds a high-activity threshold value. The activity score may be calculated in different ways, for example, as a function of the frequency of use in I/O requests by the subsystem of the respective second identifier. The frequency may be determined over some current activity interval, or over both a current activity interval and over at least one previous activity interval.
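One possible activity score is sketched below. The text leaves the scoring function open, so the particular choice here (the use count in the current interval plus a decayed contribution from previous intervals) is an assumption, as are the names `ActivityTracker`, `decay`, and `high_threshold`.

```python
from collections import Counter

class ActivityTracker:
    """Per-identifier activity score over current and previous intervals."""

    def __init__(self, high_threshold, decay=0.5):
        self.high_threshold = high_threshold
        self.decay = decay        # weight given to previous intervals (assumed)
        self.current = Counter()  # uses during the current interval
        self.history = {}         # score carried over from earlier intervals

    def record_io(self, second_id):
        self.current[second_id] += 1

    def score(self, second_id):
        return self.current[second_id] + self.decay * self.history.get(second_id, 0.0)

    def should_remap(self, second_id):
        return self.score(second_id) > self.high_threshold

    def end_interval(self):
        """Close the current interval, folding its counts into the history."""
        ids = set(self.current) | set(self.history)
        self.history = {i: self.score(i) for i in ids}
        self.current = Counter()

tracker = ActivityTracker(high_threshold=3)
for _ in range(4):
    tracker.record_io(7)          # page 7 is used in four I/O requests
hot_now = tracker.should_remap(7) # score 4 exceeds the threshold of 3
tracker.end_interval()
hot_later = tracker.should_remap(7)  # score decays to 2.0 with no new use
```

Setting `decay=0` reduces the score to the frequency within the current interval alone.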
According to one aspect of the preferred embodiment of the invention, each second identifier that is currently mapped into the device-accessible space of the memory and that meets a remapping condition is again mapped into the device-inaccessible space of the memory. Identifiers may be chosen randomly for mapping back into the device-inaccessible space of the memory, or they may be chosen because their activity score falls below a low-activity threshold during a most recent activity interval, etc. If the mapping for a particular second identifier is changed back to the device-inaccessible space of the memory, then the portion of the memory in the device-accessible space to which the second identifier had previously been remapped may be freed for reallocation.
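Mapping identifiers back into the device-inaccessible space might look like the following sketch, which uses the low-activity threshold variant. The function `demote_inactive` and its parameters are hypothetical, and copying of page contents back to the device-inaccessible page is again omitted.

```python
def demote_inactive(page_map, low_pages, free_low, free_high, scores, low_threshold):
    """Map low-activity second identifiers back to device-inaccessible pages.

    page_map:  second identifier -> first identifier (modified in place)
    low_pages: set of first identifiers in the device-accessible space
    free_low / free_high: lists of unallocated first identifiers
    Returns the list of second identifiers that were mapped back.
    """
    demoted = []
    for second_id, first_id in sorted(page_map.items()):
        if (first_id in low_pages
                and scores.get(second_id, 0) < low_threshold
                and free_high):
            page_map[second_id] = free_high.pop()
            free_low.append(first_id)  # freed for reallocation
            demoted.append(second_id)
    return demoted

page_map = {10: 1, 11: 2, 12: 900}  # 10 and 11 currently sit in low memory
low_pages = {1, 2, 3}
free_low = [3]
free_high = [901, 902]
scores = {10: 5, 11: 0}             # 11 has gone idle

demoted = demote_inactive(page_map, low_pages, free_low, free_high,
                          scores, low_threshold=1)
```

After the call, the low page formerly backing identifier 11 is available to a more active identifier.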
Remapping may also, or alternatively, be conditioned on and adjusted as a function of a measure of scarcity of the portions of the memory in the device-accessible space that are available for allocation.
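As one illustration of conditioning on scarcity, the high-activity threshold could be raised as free device-accessible pages run out. The formula below is purely an assumption chosen for simplicity (the text does not specify one), and `effective_threshold` is a hypothetical name.

```python
def effective_threshold(base_threshold, free_low_pages, total_low_pages):
    """Scale the remapping threshold by the scarcity of device-accessible pages.

    With all pages free the base threshold applies unchanged; as the free
    fraction shrinks the threshold grows, and with no free pages nothing
    qualifies for remapping.
    """
    if free_low_pages <= 0:
        return float("inf")
    return base_threshold * total_low_pages / free_low_pages

plentiful = effective_threshold(4, free_low_pages=100, total_low_pages=100)
scarce = effective_threshold(4, free_low_pages=50, total_low_pages=100)
exhausted = effective_threshold(4, free_low_pages=0, total_low_pages=100)
```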
Yet another criterion that may be used instead of or in addition to those mentioned above is that the current cost of remapping for the current I/O request is less than a cost threshold. The manager then preferably includes a cost-evaluation module to calculate cost according to the chosen definition of cost.
In the preferred embodiment of the invention, the I/O-initiating subsystem is included within a guest system that runs on a host platform. An additional degree of addressing indirection is thereby introduced, such that I/O requests issued by the I/O-initiating subsystem are first remapped from third identifiers to the second identifiers within the guest system. The remapping of the second identifiers to the first identifiers is then done in the manager such that the remapping remains transparent to both the I/O-initiating subsystem and the guest system.
The preferred embodiment of the invention is virtualized, meaning that the guest system is a virtual machine that has a guest operating system and that runs on the underlying hardware host via an intermediate software layer such as a virtual machine monitor. In this virtualized embodiment, the I/O-initiating subsystem is in the virtual machine (for example, an application, or even the guest operating system itself) and the actual I/O operation is between the I/O-initiating subsystem and a physical device through a hardware memory. The first identifiers thereby identify portions of the hardware memory.
Although actual I/O is to and/or from the physical device through the hardware memory, the I/O-initiating subsystem directs its I/O requests, using third identifiers, to a virtual device through a virtual memory. The I/O requests issued by the I/O-initiating subsystem are then remapped by the guest operating system from the third identifiers to second identifiers, which identify portions of the virtual memory. The manager is in this case in the intermediate software layer, which maintains a map of either the second or third (or both) identifiers to the first identifiers. Remapping to and from the physical device-accessible space of the hardware memory is then done as above, on the basis of the activity of either the second or the third identifiers.
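The two levels of indirection in the virtualized embodiment can be sketched with two lookup tables. The dictionaries `guest_map` and `manager_map` and the function `translate` are illustrative assumptions; the example shows only that remapping in the intermediate layer leaves the guest's own mapping untouched.

```python
def translate(third_id, guest_map, manager_map):
    """Resolve a third identifier to the first identifier that backs it."""
    second_id = guest_map[third_id]  # done by the guest operating system
    return manager_map[second_id]    # done by the intermediate software layer

guest_map = {0x40: 7}        # third identifier -> second identifier
manager_map = {7: 0x150000}  # second identifier -> first identifier (high page)

before = translate(0x40, guest_map, manager_map)

# The manager remaps second identifier 7 to a device-accessible page.
# Neither the guest map nor the identifier used by the subsystem changes.
guest_snapshot = dict(guest_map)
manager_map[7] = 0x00200
after = translate(0x40, guest_map, manager_map)
```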
The transparent remapping mechanism according to the invention may alternatively be used to remap second (or third) identifiers that identify contiguous memory portions such that the first identifiers to which they are remapped correspond to contiguous portions of the device-accessible space.
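A contiguous remapping of this kind could be sketched as a search for a run of consecutive free pages in the device-accessible space. The helper names `find_contiguous_run` and `remap_contiguous` are hypothetical, the free pages are modeled as a set of integer page numbers, and copying of page contents is omitted.

```python
def find_contiguous_run(free_pages, count):
    """Return the first page number of a run of `count` consecutive free
    pages, or None if no such run exists."""
    pages = sorted(free_pages)
    run_start = 0
    for i in range(len(pages)):
        if i > 0 and pages[i] != pages[i - 1] + 1:
            run_start = i  # the run was broken; start a new one here
        if i - run_start + 1 == count:
            return pages[run_start]
    return None

def remap_contiguous(page_map, second_ids, free_low):
    """Remap the given second identifiers, in order, onto consecutive
    device-accessible pages. Returns False if no run is available."""
    start = find_contiguous_run(free_low, len(second_ids))
    if start is None:
        return False
    for offset, second_id in enumerate(second_ids):
        page_map[second_id] = start + offset
        free_low.remove(start + offset)
    return True

free_low = {2, 5, 6, 7, 9}
page_map = {100: 4000, 101: 4001, 102: 4002}
ok = remap_contiguous(page_map, [100, 101, 102], free_low)
```

Here the three identifiers land on pages 5, 6, and 7, the only run of three consecutive free pages.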