1. Field of the Invention
This invention relates to the field of memory management in computers, and in particular to address mapping used to improve I/O speed.
2. Description of the Related Art
Many computer systems depend for their speed and efficiency on the ability to rapidly transfer data between devices and system memory. In many cases, however, addressing conventions and restrictions make it necessary to perform intermediate copies of data to be transferred before the final transfer can actually take place. Such copying can severely slow down the transfer rate.
One widely used method for increasing the speed of input and/or output (“I/O”) between certain devices (or other processes) and memory is known as “direct memory access” (DMA). DMA is a capability provided by some computer bus architectures that allows data to be sent directly from an attached device (such as a disk drive) to system memory, without intermediate action by the processor. In order to implement DMA, a portion of system memory is usually designated as an area to be used specifically for DMA operations. Time is lost whenever a block of data, such as a “page,” that is not already in the designated memory portion must be copied to or from that portion in order to perform a DMA transfer.
As a concrete example, modern Intel x86 processors support a physical address extension (PAE) mode that allows the hardware to address up to 64 GB of memory using 36-bit addresses. Unfortunately, many devices that directly access memory to perform I/O operations can address only a subset of this memory. For example, network interface cards with the common 32-bit PCI (Peripheral Component Interconnect) interface can address memory residing in only the lowest 4 GB of memory, even on systems that support up to 64 GB of memory. Other 32-bit PCI devices can access memory above 4 GB using a technique known as DAC (Dual Address Cycle), but this technique requires two address transfers—one for the low 32 bits and another for the high 32 bits.
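The address arithmetic in the preceding example can be illustrated with the following minimal sketch. The function names (`addressable_bytes`, `device_can_reach`, `split_for_dac`) are hypothetical and chosen only for illustration; the sketch models the address ranges involved and the splitting of a 64-bit address into two 32-bit halves, not any actual bus or driver interface.

```python
from typing import Tuple

GB = 1 << 30  # bytes in one gigabyte (binary)


def addressable_bytes(address_bits: int) -> int:
    """Number of bytes reachable with addresses of the given width."""
    return 1 << address_bits


def device_can_reach(phys_addr: int, device_address_bits: int = 32) -> bool:
    """True if a device limited to `device_address_bits` can address phys_addr."""
    return 0 <= phys_addr < addressable_bytes(device_address_bits)


def split_for_dac(phys_addr: int) -> Tuple[int, int]:
    """Split an address into (low 32 bits, high 32 bits), as two transfers would carry it."""
    low = phys_addr & 0xFFFFFFFF
    high = phys_addr >> 32
    return low, high
```

Under these assumptions, 36-bit addresses span 64 GB and 32-bit addresses span 4 GB, so a page located at, for example, 5 GB lies outside the reach of a device restricted to single 32-bit addresses.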
One known way to support output from “high” memory (that is, memory above 4 GB) is to copy the data from high memory to a temporary buffer in “low” memory for the DMA operation. For input operations, a portion of low memory in the temporary buffer is allocated for storage of the input data, which can then be copied to high memory. This technique is employed, for example, by the Linux 2.4 kernel, which uses the term “bounce buffer” to describe the temporary buffering and copying process. Unfortunately, copying can impose significant overhead, which in turn results in increased latency, reduced throughput, and/or increased CPU load when performing I/O.
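The output case of the bounce-buffer technique can be sketched as follows. This is a toy model under stated assumptions, not the Linux implementation: the `Memory` class, `device_dma_read`, and `output_page` are hypothetical names, memory is modeled as a dictionary of page-aligned addresses, and the device is assumed to be limited to 32-bit addresses.

```python
PAGE_SIZE = 4096
LOW_LIMIT = 1 << 32  # first address a 32-bit device cannot reach (4 GB)


class Memory:
    """Toy physical memory: maps page-aligned addresses to page contents."""

    def __init__(self):
        self.pages = {}
        self.copies = 0  # number of CPU page copies performed (the overhead)

    def copy_page(self, src, dst):
        """Copy one page from src to dst and count the copy."""
        self.pages[dst] = bytes(self.pages[src])
        self.copies += 1


def device_dma_read(mem, addr):
    """Hypothetical 32-bit device fetching one page from memory for output."""
    if addr >= LOW_LIMIT:
        raise ValueError("device cannot address memory at or above 4 GB")
    return mem.pages[addr]


def output_page(mem, addr, bounce_addr):
    """Send one page to the device, bouncing through low memory if needed."""
    if addr < LOW_LIMIT:
        return device_dma_read(mem, addr)  # direct DMA, no copy
    mem.copy_page(addr, bounce_addr)       # extra copy into the low buffer
    return device_dma_read(mem, bounce_addr)
```

In this sketch, a page already in low memory is transferred without any copy, whereas each page in high memory costs one additional page copy before the transfer can begin; that per-page copy is the overhead referred to above.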
Another known technique is the remapping of memory regions (in particular, pages) as described in U.S. Pat. No. 6,075,938, Bugnion, et al., “Virtual Machine Monitors for Scalable Multiprocessors,” issued 13 Jun. 2000 (“Bugnion '938”). The basic idea of this system, which operates in the context of a NUMA (non-uniform memory access) multi-processor, is that memory pages associated with hardware memory modules that are farther away (defined in terms of access latency) are migrated or replicated by making copies in hardware memory modules closer to a process that is accessing them. The process page mappings are modified transparently to use the local page copy instead of the original remote page. In other words, the Bugnion '938 system attempts to improve access speed by improving memory locality. When it comes to I/O, however, and in particular in the context of DMA, the problem is often not whether a certain memory space is sufficiently local, but rather whether it can be accessed at all.
Still other existing systems enable I/O to “high” memory by including special hardware components that provide support for memory remapping. For example, a separate I/O memory management unit (I/O MMU) may be included for I/O operations. The obvious disadvantage of this solution is its requirement for the extra hardware.
A related problem is the dynamic management of the “low” memory, which may be a scarce resource that needs to be allocated among various competing uses. In other words, if several devices or processes must compete for use of a common memory region (here, “low”) designated for high-speed I/O (such as DMA), then some mechanism must be provided to efficiently allocate its use. Such memory management is typically carried out by a component of the operating system.
What is needed is therefore a system that eliminates or at least reduces the need for copying in I/O operations to or from at least one limited memory space, especially in high-speed I/O contexts such as DMA. The system should preferably be usable not only in a conventional computer system, in particular, in its operating system, but also in computer systems that include at least one virtualized computer. Moreover, the system should preferably also be able to manage the limited memory space dynamically, and it should not require specific hardware support. This invention provides such a system and method of operation whose various aspects meet these different goals.