Most modern computer systems provide mechanisms for sharing memory between a main central processing unit (CPU) and accelerator devices (e.g., input/output (IO) devices). For example, modern video cards, one example of an accelerator device, need to read large amounts of data from the main system memory. Fast network adapters, another example of an accelerator device, read outgoing data directly from memory and write packets that have just arrived over the network directly into the system memory. Disk controllers, yet another example of an accelerator device, may read data directly from memory and write it to magnetic disk media, or read data from the media and store it in the memory. These exemplary accelerator devices and others, while commonly separate, may ultimately form part of a device (e.g., a chip package) that includes the CPU or be formed on or with the CPU on a single chip.
Direct Memory Access (DMA)
One technique commonly used to share memory between a main CPU and accelerator devices is called Direct Memory Access (DMA). The operating system (OS) configures a block of physical memory to be directly accessed by an accelerator device. This is typically done by directly programming special registers in the DMA controller. The DMA controller then transfers the data between the system memory and the accelerator device. However, with simple DMA controllers the block of system memory may have to be physically contiguous. This may be a problem for devices that need to transfer large amounts of data at a time, because allocating a large contiguous block of physical memory may not always be possible due to memory fragmentation.
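The fragmentation problem can be illustrated with a minimal sketch. The free-page map and the helper name `find_contiguous_run` are hypothetical and chosen only for illustration; a real allocator is considerably more involved.

```python
def find_contiguous_run(free_pages, count):
    """Return the first page of a run of `count` adjacent free pages, or None."""
    run_start = None
    run_len = 0
    prev = None
    for page in sorted(free_pages):
        if prev is not None and page == prev + 1:
            run_len += 1          # page extends the current run
        else:
            run_start = page      # gap found: start a new run here
            run_len = 1
        if run_len >= count:
            return run_start
        prev = page
    return None


# Hypothetical free-page map: eight pages are free, but never more than two in a row.
free_pages = {0, 1, 4, 5, 8, 9, 12, 13}

print(find_contiguous_run(free_pages, 2))  # a 2-page buffer fits: prints 0
print(find_contiguous_run(free_pages, 4))  # a 4-page buffer does not: prints None
```

In this example enough memory is free in total for the 4-page request, yet a simple DMA controller could not use it as a single buffer.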
Scatter-Gather DMA
More sophisticated DMA controllers can work around this restriction by supporting scatter-gather DMA transfers. Such controllers can work with memory buffers that are not contiguous in physical memory. For example, a linked-list mechanism could be used to describe all the fragments that make up the memory buffer. The DMA controller then automatically follows the linked list during the transfer of the data between the buffer and the accelerator device.
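The linked-list mechanism might look roughly like the following sketch. The `Descriptor` layout, the helper names, and the addresses are assumptions made for illustration; actual descriptor formats are specific to each controller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    """One fragment of a buffer (hypothetical layout)."""
    phys_addr: int                        # start of the fragment in physical memory
    length: int                           # fragment length in bytes
    next: Optional["Descriptor"] = None   # following fragment, or None at the end


def build_chain(fragments):
    """Link (phys_addr, length) pairs into a descriptor chain, preserving order."""
    head = None
    for phys_addr, length in reversed(fragments):
        head = Descriptor(phys_addr, length, head)
    return head


def gather(memory, head):
    """Follow the chain as a controller would and concatenate the fragments."""
    out = bytearray()
    node = head
    while node is not None:
        out += memory[node.phys_addr:node.phys_addr + node.length]
        node = node.next
    return bytes(out)


# Simulated physical memory with one logical buffer scattered over three places.
memory = bytearray(64)
memory[40:45] = b"hello"
memory[8:10] = b", "
memory[20:25] = b"world"

chain = build_chain([(40, 5), (8, 2), (20, 5)])
print(gather(memory, chain))  # prints b'hello, world'
```

The software only has to describe the fragments once; the traversal itself needs no CPU involvement in a real controller.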
Using DMA is relatively easy if a sufficiently large memory block (to be used for DMA transfers) can be pre-allocated. This is a very common approach used by many device drivers. However, a DMA transfer may not be possible or practical in some cases. For instance, it is often not possible to pre-allocate the memory buffer because the memory that the data is to be transferred to or from is allocated by a user process that may not be aware of DMA.
For instance, consider an application that reads a data file (e.g., a large video file) from disk, processes it, and sends it over a network. In this case, the application may allocate a block of virtual memory into which the data is read from disk. It may then process the data in place, or it may copy the processed video file to another block in virtual memory. Using the DMA approach, the application will invoke an application programming interface (API) that causes the OS or a device driver to pin the data block in physical memory, program the DMA controller, and transfer the data between the physical memory and the accelerator device. With simple DMA controllers, multiple DMA transfers may be required if the data is fragmented. Even if the DMA controller supports the scatter-gather mechanism, there is still a significant burden on the OS, device driver, application, and programmer to ensure that all the components cooperate correctly. The application may have to invoke special APIs to allocate the buffer. If the data is already located elsewhere in virtual memory, the CPU may have to copy the data to the DMA buffer before the actual DMA transfer can begin. The programmer must constantly take into account the requirements of the DMA controller (which may differ from system to system), ensure that the memory is not used while the DMA transfer is in progress, prevent the memory from being released until the DMA transfer completes, etc.
Limitations of DMA in Modern Multi-Threading Environments
In modern systems where the programmer has to control multiple threads that are executing concurrently, all these tasks become even more complex and can easily lead to errors and/or performance issues. For example, a programmer may want to take advantage of an accelerator device (e.g., a graphics processing unit, or GPU, which may be present on a video card, on the same board as the CPU, or integrated into another device such as a northbridge or the CPU) to perform some computational work on the data. The data may already be in memory, but that memory may not be accessible to the DMA controller. As such, the application will have to allocate another memory buffer that is accessible to the DMA controller, copy the data to the buffer, initiate a DMA transfer, wait for the accelerator device to process the data, transfer (via DMA) the results back into system memory, and finally copy the data to a desired location in the application's virtual memory address space. In this example there are two memory-to-memory copy operations and two DMA transfers between the system memory and the accelerator device. Further, at least parts of the memory have to be pinned (to prevent them from being swapped out or reused by another process). The memory-to-memory copy operations are typically very computationally expensive and can easily lead to major performance degradation. Even if the memory-to-memory copy operations can be avoided, there may still be other problems. Examples include cache coherency problems (e.g., DMA transactions typically require that caching be disabled for the specified memory address range), security problems (e.g., it may be difficult to indicate which parts of memory are accessible for read/write operations and which parts are read-only), etc.
In general, this approach is better suited to programs that require a few large memory transfers than to programs that require frequent but small memory transfers from different locations in the program's virtual memory address space.
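The round trip in the accelerator example above can be sketched as follows, with the copies and transfers counted explicitly. This is a simulation only; `TransferLog`, `offload`, and the use of a plain function in place of the accelerator are assumptions for illustration.

```python
class TransferLog:
    """Counts the data movements performed during one offload."""
    def __init__(self):
        self.cpu_copies = 0
        self.dma_transfers = 0


def offload(data, process, log):
    # 1. CPU copies application data into a DMA-accessible (pinned) buffer.
    dma_buffer = bytearray(data)
    log.cpu_copies += 1
    # 2. DMA transfer from system memory to the accelerator.
    device_memory = bytes(dma_buffer)
    log.dma_transfers += 1
    # 3. The accelerator processes the data (simulated by a plain function call).
    device_result = process(device_memory)
    # 4. DMA transfer of the results back into the DMA buffer.
    dma_buffer = bytearray(device_result)
    log.dma_transfers += 1
    # 5. CPU copies the results to their final place in application memory.
    result = bytes(dma_buffer)
    log.cpu_copies += 1
    return result


log = TransferLog()
result = offload(b"abc", lambda d: d.upper(), log)
print(result, log.cpu_copies, log.dma_transfers)  # prints b'ABC' 2 2
```

Because every call pays for all four data movements regardless of payload size, the fixed cost is amortized well over a few large transfers and poorly over many small ones.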
Graphics Address Remapping Table (GART)
A more complex mechanism that tries to solve the memory sharing problem is the Graphics Address Remapping Table (GART), which is used by a GPU to perform physical-to-physical translation of memory addresses. It provides the GPU with a virtual, contiguous view of memory, even though the actual data may be scattered across randomly allocated pages.
One limitation of GART is that it requires prior knowledge of where the data (that will be accessed by the GPU) resides, so that the CPU can map it as appropriate in the GART tables. This again creates serious restrictions on the programmer. There is no easy workaround if the GPU unexpectedly needs to access data in memory that is not currently mapped in the GART tables.
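A minimal sketch of such a remapping table follows, including the case in which the GPU touches a page that was not mapped ahead of time. The `Gart` class, the 4096-byte page size, and the page numbers are assumptions for illustration and do not describe any particular hardware.

```python
PAGE_SIZE = 4096  # assumed page size


class Gart:
    """Toy remapping table: contiguous aperture pages -> scattered physical pages."""
    def __init__(self):
        self.table = {}  # aperture page index -> physical page number

    def map(self, aperture_page, phys_page):
        self.table[aperture_page] = phys_page

    def translate(self, aperture_addr):
        page, offset = divmod(aperture_addr, PAGE_SIZE)
        if page not in self.table:
            # Nothing was mapped in advance, so the access cannot be served.
            raise LookupError(f"aperture page {page} is not mapped")
        return self.table[page] * PAGE_SIZE + offset


gart = Gart()
# The CPU maps three scattered physical pages as aperture pages 0, 1 and 2.
for aperture_page, phys_page in enumerate([907, 12, 455]):
    gart.map(aperture_page, phys_page)

# Adjacent aperture addresses resolve to unrelated physical pages.
print(gart.translate(PAGE_SIZE - 1) // PAGE_SIZE)  # prints 907
print(gart.translate(PAGE_SIZE) // PAGE_SIZE)      # prints 12

try:
    gart.translate(3 * PAGE_SIZE)  # page 3 was never mapped
except LookupError as exc:
    print(exc)                     # prints: aperture page 3 is not mapped
```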
Another limitation is that there is no concept of multiple address spaces in GART. The GART tables may have to be re-programmed every time there is a context switch between the processes that use the GPU for computations. If two different processes try to use the GPU at the same time, the GART tables may need to contain mapping entries that point to memory used by both processes. This creates potential security problems because it allows one process to execute code on the GPU that could access the memory of another process (via the GART mechanism).
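The security concern can be illustrated with a toy model of a single shared table. The `owner` field exists only to make the problem visible in this sketch; it is an assumption of the example, and the point is precisely that the translation never consults it.

```python
PAGE_SIZE = 4096  # assumed page size

# One global table shared by every process that uses the GPU.
gart_table = {
    0: {"phys_page": 100, "owner": "A"},
    1: {"phys_page": 200, "owner": "B"},
}


def gpu_translate(requesting_process, aperture_addr):
    """Translate without any per-process check; the requester is ignored."""
    page, offset = divmod(aperture_addr, PAGE_SIZE)
    entry = gart_table[page]
    return entry["phys_page"] * PAGE_SIZE + offset, entry["owner"]


# GPU code launched by process A reads aperture page 1, which belongs to B.
phys_addr, owner = gpu_translate("A", 1 * PAGE_SIZE)
print(owner)  # prints B
```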
Thus, there are significant limitations when using DMA or GART mechanisms. A significant drawback is that the OS, driver, and/or application must know, in advance, what memory will be accessed by the IO device in order to properly configure the DMA or GART before the IO device begins processing the data. For example, the IO device might obtain the address of the data in virtual memory as part of its computation. With the DMA or GART approach, the device cannot simply access the data at the new address. Instead, it requires intervention from the OS, driver, or application to reconfigure the DMA or GART to allow the IO device to access the new memory location.