Direct memory access (DMA) is a feature of modern computers that allows certain hardware subsystems within the computer to access system memory for reading and/or writing independently of the central processing unit (CPU). Many hardware systems use DMA, including disk drive controllers, graphics cards, network cards, and sound cards. Computers that have DMA channels can transfer data to and from devices with much less CPU overhead than computers without a DMA channel.
Without DMA, using programmed input/output (PIO) mode, the CPU typically has to be occupied for the entire time it is performing a transfer. With DMA, the CPU can initiate the transfer, do other operations while the transfer is in progress, and receive an interrupt from the DMA controller once the operation has been completed. This is especially useful in real-time computing applications where not stalling behind concurrent operations is critical.
A typical usage of DMA is copying a block of memory from system RAM to or from a buffer on the device. Such an operation does not stall the processor, which as a result can be scheduled to perform other tasks. DMA transfers are essential to high performance embedded systems. They are also essential in providing so-called zero-copy implementations of peripheral device drivers as well as functionalities such as network packet routing, audio playback and streaming video.
Scatter/gather is used to do DMA data transfers of data that is written to noncontiguous areas of memory. A scatter/gather list is a list of vectors, each of which gives the location and length of one segment in the overall read or write request.
There are many variants of the Scatter-Gather List (SGL) format, one example of which is defined in the IEEE 1212.1 Block Vector Structure Specification. The format of an SGL element with a chaining example is shown in FIG. 1. Within each scatter/gather element is a 4-byte buffer length and an 8-byte buffer address. There is also a 4-byte reserved field, for alignment, with the most significant bit defined as the extension bit (ext). An extension bit set to logical ‘1’ designates the descriptor as pointing to a chained buffer of scatter/gather descriptors. Only the last scatter/gather descriptor may chain, but it does not have to chain. A chained scatter/gather list may chain to another scatter/gather list. The end of the scatter/gather list is realized by matching the scatter/gather count.
A buffer length of zero, as shown in the fourth entry 40, signifies that no data is transferred for that scatter/gather element. It does not signify end of list, nor does it have any other special meaning. In addition to the above IEEE defined fields, the bit immediately to the right of the extension bit in the SGL element (eob—byte 15, bit 6) is reserved for indicating whether the SGL element is the last element for that SGL list. This bit is called the end-of-buffer (eob) bit and when set to a logical ‘1’ indicates that the particular SGL element is the last element for that particular SGL list. The DMA ideally will not request a data length that goes beyond the cumulative length indicated by this last element for a given SGL list. If the DMA requests data beyond the last SGL element's size, the Scatter-Gather Block will trigger an error interrupt, and will freeze all operations.
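The element layout described above can be illustrated with a short Python sketch that packs and walks 16-byte SGL elements. Only the field widths and the positions of the ext and eob bits are taken from the description. The field order (address, then length, then the reserved word), the little-endian byte order, the treatment of a chain element's address as the location of the next list, and all names (`pack_element`, `walk_sgl`, the `memory` dictionary standing in for system memory) are assumptions made for illustration.

```python
import struct

# Assumed order: 8-byte address, 4-byte length, 4-byte reserved/flags word.
ELEMENT = struct.Struct("<QII")
EXT_BIT = 1 << 31  # most significant bit of the reserved word (byte 15, bit 7)
EOB_BIT = 1 << 30  # the bit immediately to its right (byte 15, bit 6)


def pack_element(address, length, ext=False, eob=False):
    """Encode one 16-byte SGL element."""
    flags = (EXT_BIT if ext else 0) | (EOB_BIT if eob else 0)
    return ELEMENT.pack(address, length, flags)


def walk_sgl(memory, head):
    """Yield (address, length) for every data fragment reachable from `head`.

    `memory` is a hypothetical dict mapping a list's address to its raw bytes.
    """
    raw, offset = memory[head], 0
    while True:
        address, length, flags = ELEMENT.unpack_from(raw, offset)
        if flags & EXT_BIT:
            # Chain element: continue in the list stored at `address`.
            raw, offset = memory[address], 0
            continue
        if length:  # a zero length means "no data", not "end of list"
            yield address, length
        if flags & EOB_BIT:
            return
        offset += ELEMENT.size


memory = {
    0x1000: pack_element(0xA000, 512)
    + pack_element(0xB000, 0)
    + pack_element(0x2000, 0, ext=True),
    0x2000: pack_element(0xC000, 256, eob=True),
}
fragments = list(walk_sgl(memory, 0x1000))
```

In this example the zero-length element is skipped, the third element chains to the list at the assumed address `0x2000`, and the walk stops at the element carrying the eob bit, so `fragments` holds the two data fragments of 512 and 256 bytes.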
A DMA structure supporting SGL is a common feature of storage controllers and high performance network interface cards. High-end storage controllers for Small Computer System Interface (SCSI), Serial Attached SCSI (SAS), or Fibre Channel typically support a large number of directly or indirectly attached target devices, and support a number of concurrent input/output (I/O) commands per target device. Each of the outstanding commands (e.g. SCSI I/O Read or Write) is associated with at least one pre-allocated data buffer that either holds the data to be transmitted for a Write command, or provides the space to receive the data from the execution of a Read command. From the SCSI protocol perspective, each of the data buffers is addressed linearly as data is transferred, while physically the data buffer can be fragmented into non-contiguous regions.
The SGL is typically used to represent a user data buffer that is pre-allocated for each outstanding I/O. Typically, the storage interface bus, such as a SAS link, is shared by multiple target devices when these devices are indirectly attached through expanders. As a result, the data frames from the concurrent I/Os are time interleaved over a physical bus interface, each frame representing a portion of data belonging to a larger I/O. To deliver the data into the appropriate buffer associated with the I/O, the DMA engine needs to switch context from one SGL to another at the boundary of frame sequences representing different I/Os. This requirement of context switching between partial transfers among different SGLs imposes significant challenges on the DMA design, as the DMA needs to track the current position of transfer at each SGL.
As noted before, physically, a data buffer is organized as a sequence of buffer fragments, as denoted by SGL. There are several reasons why the data buffers need to be fragmented.
Page fragments: The first reason is virtual memory management in the host CPU and operating system. Modern CPUs support virtual memory via a Memory Management Unit (MMU), which utilizes a hierarchy of segment and/or page tables to map a logically contiguous user memory space for each process into the physical memory hierarchy, for protection of one user space from another, and to provide a linear view of memory from each user process. This also allows the logical memory space to be much larger than the actual physical main memory space by swapping a region of logical memory that is currently not in use out to much larger disk swap space. Before a data buffer can be used as a DMA data buffer, typically, the application layer allocates a data buffer in virtual address space, and the kernel or device driver then page locks the virtual address buffer to ensure that the entire buffer is loaded and fixed in physical main memory space (no swapping to disk). Since the virtual to physical address translation is done based on MMU pages (e.g. 4 KB of physical memory aligned on a 4 KB address boundary), the virtual buffer is now mapped into a sequence of physical pages, each page being uniform in size and alignment, that can be represented by an SGL. However, since the virtual address buffer can start at arbitrary byte address granularity, the first byte of the virtual address buffer can start from an arbitrary byte offset of a physical page. In other words, the SGL represents a sequence of uniform size pages that are page aligned, except that the first fragment can start at an arbitrary byte offset of a page, and the last fragment can end at an arbitrary byte offset of another page.
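The shape of a page-fragment SGL can be sketched as follows. The function name `page_fragments`, its parameters, and the example physical page addresses are hypothetical. The sketch assumes the buffer has already been page locked and that the list of backing physical pages is known.

```python
PAGE_SIZE = 4096  # 4 KB MMU page, as in the example in the text


def page_fragments(physical_pages, first_offset, length):
    """Split a pinned buffer into (physical_address, length) fragments.

    physical_pages: page-aligned physical addresses backing the buffer, in order.
    first_offset:   byte offset of the buffer start within the first page.
    length:         total buffer length in bytes.
    """
    fragments = []
    offset = first_offset
    for page in physical_pages:
        if length == 0:
            break
        chunk = min(PAGE_SIZE - offset, length)
        fragments.append((page + offset, chunk))
        length -= chunk
        offset = 0  # every page after the first starts at its boundary
    if length:
        raise ValueError("not enough pages for the requested length")
    return fragments


sgl = page_fragments([0x7000, 0x3000, 0x9000], first_offset=0x100, length=9000)
```

For this 9000-byte buffer the first fragment starts 0x100 bytes into its page and covers the remaining 3840 bytes, the middle fragment is one full aligned page, and the last fragment covers only the first 1064 bytes of its page.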
Arbitrary fragments: The second form of buffer fragment can be much more constraint-free. This is often caused by an application directly using arbitrarily arranged fragments (with no size or alignment constraints) in the user space (either virtual memory or physical memory space) and using these as an I/O buffer. For example, a modern operating system (OS) supports a file system or I/O subsystem Application Programming Interface (API) that accepts an SGL as a buffer argument for disk I/Os. The purpose is to minimize unnecessary memory movement in software. For example, suppose a user program wants to write some data fields from various data structures into a file. Instead of allocating a contiguous data buffer in the virtual address space as a temporary workspace to copy all the necessary fields before issuing the I/O from the workspace buffer, the user program creates an SGL with each entry pointing to the direct location of the necessary data structure fields to be written, and then issues a write I/O operation to the file system using the SGL as the argument representing the I/O buffer. This creates an I/O operation using an arbitrary SGL, with the benefit of eliminating the extra step of managing the workspace buffer and the data movement between data structure and workspace.
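A software analogue of such a gather write is the POSIX `writev` call, which Python exposes as `os.writev` on Unix-like systems. The sketch below is offered only as an illustration of the idea. The field names and contents are invented, and it is not suggested that this is the specific API the description has in mind.

```python
import os
import tempfile

# Fields that live in separate objects; no contiguous staging buffer is built.
header = b"HDR1"
record_id = (42).to_bytes(4, "little")
payload = bytearray(b"payload-bytes")

# The gather list is an ordered list of buffers of arbitrary size and alignment.
gather_list = [header, record_id, memoryview(payload)[:7]]

fd, path = tempfile.mkstemp()
try:
    written = os.writev(fd, gather_list)  # one call, several source fragments
    os.lseek(fd, 0, os.SEEK_SET)
    contents = os.read(fd, written)
finally:
    os.close(fd)
    os.remove(path)
```

The three fragments have lengths 4, 4 and 7 bytes, with no common size or alignment, which is the property that distinguishes arbitrary fragments from page fragments.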
There are a number of well-known DMA techniques that suffer from the following disadvantages.
DMA addressing: The majority of known DMA techniques operate in physical address space. This means the requestor of a DMA operation specifies a DMA request using physical addresses, or an SGL that contains physical address information for each DMA operation. This approach is quite intuitive and simple when handling data movement in contiguous data buffers. However, when the DMA operation needs to do context switching between partial transfers using different SGLs, the use of physical addressing places a significant burden on the DMA master (requestor). To enable the DMA to resume data transfer on a partial SGL buffer, the DMA needs to save a considerable amount of information in the SGL partial transfer context, including: the current pointer in the SGL, the head pointer to the SGL, the current fragment physical address, and the remaining byte count within the current fragment. Such context needs to be managed on a per concurrent SGL basis. When the DMA resumes data transfer on an SGL buffer, the DMA needs to reload the partial context to allow proper physical address calculation. The SGL partial context not only adds significant complexity to both the DMA engine and the DMA master, but also adds cost for the context storage, and reduces the performance of the DMA engine because of the extra processing step involved in context management. This problem can be particularly severe in a storage controller application that needs to support a large number of concurrent I/Os (SGLs) that are time interleaved over the physical bus.
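The per-SGL partial transfer context listed above can be modelled with a small Python sketch. The class name `SglContext`, its fields, the I/O labels, and the frame sizes are hypothetical. Each SGL is represented simply as a list of (physical address, length) pairs rather than in any particular hardware format.

```python
from dataclasses import dataclass


@dataclass
class SglContext:
    sgl: list            # head of the list: [(physical_address, length), ...]
    index: int = 0       # current element in the list
    address: int = 0     # current physical address within the fragment
    remaining: int = 0   # bytes left in the current fragment

    def __post_init__(self):
        self._load()

    def _load(self):
        if self.index < len(self.sgl):
            self.address, self.remaining = self.sgl[self.index]

    def transfer(self, nbytes):
        """Consume `nbytes` and return the physical (address, length) bursts."""
        bursts = []
        while nbytes:
            if self.index >= len(self.sgl):
                raise ValueError("request runs past the end of the SGL")
            if self.remaining == 0:
                self.index += 1
                self._load()
                continue
            chunk = min(nbytes, self.remaining)
            bursts.append((self.address, chunk))
            self.address += chunk
            self.remaining -= chunk
            nbytes -= chunk
        return bursts


# Two I/Os whose frames arrive interleaved on the same link.
contexts = {
    "io_a": SglContext([(0x1000, 300), (0x8000, 700)]),
    "io_b": SglContext([(0x4000, 512)]),
}
frames = [("io_a", 256), ("io_b", 256), ("io_a", 256), ("io_b", 256)]
log = [(io, contexts[io].transfer(size)) for io, size in frames]
```

Every frame forces a lookup of the saved context for its I/O, and the second frame of `io_a` straddles a fragment boundary, so it is split into two bursts. One such context is needed per concurrent SGL, which is the storage and management cost the paragraph describes.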
There are some DMA methods that support data transfer based on virtual addresses. This approach utilizes an address mapping structure analogous to a CPU MMU. A Translation Lookaside Buffer (TLB) structure is used to implement a virtual address to physical address translation scheme. This approach is well suited to the restricted class of SGL buffers described above as “page fragments”. However, because of the page index based lookup structure, this approach can only handle uniform size buffer fragments. Therefore, it cannot support “arbitrary fragments” that have no restrictions on the alignment and size of each buffer fragment.
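A page index based lookup of this kind can be sketched in a few lines. The 4 KB page size, the flat `page_table` list, and the function name `translate` are assumptions for illustration, not a description of any particular DMA engine.

```python
PAGE_SHIFT = 12                    # assumed 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate(page_table, virtual_address):
    """Translate by page index; valid only if every fragment is one aligned page."""
    page_index = virtual_address >> PAGE_SHIFT
    return page_table[page_index] | (virtual_address & PAGE_MASK)


page_table = [0x7000, 0x3000, 0x9000]     # physical page backing each virtual page
physical = translate(page_table, 0x1234)  # page index 1, offset 0x234
```

The index is obtained with a fixed shift, which works only because all fragments share one size and alignment. For a list whose fragments are, say, 300 and 700 bytes long, no fixed shift identifies the fragment that contains a given offset, which is why this scheme does not extend to arbitrary fragments.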
Due to the complexity of the SGLs involved, known DMA structures have various degrees of difficulty in supporting time interleaved partial sequential transfers with multiple SGLs, and/or random partial transfers using an SGL. It is worth noting that random partial transfers with an SGL, although rare, are a necessary function to support modern storage protocols, such as SAS, that generate requests that can move the current position within an SGL buffer to a random offset (most likely backwards) while handling transport layer retry conditions.
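To make the cost of a random partial transfer concrete, the following sketch maps a logical offset within an SGL buffer to a fragment and physical address. It precomputes the starting offset of every fragment, which is one possible technique and not one attributed to any known DMA design. The names `build_offsets` and `locate` and the example addresses are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(sgl):
    """Return the logical start offset of each fragment, plus the total length."""
    return [0] + list(accumulate(length for _, length in sgl))


def locate(sgl, offsets, logical_offset):
    """Map a logical buffer offset to (fragment_index, physical_address)."""
    if not 0 <= logical_offset < offsets[-1]:
        raise ValueError("offset outside the buffer")
    # Last fragment whose start is <= the offset; this skips empty fragments.
    index = bisect_right(offsets, logical_offset) - 1
    address, _ = sgl[index]
    return index, address + (logical_offset - offsets[index])


sgl = [(0x1000, 300), (0x5000, 0), (0x8000, 700)]
offsets = build_offsets(sgl)
forward = locate(sgl, offsets, 300)  # first byte of the 700-byte fragment
retry = locate(sgl, offsets, 100)    # position moved backwards for a retry
```

Without such an index, moving the position backwards requires walking the list again from its head, because the saved context of a sequential transfer only describes the current fragment.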
Concurrent data transfers and request queue organization: Known DMA structures typically sit on an arbitrated system bus, which connects multiple bus masters to slaves such as the memory controller that provides access to main system memory. The DMA being a bus master can arbitrate for access of the slave (i.e. the main memory) and when the access request is granted, the DMA generates bus transactions to perform memory read or write operations. When there are multiple slave memory spaces, such as off-chip main memory space connected through a memory controller, Peripheral Component Interconnect (PCI) host memory space connected through a PCI controller, and on-chip memory space, these memory spaces are treated as independent system bus slave devices that the DMA can access through the system bus interface.
While the independent memory interfaces can operate in parallel, known DMA structures and system bus interconnects limit the concurrency of these memory spaces due to a number of common architectural characteristics that cause a lack of concurrent switching within the DMA datapath. For example, a shared system bus limits the transactions to one master-slave pair at any time. As a result, when the DMA is accessing one memory interface, it cannot transfer data with a different memory interface. In another example, with a non-blocking switch based system bus interconnect, the DMA only occupies one physical port of the system bus switch. In this case, even though the system bus allows multiple masters to access multiple slaves in a non-colliding traffic pattern, the DMA cannot transfer data with two independent slaves (memory spaces) simultaneously, because the DMA is connected to the system bus switch through one shared physical port for accessing all of the memory spaces.
Another common architectural characteristic is a shared request queue structure. Known DMA approaches tend to use common First Come First Serve (FCFS) request queues that are shared by data transfers in all directions, wherein the direction of a transfer is defined by the source memory space-destination memory space pair. Even though many DMA structures support multiple queue organizations based on priority or type of transfer, the lack of segregation of request queues based on direction of data movement fundamentally limits the parallelism of data transfer because of the Head of Line (HOL) blocking issue. Consequently, such DMA engines cannot fully utilize the parallel bandwidth of the physical memory spaces. For example, suppose request A wants to move a piece of data from PCI to Double Data Rate (DDR) memory, while request B wants to move another piece of data from internal memory to PCI. Even though the physical memory spaces (PCI interface read, DDR write, internal Random Access Memory (RAM) read, PCI interface write) can support the parallel execution of transfers A and B, when A and B are posted into a common queue in sequence, the two transfers will take place sequentially. This results in idle time on the memory bus interfaces at various stages, which in turn means lower system throughput, longer processing time for a given task, and more wasted bandwidth on the memory and external interfaces.
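The effect of the shared queue in the example of requests A and B can be estimated with a deliberately simple timing model. The model assumes that a shared FCFS queue runs transfers strictly one after another, that queues segregated by direction run fully in parallel, and that the durations are arbitrary time units. These are illustrative assumptions and not measurements of any DMA engine. The function names are hypothetical.

```python
from collections import defaultdict


def makespan_shared(requests):
    """One FCFS queue: each transfer starts only after the previous one ends."""
    return sum(duration for _, _, duration in requests)


def makespan_per_direction(requests):
    """One queue per (source, destination) pair; queues are assumed independent."""
    busy = defaultdict(int)
    for source, destination, duration in requests:
        busy[(source, destination)] += duration
    return max(busy.values(), default=0)


requests = [
    ("PCI", "DDR", 10),       # request A
    ("internal", "PCI", 10),  # request B
]
shared = makespan_shared(requests)
segregated = makespan_per_direction(requests)
```

Under these assumptions the shared queue needs 20 time units to finish both requests, while queues segregated by direction finish in 10, because A and B use disjoint memory interfaces.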
SGL caching: Known DMA engines that handle SGL require the DMA engine, or the DMA master/requestor to keep track of the SGL context for each list, including the pointer to the current SGL entry, the current offset within the SGL fragment, the pointer to the head of the SGL, etc. Or, alternatively, for prior art architectures that do not keep SGL context, the DMA engine is required to perform full SGL traversal for each DMA transfer using SGL. The first approach not only adds the cost of context storage on a per SGL list basis, but also adds significant complexity to the DMA master for the interpretation of SGL format, SGL traversal, context maintenance and manipulation.
Internal switch—Virtual Output Queuing (VOQ): Known DMA engines use a combination of a VOQ buffer and crossbar switch with VOQ arbiter for achieving non-blocking data transfer between input and output ports of the crossbar. The application of known crossbar arbitration techniques requires the data transfers to be divided into fixed time slots, corresponding to fixed data cell sizes, so that all ports can operate in lockstep based on a fixed time scale. Due to speed differences among the different memory spaces, applying fixed time slot techniques requires a certain amount of output buffer to be reserved for rate adaptation, and for adaptation between different native burst sizes.
Port trunking: Known DMA engine throughput is limited to the speed of the individual physical port of the memory interface. There is no known DMA method that can increase the data throughput by striping data across multiple physical ports to the same memory space while preserving the ordering of DMA operations and indications.
Hole Insertion/Removal: Known DMA engines lack the capability to insert or remove holes within the data stream based on pre-defined fixed spacing between the adjacent holes and the pre-defined gap size of the hole. Such a feature can be useful for handling Data Protection Information (DPI) which requires the insertion of a checksum and tags on a per sector basis.
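Hole insertion and removal with a fixed spacing and a fixed gap size can be sketched as follows. The function names, the zero fill value, and the 512-byte sector with an 8-byte gap used in the example are assumptions for illustration. The sketch only reserves the space and does not compute any checksum or tag.

```python
def insert_holes(data, spacing, gap, fill=b"\x00"):
    """After every full `spacing` bytes of data, insert a `gap`-byte hole."""
    out = bytearray()
    for start in range(0, len(data), spacing):
        block = data[start:start + spacing]
        out += block
        if len(block) == spacing:
            out += fill * gap
    return bytes(out)


def remove_holes(stream, spacing, gap):
    """Inverse of insert_holes: drop the `gap` bytes that follow each block."""
    out = bytearray()
    for start in range(0, len(stream), spacing + gap):
        out += stream[start:start + spacing]
    return bytes(out)


sectors = bytes(range(256)) * 4                   # two 512-byte sectors
with_holes = insert_holes(sectors, spacing=512, gap=8)
restored = remove_holes(with_holes, spacing=512, gap=8)
```

In a DPI use case the 8-byte hole after each sector would subsequently be filled with the per-sector checksum and tags, and hole removal would strip them again on the return path.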
Endianness transformation: Known DMAs operate on a consistent bus endianness format. Hence, they are incapable of transferring data between buses with different width and endianness definitions. A system where such a requirement exists would be, for example, a System On Chip (SOC) having a big-endian 32-bit CPU that needs to transfer a block of data to a PCI space that organizes data in 64-bit little-endian format.
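One possible interpretation of the transformation in this example is sketched below. The pairing rule (two consecutive 32-bit words form one 64-bit word, with the earlier word in the low half) is an assumption, since the description does not specify how words are combined. The function name `be32_to_le64` is hypothetical.

```python
import struct


def be32_to_le64(data):
    """Re-encode big-endian 32-bit words as little-endian 64-bit words.

    Assumption: two consecutive 32-bit words form one 64-bit word, with the
    earlier word placed in the low half.
    """
    if len(data) % 8:
        raise ValueError("length must be a multiple of 8 bytes")
    words = struct.unpack(f">{len(data) // 4}I", data)
    pairs = [(high << 32) | low for low, high in zip(words[0::2], words[1::2])]
    return struct.pack(f"<{len(pairs)}Q", *pairs)


source = bytes.fromhex("11223344" "aabbccdd")
converted = be32_to_le64(source)
```

Under this pairing rule the result is equivalent to reversing the bytes within each 32-bit word. A different pairing rule would produce a different byte order, so the rule has to be fixed by the bus definitions of the actual system.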
Descriptor pipelining to accommodate very long bus latency: Known DMAs process one DMA descriptor at a time. Some designs pre-fetch the next DMA descriptor while the current DMA descriptor is in progress, to overlap the time of descriptor fetching and the DMA transfer. Such designs with single or dual descriptors in the processing pipeline are sufficient to achieve high system throughput when the latency for fetching a descriptor is low compared to the processing time for the actual DMA transfer. However, for systems where the DMA workload is dominated by small transfers (each moving a small number of bytes) and the bus latency for descriptor fetching is high, the throughput declines because the DMA incurs idle time waiting for descriptor fetching due to the long latency. To achieve high throughput in high latency systems for small DMA transfers, novel architecture enhancements are necessary.
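The throughput argument can be made concrete with an idealized steady-state model. The model assumes that every descriptor costs a fixed fetch latency followed by a fixed transfer time, that the data path handles one transfer at a time, and that up to `depth` descriptors can be in flight at once. The function name and the numbers are illustrative assumptions only.

```python
def descriptors_per_unit_time(fetch_latency, transfer_time, depth):
    """Steady-state completion rate with `depth` descriptors in flight."""
    data_path_limit = 1 / transfer_time
    pipeline_limit = depth / (fetch_latency + transfer_time)
    return min(data_path_limit, pipeline_limit)


# Small transfers (1 time unit) behind a long descriptor fetch (9 time units).
single = descriptors_per_unit_time(fetch_latency=9, transfer_time=1, depth=1)
dual = descriptors_per_unit_time(fetch_latency=9, transfer_time=1, depth=2)
deep = descriptors_per_unit_time(fetch_latency=9, transfer_time=1, depth=16)
```

In this model a single or dual descriptor pipeline leaves the data path idle for most of the time when the fetch latency dominates, and the data path is only kept busy once the number of descriptors in flight is at least the ratio of total descriptor time to transfer time.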
DMA Bypass Mode: Known DMA controllers do not support DMA transfer where the descriptor is fetched and written back immediately without transferring data from source node to sink node. This feature could be useful in system level performance analysis.
It is, therefore, desirable to provide an improved DMA approach that overcomes one or more of the disadvantages of current DMA approaches.