A DMA transfer essentially copies a block of memory from one device to another. The block of memory that resides in these devices may be further subdivided into smaller chunks that may not be contiguously located. For example, a 4 MB chunk may be located as 4 separate 1 MB chunks anywhere in the memory space of the device. Therefore, some information is needed as to their physical locations so that the DMA Master (the DMA controller) can then use this information to either collect the data from these separate chunks (Gather) or write data into these separate chunks (Scatter). This is where a Scatter/Gather element comes into the picture.
A Scatter/Gather element contains the physical location of one memory chunk (also called a fragment) along with the size of the data contained in that chunk. A number of Scatter/Gather elements together can describe the locations and sizes of the chunks of memory that make up the block of data to be transferred. The format of a Scatter/Gather element can be different depending upon the application. For the purpose of uniformity, the IEEE 1212.1 compliant Scatter/Gather element, which is illustrated in FIG. 1, will be described.
As shown in FIG. 1, a typical Scatter/Gather element has the following fields: a 64-bit Address field 100 that points to the starting location of the fragment in memory; a 32-bit Length field 102 that indicates the amount of data contained in that particular fragment; a 31-bit Reserved field 104 that is set to zeroes; and a 1-bit Extension (Ext) field 106 that indicates whether this element is a pointer to the next SG element or not. This Extension field 106 is needed because the SG elements themselves may not be stored contiguously in memory. In this case, the Address field 100 of an SG element can be used to point to the location of the next SG element in the list. For such an SG element, the Length field 102 is ignored and the Ext bit 106 is set. A Scatter/Gather element may also have a Length field 102 set to all zeroes, which can mean either that the DMA controller should ignore the contents of this element and move on to the next element in the list, or that the block is empty.
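By way of illustration only, the 128-bit element described above can be modeled as follows. The byte order, the ordering of the fields in memory, and the placement of the Ext bit in the top bit of the last 32-bit word are assumptions made for this sketch; the text above fixes only the field widths. The names SGElement, SG_FORMAT and EXT_BIT are hypothetical.

```python
import struct
from dataclasses import dataclass

# Assumed layout (not specified in the text): little-endian, 64-bit Address,
# 32-bit Length, then one 32-bit word holding 31 Reserved bits plus the Ext bit.
SG_FORMAT = "<QII"
SG_SIZE = struct.calcsize(SG_FORMAT)  # 16 bytes = 128 bits
EXT_BIT = 1 << 31                     # assumed position of the Ext bit


@dataclass
class SGElement:
    address: int       # start of the fragment, or of the next SG element if ext is set
    length: int        # byte count of the fragment; ignored when ext is set
    ext: bool = False  # True when this element only links to the next SG element

    def pack(self) -> bytes:
        flags = EXT_BIT if self.ext else 0  # Reserved bits are left as zeroes
        return struct.pack(SG_FORMAT, self.address, self.length, flags)

    @classmethod
    def unpack(cls, raw: bytes) -> "SGElement":
        address, length, flags = struct.unpack(SG_FORMAT, raw)
        return cls(address, length, bool(flags & EXT_BIT))

    def is_skippable(self) -> bool:
        # A zero Length on a non-extension element carries no data to transfer.
        return not self.ext and self.length == 0
```

In this model, an extension element is simply an element whose ext flag is set and whose address names the next element rather than a data fragment.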
FIG. 2 shows how a Scatter/Gather List (also called SGL, a chained list of Scatter Gather elements) can be used to completely specify a block of memory in a device. As shown in FIG. 2, Fragments 0 through 4 are located at non-contiguous and random locations in physical memory 108 (which may reside in different memory spaces). The SGL 110 however puts all of these together by having SG elements 112 that point to the starting location of each fragment. As we traverse the list, we appear to have a contiguous logical memory block, whose total size is the combined sizes of all of the fragments. An illustration of such a logical memory block 114 is shown in FIG. 2 for illustrative purposes, though it is understood not to exist physically.
Notice in the example of FIG. 2 that the SGL 110 itself is not contiguously located in physical memory. The fifth SG element of the first set of SG elements points to the next SG element in the list by using the extension capability of the SGL. Also notice that we cannot traverse the list backwards—for example, we cannot go back to the fifth SG element once we traverse on to the sixth one, as we have no information in the sixth SG element that points back to the address of the fifth SG element.
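The forward-only traversal described above may be sketched as follows. This is a simplified model rather than the behavior of any particular controller: physical memory is represented by a dictionary mapping an element's address to an (address, length, ext) tuple, consecutive elements are assumed to be 16 bytes apart, and the function name walk_sgl is hypothetical.

```python
SG_SIZE = 16  # bytes per SG element (64 + 32 + 31 + 1 bits)


def walk_sgl(memory, head, total_bytes):
    """Yield (fragment_address, byte_count) pairs covering total_bytes.

    memory maps an element's address to an (address, length, ext) tuple and
    stands in for physical memory in this sketch.
    """
    cursor = head
    remaining = total_bytes
    while remaining > 0:
        address, length, ext = memory[cursor]
        if ext:
            cursor = address   # extension element: follow the link, Length is ignored
            continue
        cursor += SG_SIZE      # otherwise the next element is stored adjacently
        if length == 0:
            continue           # zero-length element: nothing to transfer, move on
        count = min(length, remaining)
        yield address, count
        remaining -= count
```

Note that the walker only ever moves its cursor forward or follows a link; nothing in an element lets it recover the address of the element it came from.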
The DMA controller may have a number of SGLs in memory, each corresponding to a different logical block of memory that is involved in a data transfer. Each SGL may be identified using a unique data word, also called a descriptor. Each descriptor typically contains the starting location of a particular SGL (or SGLs) in physical memory, which physical memory contains the SGL(s) (if there are multiple separate physical memories), the total size to be transferred, and other details pertaining to that particular data transfer. This way, the CPU can simply instruct the DMA controller to initiate a data transfer by giving it the descriptors. The DMA controller can then find the starting address of the first SGL using the descriptor, and then proceed to transfer data by using the information obtained from traversing the SGL.
The starting address of the SGL itself can be 64 bits (depending on the system), which could make the descriptor large. In order to conserve space on the descriptor fields, descriptor information can be stored in physically contiguous locations in memory and the descriptor itself can be used to point to this information. This memory structure is called a descriptor table. In this case, the descriptor itself can be reduced to a simple index, which can then be manipulated and then added to an offset to arrive at the location of the actual contents of the descriptor in physical memory.
FIG. 3 illustrates a scatter gather list descriptor table. For the purposes of illustration, assume that each entry in the descriptor table 116 holds only the starting address of the SGL. Each descriptor 118 is simply represented as an integer and is nothing more than an index in this case. To locate the entry in the descriptor table, the descriptor is multiplied by 8 (since each descriptor table entry is 64 bits, that is 8 bytes, wide and holds the starting address of the SGL) and an offset value (0x1000 in this case) is added to the product to arrive at the location 120 of that descriptor's contents. In the case where the descriptor value is 1, for example, we find the contents of the descriptor at memory location (1*8)+0x1000=0x1008 in physical memory 122. We can then use the contents at this memory location (0xffe0 in this case) to locate the first SG element in the SGL.
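The index arithmetic of the FIG. 3 example can be written out as a short sketch. The dictionary standing in for physical memory and the function names are hypothetical; the constants 8, 0x1000, 0x1008 and 0xffe0 are taken from the example above.

```python
ENTRY_SIZE = 8       # each descriptor table entry is 64 bits wide
TABLE_BASE = 0x1000  # offset value used in the FIG. 3 example


def descriptor_entry_address(index, base=TABLE_BASE, entry_size=ENTRY_SIZE):
    """Location of the descriptor table entry selected by a descriptor index."""
    return base + index * entry_size


def sgl_head(memory, index):
    """Starting address of the SGL, read from the selected table entry."""
    return memory[descriptor_entry_address(index)]


# Only the single entry given in the text is modeled here.
memory = {0x1008: 0xFFE0}
print(hex(descriptor_entry_address(1)))  # 0x1008
print(hex(sgl_head(memory, 1)))          # 0xffe0
```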
After the data transfer is complete, the DMA controller will interrupt the CPU to inform it of a successful transfer. The CPU may then ‘retire’ the descriptor, wherein it may re-use this particular descriptor for another DMA transfer by storing the starting address of a completely different SG list in the descriptor table. (In the example above in FIG. 3, the CPU will overwrite the contents at address 0x1008 with a value other than 0xffe0.) Until now, we have assumed only a simple descriptor (only an index) and a simple descriptor table (holding only the starting address of the SGL) for this example. In reality, the descriptors may hold many more bits that may be used to indicate other parameters in the DMA transfer.
A structure of a more complex descriptor is shown in FIG. 4 and relevant portions are described below. A Source Descriptor Index 124 (N bits wide) holds the descriptor index that is required by the controller to locate the Descriptor table for the Source of data for the transfer. Src DT Location 126 (M bits wide) bits indicate which memory space contains the Descriptor Table for the Source of the Data Transfer, such as in the case where there are multiple memories in the system. For example, there can be 3 addressable memory spaces—a DDR DRAM memory space, a PCI Host memory space and a GSM on-chip embedded memory space. This scenario can apply to each of the portions described below in relation to FIG. 4 that indicate which memory contains a certain element of interest.
Src SGL Location 128 (P bits wide) bits indicate which memory contains the Scatter Gather List for the Source of the data transfer. Dest Descriptor Index 130 (N bits wide) holds the descriptor index that is required by the controller to locate the Descriptor table for the Destination of data for the transfer. Dest DT Location 132 (M bits wide) bits indicate which memory contains the Descriptor Table for the Destination of the Data Transfer. Dest SG Location 134 (P bits wide) bits indicate which memory contains the Scatter Gather List for the Destination of the data transfer. Finally, Transfer Size 136 (Y bits wide) indicates how many total bytes are to be transferred for this particular DMA operation.
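One possible encoding of such a descriptor is sketched below. The text leaves the widths N, M, P and Y unspecified, so the widths used here (16, 2, 2 and 24 bits) and the ordering of the fields within the word are assumptions chosen only so that the example fits in 64 bits; a 2-bit location field is sufficient to select among the three memory spaces mentioned above. All names are hypothetical.

```python
# (field name, width in bits), listed from least significant bits upward.
# Widths are hypothetical; the text leaves N, M, P and Y unspecified.
FIELDS = [
    ("src_index", 16),         # Source Descriptor Index (N bits)
    ("src_dt_location", 2),    # Src DT Location (M bits)
    ("src_sgl_location", 2),   # Src SGL Location (P bits)
    ("dest_index", 16),        # Dest Descriptor Index (N bits)
    ("dest_dt_location", 2),   # Dest DT Location (M bits)
    ("dest_sgl_location", 2),  # Dest SG Location (P bits)
    ("transfer_size", 24),     # Transfer Size (Y bits)
]


def pack_descriptor(**values):
    """Pack the named fields into one integer; omitted fields default to 0."""
    word = 0
    shift = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack_descriptor(word):
    """Split a packed descriptor back into a dict of field values."""
    fields = {}
    for name, width in FIELDS:
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields
```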
Using the descriptor and the SGLs, a DMA controller (DMA Master) can transfer data to and from devices. The DMA Master will read through the descriptors, locate the SGLs and then proceed to transfer information from one device to another. Some DMA controllers may use temporary buffers that hold the data read from one device before it is written into the other device. For example, a DMA controller may choose to transfer 1 KB at a time between devices until the entire transfer is finished. It will therefore first traverse as many source device SG elements as it needs to fill up this 1 KB buffer. It will then write out this 1 KB by traversing as many destination device SG elements as it needs to hold that data. This is usually done for performance and ease of transfers.
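The buffered transfer described above may be modeled as follows. This is a deliberately simple byte-at-a-time sketch, not a description of any real controller: the source and destination memories are byte arrays, each SGL has already been reduced to a list of (address, length) fragments, and the names dma_copy and byte_addresses are hypothetical.

```python
BUFFER_SIZE = 1024  # the 1 KB temporary buffer from the example above


def byte_addresses(fragments):
    """Yield, in order, every byte address covered by (address, length) fragments."""
    for address, length in fragments:
        yield from range(address, address + length)


def dma_copy(src_mem, src_frags, dst_mem, dst_frags, total, buffer_size=BUFFER_SIZE):
    """Copy total bytes from the source fragments to the destination fragments.

    Assumes both fragment lists describe at least total bytes.
    """
    src = byte_addresses(src_frags)
    dst = byte_addresses(dst_frags)
    copied = 0
    while copied < total:
        chunk = min(buffer_size, total - copied)
        # Gather: walk as many source fragments as needed to fill the buffer.
        buffer = bytes([src_mem[next(src)] for _ in range(chunk)])
        # Scatter: walk as many destination fragments as needed to drain it.
        for byte in buffer:
            dst_mem[next(dst)] = byte
        copied += chunk
    return copied
```

The fragment boundaries on the two sides need not line up; a single buffer fill may span several source fragments and be drained into a different number of destination fragments.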
Fragment Size and Alignment:
Consider virtual memory management in a Host CPU and operating system. Modern CPUs use intelligent MMUs, which utilize a hierarchy of segment and/or page tables to map a logically contiguous user memory space for each process into the physical memory hierarchy, for protection of one user space from another, and provide a linear view of memory from each user process. Furthermore, this also allows the logical memory space to be much larger than the actual physical main memory space by swapping certain regions of logical memory that are currently not in use with much larger disk swap space.
Before a data buffer can be used as a DMA data buffer, the application layer typically allocates the buffer in virtual address space. The kernel or device driver then page-locks the virtual address buffer to ensure that the entire buffer is loaded and fixed in physical main memory space (no swapping to disk). Since the virtual to physical address translation is done based on MMU ‘pages’ (for example, 4 KB regions of physical memory that are aligned on 4 KB address boundaries), the virtual buffer is now mapped into a sequence of physical pages, each page being uniform in size and alignment, that can be represented by an SGL.
However, since the virtual address buffer can start at arbitrary byte address granularity, the first byte of the virtual address buffer can start from an arbitrary byte offset of a physical page. In other words, the SGL represents a sequence of uniform size pages that are page aligned, except for the first fragment, which can start at an arbitrary byte offset of a page, and the last fragment, which can end at an arbitrary byte offset of another page. This approach is well suited for limited SGL buffers denoted as “page fragments”, where the size and alignment of a fragment is fixed. But because of the page index based lookup structure, this approach can only handle uniform size buffer fragments, and therefore cannot support “arbitrary fragments” that have no restrictions on the alignment and the size of each buffer fragment.
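The shape of such a “page fragment” list can be illustrated with a short sketch. The page_table dictionary, which maps a virtual page number to a physical page number, is a hypothetical stand-in for the MMU translation of a page-locked buffer, and the 4096-byte page size follows the example above.

```python
PAGE_SIZE = 4096


def page_fragments(virt_start, length, page_table, page_size=PAGE_SIZE):
    """Split a pinned virtual buffer into (physical_address, byte_count) fragments.

    page_table maps virtual page number -> physical page number and is a
    hypothetical stand-in for the MMU's translation tables.
    """
    fragments = []
    addr = virt_start
    end = virt_start + length
    while addr < end:
        offset = addr % page_size                    # non-zero only for the first fragment
        count = min(page_size - offset, end - addr)  # short only for first and last fragments
        phys = page_table[addr // page_size] * page_size + offset
        fragments.append((phys, count))
        addr += count
    return fragments


# A 10000-byte buffer starting 0x234 bytes into virtual page 0x10.
table = {0x10: 7, 0x11: 3, 0x12: 9}
for phys, count in page_fragments(0x10234, 10000, table):
    print(hex(phys), count)
```

Only the first and last fragments deviate from the uniform page size; every fragment in between is exactly one aligned page, which is the restriction that rules out arbitrary fragments.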
Performance:
Assume that Scatter Gather Lists contain extension elements, which means that the DMA controller has to traverse the list for a while before getting to the next SG element that contains valid fragment information. FIG. 5 shows how a typical DMA controller may spend its time on a DMA operation (either when Reading or Writing).
As shown in FIG. 5, the Master first spends time 138 on locating the Descriptor Table to get the address of the first SG element. Once this has been obtained, the Master then traverses the SG list until it finds the first SG element that contains a data fragment (this portion of time is indicated as ‘SG frag 1’ 140 in FIG. 5). The DMA Master then transfers data to/from the fragment during time 142. When this is finished, the DMA Master then searches for the next fragment to transfer data, and thus once again traverses the SG List to find the next fragment during time 144. Once the second fragment has been found, the Master can now transfer data to/from the second fragment during time 146. Other time periods 148 and 150 represent similar searching and data transfer, which can be repeated for the required number of SG elements. As we can see, the efficiency of data transfers is affected because the Master has to traverse SG lists between data transfers in order to find fragments. In reality, the performance will be even worse, as the Master has to fetch the SG Lists of both the Source and the Destination when transferring data between them. Also note that the time taken to fetch SG elements keeps increasing as the Master has to traverse down the list, because it has to skip over n−1 SG elements to find the nth element, which further degrades performance.
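The effect on efficiency can be expressed with a very simple model of the FIG. 5 timeline: the fraction of total time spent moving data, as opposed to looking up the descriptor table or searching the SG list. The function name and the time values used below are hypothetical and are not taken from FIG. 5.

```python
def transfer_efficiency(lookup_time, search_times, transfer_times):
    """Fraction of the total time spent actually transferring data.

    lookup_time    -- time to consult the descriptor table (period 138)
    search_times   -- time spent traversing the SG list before each fragment
    transfer_times -- time spent moving data to/from each fragment
    """
    useful = sum(transfer_times)
    total = lookup_time + sum(search_times) + useful
    return useful / total


# Hypothetical time units: three fragments, searches growing as the list is walked.
print(transfer_efficiency(2, [1, 2, 3], [4, 4, 4]))
```

Under this model, any growth in the search periods lowers the ratio directly, since the transfer periods themselves do not change.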
Maintaining Context:
The majority of known DMA controllers operate in physical address space. This means the requestor of a DMA operation specifies a DMA request using physical addresses, or a scatter gather list that contains physical address information, on each DMA operation. This approach is quite intuitive and simple when handling data movement in contiguous data buffers. But when the DMA operation needs to do context switching between partial transfers using different scatter-gather lists, the use of physical addressing places a significant burden on the DMA Master (requestor). To enable the DMA to resume data transfer on a partial SGL buffer, the DMA Master needs to save a considerable amount of information in an SGL partial transfer context, including: the current pointer in the SGL, the head pointer to the SGL, the current fragment physical address, and the remaining byte count within the current fragment. Such context needs to be managed on a per concurrent SGL basis.
When the DMA resumes data transfer on an SGL buffer, the DMA Master needs to reload the partial context to allow proper physical address calculation. The SGL partial context not only adds very significant complexity to both the DMA engine and the DMA Master, but also adds cost for the context storage and reduces the performance of the DMA engine because of the extra processing step involved in context management. This problem can be particularly severe in a storage controller application that needs to support a large number of concurrent I/Os (SGLs) that are time interleaved over the physical bus.
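The per-SGL context described above may be pictured as the following record, together with a store that holds one such record per concurrent I/O. The class names, the io_id key and the assumed field sizes (three 64-bit pointers and one 32-bit count) are hypothetical and serve only to make the storage cost concrete.

```python
from dataclasses import dataclass

# Assumed sizes: three 64-bit addresses plus one 32-bit byte count.
CONTEXT_BYTES = 8 + 8 + 8 + 4


@dataclass
class SGLPartialContext:
    sgl_head: int            # head pointer to the SGL
    sgl_current: int         # current pointer in the SGL
    fragment_address: int    # current fragment physical address
    fragment_remaining: int  # remaining byte count within the current fragment


class ContextStore:
    """Holds one partial context per concurrent SGL, keyed by a hypothetical I/O id."""

    def __init__(self):
        self._contexts = {}

    def save(self, io_id, context):
        self._contexts[io_id] = context

    def resume(self, io_id):
        # The context must be reloaded before any address can be computed.
        return self._contexts.pop(io_id)

    def storage_bytes(self):
        return len(self._contexts) * CONTEXT_BYTES
```

With these assumed sizes, a controller interleaving 1024 concurrent I/Os would need 28672 bytes of context storage for the four fields alone.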
For example, assuming that the SG List contained elements each containing fragments of 1 byte (Length field=1), the Master would have the information contained in the eighth SG element during the transfer of the eighth byte of data. The Master must also keep track of the total data transferred by adding the Length fields of all the fragments in the SG elements that it had traversed so far. This should be done in order to keep track of when to stop transferring data. For example, even though the Master fetches the eighth SG element, which has a fragment of size 1 byte, it has to know that this is the eighth byte being transferred in order to keep track of the total bytes transferred. If at this time, the DMA Master had to abort this transfer and then subsequently retry it or if it had to retry starting from (for example) the seventh byte of data, it would have to traverse the SG List starting from the descriptor table, as it does not have the information required to traverse backwards (a fundamental limitation of SG Lists, as discussed earlier). This again results in a wastage of bandwidth and performance.
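The cost of such a retry can be sketched as follows. Because the list cannot be traversed backwards, locating any byte offset means starting again at the head and accumulating Length fields. The function name locate_byte is hypothetical, and the list is reduced to its Length fields for brevity.

```python
def locate_byte(lengths, byte_offset):
    """Walk from the head of the list to the fragment holding byte_offset.

    lengths is the sequence of Length fields in list order; byte_offset is
    zero-based. Returns (element_index, offset_within_fragment, elements_fetched).
    """
    passed = 0  # running total of the Length fields traversed so far
    for fetched, length in enumerate(lengths, start=1):
        if byte_offset < passed + length:
            return fetched - 1, byte_offset - passed, fetched
        passed += length
    raise ValueError("offset lies beyond the end of the list")


one_byte_fragments = [1] * 16
print(locate_byte(one_byte_fragments, 7))  # eighth byte: (7, 0, 8)
print(locate_byte(one_byte_fragments, 6))  # retry from the seventh byte: (6, 0, 7)
```

Retrying from the seventh byte after having reached the eighth therefore costs seven element fetches from the head, even though the target is only one element behind the current position.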
Error Recovery and Debug:
Most SG lists are created by drivers that run on the Host operating system. Imagine a case where a driver has a bug, wherein the transfer size is larger than the total size of the memory block contained in an SG list. The DMA Master cannot tell the end of an SG list. If it has more data to transfer, it will move on to the memory locations immediately after the last correct SG element and incorrectly assume that it is the next SG element. It would then interpret the random data in those memory locations as contents of an SG element. Two scenarios can happen in this case:
1. The DMA Master could attempt to read/write to a non-existent address. This could cause a memory error leading to a system crash.
2. The DMA Master could potentially overwrite valuable data on an existent unintended location pointed to by the false SG element, causing a system crash or other potentially fatal failures. The problem with this type of error is that the system may not immediately fail, but may fail later when it attempts to use the data that has been overwritten by the DMA Master.
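The overrun scenario can be reproduced in a small model. Here memory is a dictionary mapping an element address to an (address, length) pair, the SG list consists of two 512-byte fragments, and the entry at 0x120 represents unrelated data that happens to follow the list in memory. All addresses and values are hypothetical.

```python
SG_SIZE = 16  # bytes per SG element


def fragments_for(memory, head, transfer_size):
    """Collect (address, byte_count) fragments until transfer_size is satisfied.

    The walker has no end-of-list marker; it stops only when the requested
    number of bytes has been accounted for.
    """
    cursor, remaining, fragments = head, transfer_size, []
    while remaining > 0:
        address, length = memory[cursor]
        cursor += SG_SIZE
        count = min(length, remaining)
        fragments.append((address, count))
        remaining -= count
    return fragments


memory = {
    0x100: (0x8000, 512),
    0x110: (0x9000, 512),
    0x120: (0xDEAD0000, 4096),  # not an SG element: unrelated data after the list
}

print(fragments_for(memory, 0x100, 1024))  # correct size: two real fragments
print(fragments_for(memory, 0x100, 1536))  # buggy size: a third, bogus fragment appears
```

With the oversized transfer, the walker treats the unrelated data as a third element and would direct 512 bytes to or from address 0xDEAD0000, which corresponds to one of the two scenarios listed above depending on whether that address exists.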
It is, therefore, desirable to provide an address translation scheme and cache with a modified scatter gather element. It is also desirable to provide for approaches that address certain scenarios and provide for improved performance.