A direct memory access (DMA) engine moves a block of data from a source device to a destination device independently of CPU control. Each DMA transfer is configured using a descriptor, which normally contains the source address, the destination address, a number of control parameters and a link to the next transfer for the DMA engine to process once the current transfer is complete. A DMA engine is usually made up of a number of independent contexts that process transfers in parallel, each context having dedicated hardware. The descriptor can be fetched from any memory the DMA engine has access to, which can be local dedicated memory or external off-chip memory.
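As an illustration only, the descriptor and its link field can be sketched in C. The field names, the field widths and the `dma_run_chain` helper are assumptions made for this sketch and do not describe any particular DMA engine; the copy loop is a software stand-in for what one hardware context would do.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor layout; real engines define their own fields. */
struct dma_descriptor {
    const uint8_t *src;                 /* source address */
    uint8_t *dst;                       /* destination address */
    size_t length;                      /* number of bytes to move */
    uint32_t control;                   /* control parameters (unused in this sketch) */
    const struct dma_descriptor *next;  /* link to the next transfer, or NULL */
};

/* Software model of one context walking a descriptor chain.
 * Returns the number of descriptors processed. */
static size_t dma_run_chain(const struct dma_descriptor *d)
{
    size_t count = 0;
    while (d != NULL) {
        /* A real engine would move the block in hardware, without the CPU. */
        memcpy(d->dst, d->src, d->length);
        d = d->next;  /* follow the link once the current transfer is complete */
        count++;
    }
    return count;
}
```

A chain is built by pointing the `next` field of one descriptor at the following descriptor and terminating the last one with `NULL`.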
To facilitate the transfer of data to or from a target device (the destination or source device respectively), DMA engines have a throttle mechanism to automatically stall the transfer until the target is ready to receive or send data. This is usually implemented using a “sync” wire between the target and the DMA engine to indicate the ready status of a first-in-first-out (FIFO) buffer of the target.
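The throttle mechanism can be sketched as a small cycle-based model in C. This is a simplified simulation under stated assumptions, not a description of real hardware: the `target_fifo` structure, the one-word-per-cycle write rate and the fixed `drain_period` are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the target's FIFO. */
struct target_fifo {
    size_t capacity;        /* number of words the FIFO can hold */
    size_t level;           /* number of words currently held */
    unsigned drain_period;  /* target consumes one word every drain_period cycles */
};

/* The "sync" wire: asserted while the FIFO has room for another word. */
static bool sync_ready(const struct target_fifo *f)
{
    return f->level < f->capacity;
}

/* Writes `words` words to the target, stalling whenever sync is not ready.
 * Returns the total number of cycles taken and reports the stalled cycles. */
static unsigned dma_write_words(struct target_fifo *f, size_t words,
                                unsigned *stall_cycles)
{
    unsigned cycle = 0;

    assert(f->drain_period != 0);
    *stall_cycles = 0;
    while (words > 0) {
        cycle++;
        /* The target drains one word on every drain_period-th cycle. */
        if (f->level > 0 && cycle % f->drain_period == 0)
            f->level--;
        if (sync_ready(f)) {
            f->level++;         /* one word transferred this cycle */
            words--;
        } else {
            (*stall_cycles)++;  /* throttled: wait for the target */
        }
    }
    return cycle;
}
```

With a two-word FIFO that drains one word every four cycles, writing four words in this model takes eight cycles, four of which are stalls.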
When designing a system on chip, the timing between the target raising the sync wire to throttle the data transfer and the DMA engine taking notice is well defined.
However, timings are less tightly defined when a DMA engine transfers data off-chip, i.e. to an external device. To compensate for this, the DMA engine may be configured to wait for a certain predetermined time after a transfer before sampling the sync wire, thus delaying the point at which a subsequent transfer can be performed. This delay time is sometimes referred to as the “completion delay”. A number of techniques have been used to ensure that this completion delay is long enough to cope with variable or unpredictable off-chip timings.
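The role of the completion delay can be illustrated with a minimal C sketch. It assumes, purely for illustration, that the external target's sync wire keeps showing a stale "ready" value for `sync_latency` cycles after a transfer that has filled its FIFO; the structure and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical external target: its sync wire only reflects the effect of
 * the last transfer after sync_latency cycles. */
struct ext_target {
    unsigned sync_latency;
};

/* Value of the sync wire as seen by the engine some cycles after a transfer
 * that filled the target's FIFO: true means "ready", which is stale until
 * the update has propagated. */
static bool observed_sync(const struct ext_target *t,
                          unsigned cycles_since_transfer)
{
    return cycles_since_transfer < t->sync_latency;
}

/* The engine waits for completion_delay cycles, then samples the sync wire.
 * Returns true if it would start the next transfer. */
static bool next_transfer_starts(const struct ext_target *t,
                                 unsigned completion_delay)
{
    return observed_sync(t, completion_delay);
}
```

In this model, a completion delay shorter than the target's latency lets the engine act on the stale value and start a transfer the target cannot accept, whereas a delay at least as long as the latency makes the engine see the true "not ready" state and stall.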
One way is to make the completion delay long enough that all possible off-chip transfers are safe. This naturally has a performance impact on transfers for which the delay actually required is less than the theoretical maximum, i.e. the delay will often be needlessly long.
Another way is to provide a range of different completion delays each for a different DMA context, and assign transfers to contexts accordingly. However, this reduces flexibility because the allocation of transfers to contexts is restricted.
Another way is to allow each context to have a programmable completion delay. However, all transfers on a given context will still have the same delay unless reprogrammed (which requires extra processor cycles). This therefore reduces flexibility and performance of chains of transfers to multiple targets with varying delays.
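The performance trade-off between a single worst-case delay and a reprogrammable per-context delay can be made concrete with some simple arithmetic, sketched here in C. The cycle counts, the `reprogram_cost` parameter and the assumed reset value of zero for the delay register are hypothetical and chosen only for illustration.

```c
#include <stddef.h>

/* Waiting cycles if every transfer uses one fixed worst-case delay. */
static unsigned wait_fixed(size_t n, unsigned worst_case)
{
    return (unsigned)n * worst_case;
}

/* Waiting cycles if each transfer could use exactly the delay it requires. */
static unsigned wait_ideal(const unsigned *required, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += required[i];
    return total;
}

/* Waiting cycles on one context with a programmable delay: the processor
 * pays reprogram_cost cycles whenever the required delay differs from the
 * value currently programmed (assumed to be 0 at reset). */
static unsigned wait_reprogrammed(const unsigned *required, size_t n,
                                  unsigned reprogram_cost)
{
    unsigned total = 0;
    unsigned programmed = 0;
    for (size_t i = 0; i < n; i++) {
        if (required[i] != programmed) {
            total += reprogram_cost;
            programmed = required[i];
        }
        total += required[i];
    }
    return total;
}
```

For a chain of four transfers alternating between targets that require delays of 2 and 20 cycles, with an assumed reprogramming cost of 5 cycles, the fixed worst-case scheme waits 80 cycles and the reprogrammed scheme waits 64 cycles, against 44 cycles if each transfer could simply carry its own delay.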
It would be desirable to provide a more flexible DMA mechanism for controlling the completion delay of off-chip transfers, without incurring as much of a performance cost or indeed any performance cost at all.