In typical embedded systems, which are special-purpose computer systems designed to perform one or a few dedicated functions, Host Processor utilization is a critical parameter in determining overall system performance. When the Host Processor spends cycles executing data movement instructions, overall system performance is greatly degraded. In order to relieve the Host Processor of this penalty and improve system performance, hardware-based direct memory access (DMA) components were introduced.
A DMA controller is a feature of modern computers that allows certain hardware subsystems within the computer to access system memory for reading and/or writing, independent of the processor. Many hardware systems use DMA, including disk drive controllers, graphics cards, network cards, and sound cards. Computers that have DMA channels may transfer data to and from devices with much less processor overhead than computers without a DMA channel.
Without DMA, using programmed input/output (“PIO”) mode, the processor is typically fully occupied for the entire duration of the read or write operation and is thus unavailable to perform other work. With DMA, the processor initiates the transfer, performs other operations while the transfer is in progress, and receives an interrupt from the DMA controller once the operation has been completed. This is especially useful in real-time computing applications where avoidance of stalling behind concurrent operations is critical.
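The difference between the two modes can be illustrated with a minimal cycle-accounting sketch. It is a simplified model rather than a description of any particular controller: the names `pio_copy` and `dma_copy` and the setup, interrupt, and per-word cycle costs are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CycleCount:
    cpu_busy: int = 0  # cycles the processor spends on the transfer itself
    cpu_free: int = 0  # cycles available for other work while data moves


def pio_copy(src, dst, cycles_per_word=1):
    """PIO model: the processor moves every word itself (assumed cost per word)."""
    count = CycleCount()
    for i, word in enumerate(src):
        dst[i] = word
        count.cpu_busy += cycles_per_word
    return count


def dma_copy(src, dst, setup_cycles=4, irq_cycles=2, cycles_per_word=1):
    """DMA model: the processor programs the transfer, then is free until the interrupt."""
    count = CycleCount(cpu_busy=setup_cycles)   # processor initiates the transfer
    dst[:len(src)] = src                        # words moved by the modelled DMA controller
    count.cpu_free = len(src) * cycles_per_word # processor can do other work meanwhile
    count.cpu_busy += irq_cycles                # processor services the completion interrupt
    return count
```

Under these assumed costs, the processor time spent in PIO mode grows with the transfer length, whereas in DMA mode it stays at the fixed setup and interrupt overhead.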
The offload capability provided by traditional DMA components has helped Host Processors meet the system performance requirements of some applications. However, with the advent of more complex applications that require support for multiple data formats, such traditional DMA components fall short of offering sufficient offload capability. For example, when source video data in YUV format that is stored in on-chip memory needs to be transferred to a display engine that requires the data in RGB format, the Host Processor must first read the data, execute instructions to do the format conversion, store the data back to on-chip memory, and finally initiate the DMA operation to transfer the data to the display engine. This flow is shown in the top half of FIG. 4. Because the fixed-function DMA component cannot convert the data itself, the data is moved three times (read by the Host Processor, written back to on-chip memory, and transferred by DMA) rather than once, so the efficiency of data movement is one-third the target.
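The host-driven flow described above can be sketched as follows. The sketch is illustrative only. The source does not specify a conversion matrix, so the full-range BT.601 (JFIF) coefficients are an assumption, and the function names `yuv_to_rgb` and `host_converted_transfer` are hypothetical. The counter simply tallies how many times the data is moved.

```python
def clamp8(x):
    """Round to the nearest integer and clamp to the 8-bit range 0..255."""
    return max(0, min(255, int(round(x))))


def yuv_to_rgb(y, u, v):
    """Convert one 8-bit YUV pixel to RGB (assumed full-range BT.601 coefficients)."""
    d, e = u - 128, v - 128
    r = y + 1.402 * e
    g = y - 0.344136 * d - 0.714136 * e
    b = y + 1.772 * d
    return clamp8(r), clamp8(g), clamp8(b)


def host_converted_transfer(yuv_pixels):
    """Model the fixed-function flow and count how many times the data is moved."""
    moves = 0
    staged = list(yuv_pixels)                   # 1: Host Processor reads from on-chip memory
    moves += 1
    rgb = [yuv_to_rgb(*p) for p in staged]      # Host Processor converts the format
    on_chip = list(rgb)                         # 2: Host Processor stores the result back
    moves += 1
    display = list(on_chip)                     # 3: DMA transfers to the display engine
    moves += 1
    return display, moves
```

In this model only the third movement delivers data to the display engine, which is consistent with the one-third efficiency figure stated above.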
One solution to address the above problem is adding data format conversion functionality to the external device. The primary drawback of this solution is an increase in total solution cost.
A second solution is to have the Host Processor run another instruction stream to do the data format conversion, and thereafter store the converted data to the on-chip memory. This wastes many Host Processor cycles, which in turn degrades overall system performance, especially for compute-intensive data conversion operations such as YUV/RGB conversion and encryption/decryption.
A third solution that addresses the above drawbacks is to increase the flexibility of the DMA engine. Adding programmable capability to the current DMA component will offer the best tradeoff between cost and overall performance. However, typical execution of an instruction stream is sequential, which means such programmable capability will degrade bus utilization when data movement and data computation instructions are mixed together. When the computation engine inside such an enhanced DMA component is running, the data movement operation is idle. Because a typical burst operation is thereby divided into multiple sequential operations separated by computation, the bus sits idle between them and bus utilization is degraded.
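The effect of sequential execution on bus utilization can be estimated with a simple model. This is a sketch under stated assumptions: the function name `sequential_bus_utilization` is hypothetical, and the model assumes a burst is split into equal chunks, each costing a fixed number of movement cycles followed by a fixed number of computation cycles during which the bus is idle.

```python
def sequential_bus_utilization(chunks, move_cycles, compute_cycles):
    """Fraction of cycles the bus is busy when movement and computation alternate."""
    busy = 0
    idle = 0
    for _ in range(chunks):
        busy += move_cycles      # bus active: data movement instruction executes
        idle += compute_cycles   # bus idle: computation engine runs, movement waits
    total = busy + idle
    return busy / total if total else 0.0
```

For example, with equal movement and computation costs per chunk the modelled bus is busy only half the time, while a fixed-function transfer with no computation keeps it fully utilized.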
Therefore, what is needed is a way to increase DMA component programmability and bus utilization compared to prior art fixed-function and general programmable DMA implementations, respectively.