In a typical multi-core system-on-chip (SoC), all the processors share an external main memory (usually DRAM). There are also smaller on-chip local memories in the SoC. The SoC has a hardware memory subsystem that performs requested memory accesses and data transfers to and from the external memory. Processors post requests directly to the memory subsystem as they execute instructions that reference memory. The memory subsystem has a DMA engine that accepts requests for memory transfers from the processors, interfaces to the external memory and performs the requested memory accesses. The DMA engine schedules memory transfers using a deadline-based algorithm. One DMA engine services the entire SoC.
When a processor executing an application that has hard real-time deadlines requests a DMA transfer, there are two factors the programmer needs to take into account: the size of the transfer, and the time by which the transfer needs to be completed. In every system there is also a maximum data rate that is attainable when one DMA transfer is being serviced, which, together with the two aforementioned factors, determines the latest time that the processor can post the transfer request to the engine and still meet the deadline for completing a transfer:
Latest Post Time = Time transfer needs to complete − (transfer size / max data transfer rate)
Processors will in general post transfer requests earlier, sometimes much earlier, than the Latest Post Time.
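As a worked illustration of the Latest Post Time relation, the following minimal sketch evaluates it for one transfer. The function name, the units (memory clock cycles and bytes per cycle) and the example numbers are assumptions chosen for illustration only; they do not come from any particular system.

```python
def latest_post_time(completion_time, transfer_size, max_rate):
    """Latest time a request can be posted and still complete on time.

    completion_time: time by which the transfer must be complete (cycles)
    transfer_size:   size of the transfer (bytes)
    max_rate:        maximum data rate with one transfer in service
                     (bytes per cycle)
    """
    if max_rate <= 0:
        raise ValueError("max_rate must be positive")
    return completion_time - transfer_size / max_rate


# Hypothetical numbers: a 256-byte transfer due at cycle 1000 on an
# engine that sustains 8 bytes per cycle needs 32 cycles of service.
print(latest_post_time(1000, 256, 8))  # 968.0
```

Posting at any time up to cycle 968 leaves enough time for the transfer, provided the engine services nothing else in between.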
In standard DMA engine designs, the engine services the transfer requests in chronological order of posting. This scheduling policy is very likely to be close to optimal when there is just one thread of control. However, in a multi-threaded single-core SoC or in a multi-core SoC, this scheduling policy is often sub-optimal. There are in general multiple simultaneous contenders for DRAM bandwidth, and whether or not any one processor will meet its deadline may depend on whether other processors are also using the DMA engine. For example, consider a situation where one processor posts its transfer request much earlier than it needs to, and thereby precedes a second processor's posting. Suppose the second processor posts its request just in time, while the first processor's transfer is still taking place. Then the second processor's transfer cannot meet its deadline, unless it preempts the first transfer.
Clearly a DMA system design that allows deadlines to be met independently of request posting time provides advantages to the programmer and application designer on a multi-thread or multi-core SoC. In such a design any thread or processor can post transfer requests as early as desired, without affecting the ability of any transfers to meet their deadlines. It is a well-known fact that in a system where there is sufficient bandwidth for all deadlines to be met, the scheduling policy that schedules transfers in order of their deadlines will meet all deadlines. So a DMA engine system that schedules transfers in deadline order will be optimal in this respect, meeting all deadlines whenever that is possible.
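The difference between posting-order and deadline-order scheduling can be illustrated with a small cycle-by-cycle simulation. This is only a sketch under simplifying assumptions: a single engine, integer cycle times, no preemption overhead, and transfers whose service time at the maximum data rate is known in advance. The names `Transfer`, `simulate` and `missed`, and the two example transfers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    name: str
    post_time: int  # cycle at which the request is posted
    duration: int   # cycles of service needed at the maximum data rate
    deadline: int   # cycle by which the transfer must be complete


def simulate(transfers, policy):
    """Run one engine cycle by cycle; return {name: finish_cycle}.

    policy "fifo": service in order of posting, without preemption.
    policy "edf":  every cycle, service the posted, unfinished transfer
                   with the earliest deadline (preempting if needed).
    """
    remaining = {t.name: t.duration for t in transfers}
    finish = {}
    current = None
    cycle = 0
    while len(finish) < len(transfers):
        ready = [t for t in transfers
                 if t.post_time <= cycle and t.name not in finish]
        if ready:
            if policy == "edf":
                current = min(ready, key=lambda t: t.deadline)
            elif current is None or current.name in finish:
                current = min(ready, key=lambda t: t.post_time)
            remaining[current.name] -= 1
            if remaining[current.name] == 0:
                finish[current.name] = cycle + 1
        cycle += 1
    return finish


def missed(transfers, finish):
    """Names of transfers that completed after their deadline."""
    return [t.name for t in transfers if finish[t.name] > t.deadline]


# A is posted far earlier than necessary; B is posted just in time
# (its Latest Post Time is 8 - 4 = 4) while A is still in service.
demo = [Transfer("A", post_time=0, duration=10, deadline=100),
        Transfer("B", post_time=4, duration=4, deadline=8)]

fifo = simulate(demo, "fifo")
edf = simulate(demo, "edf")
print(fifo, missed(demo, fifo))  # {'A': 10, 'B': 14} ['B']
print(edf, missed(demo, edf))    # {'B': 8, 'A': 14} []
```

In this example the posting-order policy makes B miss its deadline, while the deadline-order policy preempts A at cycle 4 and meets both deadlines, even though A was posted as early as desired.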
It is possible to provide a software library that accepts transfer requests, orders them in deadline order, and then passes them on to a standard DMA hardware engine. As long as there is a hardware capability in the DMA engine for transfers to be interrupted in the middle, i.e., preempted, such a library can implement a deadline-based scheduling policy. However, the overhead of such a library is likely to be excessive when transfer sizes are small. In many multimedia applications, such as MPEG-2, MPEG-4, H.264 and JPEG video encoders/decoders, typical DMA transfers move 2-dimensional arrays of 4*4, 8*4, 4*8, 8*8 and 16*16 bytes. These transfers typically require only a few tens of memory clock cycles each, but are very numerous. The overhead for software ordering will likely exceed the memory access time. A hardware deadline-based scheduler, by contrast, has much lower ordering overhead than a software deadline-based scheduler, so with hardware the ordering of transfers can more easily be overlapped completely with memory access.
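The ordering step of such a software library can be sketched with a binary heap keyed on deadline. This is a minimal illustration, not a description of any existing library: the class `DeadlineQueue`, its methods, and the string descriptors are hypothetical, and the hand-off to the hardware engine and the preemption of a transfer already in progress are omitted.

```python
import heapq
import itertools


class DeadlineQueue:
    """Holds posted transfer requests and releases them in deadline order."""

    def __init__(self):
        self._heap = []
        # Sequence number breaks ties so that requests with equal
        # deadlines are released in order of posting.
        self._seq = itertools.count()

    def post(self, deadline, descriptor):
        """Accept a transfer request with its completion deadline."""
        heapq.heappush(self._heap, (deadline, next(self._seq), descriptor))

    def next_transfer(self):
        """Return (deadline, descriptor) of the most urgent request,
        or None if no request is pending."""
        if not self._heap:
            return None
        deadline, _, descriptor = heapq.heappop(self._heap)
        return deadline, descriptor


queue = DeadlineQueue()
queue.post(300, "16x16 block")
queue.post(120, "8x8 block")
queue.post(200, "4x4 block")

order = []
while (item := queue.next_transfer()) is not None:
    order.append(item)
print(order)  # [(120, '8x8 block'), (200, '4x4 block'), (300, '16x16 block')]
```

Each post and each release costs a number of comparisons and memory updates that grows with the logarithm of the queue length. When a transfer itself takes only a few tens of memory clock cycles, that per-request software cost can be comparable to or larger than the transfer time, which is the overhead concern described above.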
U.S. Pat. No. 5,423,020 to Vojnovich discloses a system including a DMA controller that optimizes bus and DRAM use by varying DMA packet size and looking at arrival times and buffer occupancy.
U.S. Pat. No. 5,506,969 to Wall et al. discloses a method for bus bandwidth management in which urgency information is given to the DMA engine. In one embodiment, a time-driven management policy uses shortest-deadline-first ordering. It determines whether the bus has enough bandwidth to meet all deadlines and either (a) orders transfers by deadline, when possible, or (b) defers lower-priority requests, when the schedule cannot be met.
U.S. Pat. No. 5,548,793 to Sprague et al. discloses a system for controlling arbitration using memory request signal types representing requests of different priorities from different processors.
U.S. Pat. No. 5,787,482 to Chen et al. discloses a deadline-based scheduler for disk drives (rather than memory chips) that trades off maximum throughput from the disk against servicing requests in the order that the applications want them (i.e., deadline order). A heuristic approach for scheduling assumes that an application can assign simple deadlines and does not face issues where deadline is not the only factor in assigning a priority.
U.S. Pat. No. 5,812,799 to Zuravleff et al. discloses a non-blocking load buffer (NB buffer) and a multiple-priority memory system for real-time multiprocessing. The NB buffer is a block, similar to a global bus interface, with FIFOs that interface between processors and memories and I/O peripherals. By buffering read and write requests it makes the processors more independent. It addresses issues of a processor being idle while it reads data from high-latency memory or slow peripherals, without having a DMA engine. There may be different priority-based queues so that a high-priority queue doesn't get filled with low-priority requests. Thus, the NB buffer may give processors (or threads) different priorities, assign different FIFOs for each I/O peripheral, and order transactions according to an earliest deadline first strategy to get better DRAM utilization.
U.S. Pat. No. 6,006,303 to Barnaby et al. discloses a memory architecture having multiple DMA engines accessing a common resource. Each engine maintains statistics and changes priority dynamically based on changing access demand conditions according to an arbitration scheme that includes latency, bandwidth and throughput.