Software-based communication systems transfer massive quantities of data. A software implementation of the physical layer of a wireless broadband communication system, particularly a 4G wireless system, must sustain very high data transfer rates because of the wider transmission spectrum and the shorter transmission frames, and the transfers must be completed within tight timing constraints. Direct Memory Access (DMA) hardware units are used to perform these data transfer tasks. FIG. 1 presents a typical conventional DMA implementation. DMA 10 includes a master interface 11 and a slave bus interface 12 (registers) and is coupled by an advanced high performance system bus 13 to a plurality of peripherals, such as USB devices, printers, etc., of which only two are shown: a source peripheral A 14 and a destination peripheral B 15. Each source peripheral 14 is associated with a source FSM (Fast Sequencing Module) 9 and each destination peripheral 15 is associated with a destination FSM 19. A driver CPU 16 is coupled to bus 13 for programming the registers (master interface (I/F) 11 for transferring data and control interface (I/F) 12 for transferring control), i.e., specifying where to read from, where to write to, and how much data to take each time (the size of the data block). Driver CPU 16 is coupled to a plurality of CPUs in the system (not shown), which share the use of the same DMA 10. A plurality of peripheral buses are provided for delivering control information to DMA 10. Multiple DMA transactions may take place simultaneously through different channels. Although the channels allow transactions to proceed in parallel, they all use the same bus. An arbiter 18 is provided to select the order of memory transfers between the various destination peripherals.
The characteristics of existing DMAs are:
Transfers: A DMA can transfer a block of data of known length from: 1) one memory location to another memory location (usually from slow memory, like DRAM, to a faster memory inside the CPU (Central Processing Unit)); 2) a memory to an output device; 3) an input device to memory.
Shared resource: A DMA is a shared hardware device, handled by the Operating System driver. All CPUs in the system and all tasks use a single driver entity.
Programming: Programming of the DMA requires loading a number of control words, such as source address, destination address, stride, block length, bus control information, etc. The DMA programming is accomplished by writing control information via a control bus. Usually the control bus is slower than the CPU.
Address sequence: Addresses in memory are either contiguous or separated by pre-defined jumps, called stride. Copying from a contiguous memory block into a non-contiguous memory block with fixed stride is called "scatter", while copying from a non-contiguous block with fixed stride into a contiguous memory block is called "gather".
Chaining: A DMA can either be programmed to perform a single block transfer or it may read the programming data for the next transaction automatically from a linked list in the memory.
Synchronization: Many DMA transactions in 4G implementations take place between a local memory of a hardware unit and a main system memory. In the traditional approach, once the hardware unit completes generating data into its local memory, it issues an interrupt signal to a processor (its driver in the operating system) indicating that the data is available. The processor stops its current operation and programs the DMA to transfer the data from the hardware unit local memory into the system memory. Once the DMA has finished the memory transfer, it also generates an interrupt signal to the CPU of the operating system. The CPU then programs the hardware unit to start processing the next task. This process requires two interrupts to the CPU (operating system) for each DMA transaction. The overhead is significant.
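By way of illustration only, the strided "scatter" and "gather" address sequences and the linked-list mode described above can be modelled in a few lines of Python. This is a minimal sketch and does not describe any particular DMA unit: the `Descriptor` class, its field names, and the `run_chain` function are hypothetical, and memory is assumed to be a flat, word-addressed list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Descriptor:
    """Hypothetical control words for one DMA transaction (illustrative names)."""
    src: int              # source start address (word index)
    dst: int              # destination start address (word index)
    length: int           # block length in words
    src_stride: int = 1   # address step between consecutive source words
    dst_stride: int = 1   # address step between consecutive destination words
    next_desc: Optional["Descriptor"] = None  # next linked-list entry, if any


def run_chain(memory: List[int], first: Descriptor) -> int:
    """Execute a descriptor chain on a flat memory; return transactions done."""
    count = 0
    desc = first
    while desc is not None:
        for i in range(desc.length):
            src_addr = desc.src + i * desc.src_stride
            dst_addr = desc.dst + i * desc.dst_stride
            memory[dst_addr] = memory[src_addr]
        count += 1
        desc = desc.next_desc  # fetch the next transaction from the list
    return count


# Words 0..7 hold sample data; the remaining 16 words start at zero.
memory = list(range(8)) + [0] * 16

# Scatter: contiguous words 0..3 go to addresses 8, 10, 12, 14.
scatter = Descriptor(src=0, dst=8, length=4, dst_stride=2)
# Gather: strided words 8, 10, 12, 14 go to contiguous addresses 20..23.
gather = Descriptor(src=8, dst=20, length=4, src_stride=2)
scatter.next_desc = gather

done = run_chain(memory, scatter)
```

In this model both transactions run from a single chain start, which mirrors the linked-list mode. Both address sequences remain fixed-stride, which is the only non-contiguous pattern such a descriptor can express.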
The main limitations of conventional DMA units in high speed communication systems are:
DMA programming is implemented by the operating system. As the DMA is a shared resource for all tasks and all CPUs in the system, programming each DMA transaction requires an Operating System (OS) request. This mechanism requires a control transfer from the running process (or task) to the OS, which imposes a large overhead. In 4G communication systems, a single unit may require 100,000-1,000,000 different transactions in one second. Programming so many transactions in one second through the operating system is not feasible.
DMA control is implemented through a control bus. Control buses are slow and require many CPU clocks to program one parameter. FIG. 1 shows that the CPU to DMA interface is implemented by system and control buses. Programming a typical DMA transaction requires writing about 5 words, so that programming a typical DMA transaction takes 15-20 CPU cycles. This is a major obstacle when DMA transactions are short (e.g., up to 100 words).
The DMA transaction address sequence is limited to linear or scatter/gather only. In 4G systems, many data blocks are received with different timing and, thus, are stored in different locations in memory. When arbitrary patterns of data access are required, a conventional DMA cannot be used; instead, the CPU calculates each address separately and uses load/store instructions. The efficiency of data fetching determines the efficiency of the processing. Thus, inefficient DMA transfers translate into inefficient CPU usage.
Synchronization: As described in the previous section, the synchronization of DMA operation with the application software requires two operating system interrupts. With the high number of DMA transactions, a mechanism that requires 2 interrupts to the CPU for each DMA transaction is not feasible.
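The scale of these limitations can be seen from simple arithmetic on the figures given above (100,000-1,000,000 transactions per second, 15-20 CPU cycles to program each transaction, and two interrupts per transaction). The following Python sketch is illustrative only, and the function name `control_overhead` is hypothetical. It counts only the control-bus programming cycles and the number of interrupts; the cost of the OS request and of servicing each interrupt is not quantified above and is therefore left out.

```python
def control_overhead(transactions_per_second: int,
                     cycles_per_programming: int,
                     interrupts_per_transaction: int = 2):
    """Return (programming cycles per second, interrupts per second)."""
    programming_cycles = transactions_per_second * cycles_per_programming
    interrupts = transactions_per_second * interrupts_per_transaction
    return programming_cycles, interrupts


# Lower bound: 100,000 transactions/s at 15 cycles each.
low = control_overhead(100_000, 15)
# Upper bound: 1,000,000 transactions/s at 20 cycles each.
high = control_overhead(1_000_000, 20)

print(low)   # (1500000, 200000)
print(high)  # (20000000, 2000000)
```

Under these figures the CPU spends between 1.5 million and 20 million cycles per second on programming alone, and receives between 200,000 and 2,000,000 interrupts per second.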
It is an object of the present invention to solve many of the limitations of existing DMA units in high end wireless communication products.