Various prior art techniques exist for the transfer of data between system memories or between system memories and input/output (I/O) devices. FIG. 1 shows a conventional data processing system 100 comprising a processor local memory 110, a host uniprocessor 120, I/O devices 130 and 140, a system memory 150, which is usually a larger memory store with a longer access delay than the processor local memory 110, and a direct memory access (DMA) controller 160.
The DMA controller 160 provides a mechanism for transferring data between processor local memory and system memory or I/O devices concurrently with uniprocessor execution. DMA controllers are sometimes referred to as I/O processors or transfer processors in the literature. System performance is improved since the host uniprocessor can perform computations while the DMA controller is transferring new input data to the processor local memory and transferring result data to output devices or the system memory.
A data transfer between a source and a destination is typically specified with the following minimum set of parameters: source address, destination address, and number of data elements to transfer. Addresses are interpreted by the system hardware and uniquely specify I/O devices or memory locations from which data must be read or to which data must be written. Sometimes additional parameters, such as data element size, are provided.
One of the limitations of conventional DMA controllers is that address generation capabilities for the data source and data destination are often constrained to be the same. For example, when only a source address, a destination address, and a transfer count are specified, the implied data access pattern is block-oriented; that is, a sequence of data words from contiguous addresses starting with the source address is copied to a sequence of contiguous addresses starting at the destination address.
Array processing presents challenges for data transfer in terms of addressing flexibility, control, and performance. The patterns in which data elements are distributed to and collected from processing element (PE) local memories can significantly affect the overall performance of the processing system. One important application is fast Fourier transform (FFT) processing, which uses bit-reversed addressing to reorder the data elements.
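By way of illustration only, the difference between the block-oriented access pattern and bit-reversed addressing described above can be sketched in software. The sketch below is a simplified model and not a description of any particular DMA controller: memory is modeled as a Python list of words, and the function names `block_transfer`, `bit_reverse`, and `bit_reversed_transfer` are hypothetical names chosen for this example. It is further assumed that the bit-reversed transfer applies the reordering on the destination side and that the transfer count is a power of two.

```python
def block_transfer(memory, src, dst, count):
    """Block-oriented transfer: copy `count` words from contiguous
    addresses starting at `src` to contiguous addresses starting at `dst`."""
    for i in range(count):
        memory[dst + i] = memory[src + i]


def bit_reverse(index, bits):
    """Return `index` with its low `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_transfer(memory, src, dst, count):
    """Copy `count` words read sequentially from `src`, writing word i
    to destination offset bit_reverse(i). Assumes `count` is a power of two."""
    bits = count.bit_length() - 1
    if count <= 0 or 1 << bits != count:
        raise ValueError("count must be a power of two")
    for i in range(count):
        memory[dst + bit_reverse(i, bits)] = memory[src + i]


# Illustrative memory: 8 source words followed by two 8-word destination regions.
memory = list(range(8)) + [None] * 16
block_transfer(memory, src=0, dst=8, count=8)
bit_reversed_transfer(memory, src=0, dst=16, count=8)

block_copy = memory[8:16]   # same order as the source
reordered = memory[16:24]   # source words in bit-reversed order
```

In this model the block transfer needs only the three minimum parameters (source address, destination address, count), whereas the bit-reversed transfer requires the source and destination address sequences to be generated differently, which is the kind of flexibility the conventional parameter set does not express.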
With the advent of the manifold array (ManArray) architecture, it has been recognized that it will be advantageous to have improved techniques for data transfer which efficiently provide these and other capabilities and which are tailored to this new architecture.