This invention relates to data flow management for processors implementing on-chip parallelism, and more particularly to a coprocessor for managing data transfers to enable efficient performance of a processor exhibiting on-chip parallelism.
A "mediaprocessor" as used herein refers to a processor which exhibits on-chip parallelism for processing video or image data. Efficient data flow management is essential for achieving high performance in mediaprocessors. However, current mediaprocessors provide a low-level data transfer interface to application programmers, which not only makes data flow programming difficult, but also makes the source code lengthy and hard to maintain.
Modern microprocessors typically run at sub-GHz clock frequencies with a wide datapath. This requires a high data access rate to the memory. However, accessing the off-chip memory involves a long latency, resulting in a limited overall memory bandwidth. Therefore, fast on-chip memory is used to cache frequently used data and to reduce the external memory access penalty. In addition, various data prefetch techniques have been developed to reduce the effective access latency, i.e., to make the data available in on-chip memory before the CPU actually uses them.
Demanding imaging and video applications with a large amount of raw data are the main target of mediaprocessors. Typical image and video processing algorithms have regular data access patterns. Thus, a program can fetch in advance a block of data that is to be used soon. However, the programming required by conventional mediaprocessors to implement such a load is very limited. Most mediaprocessors require the programmer to partition the processing task and determine the sizes and addresses of the data blocks used for each partition in order to perform a block transfer. Since images and video frames can have varying sizes, the programmer must use irregular partitions and extra control flows to generalize the code, leading to lengthy programs. Further, in some algorithms additional tasks need to be performed such as padding. These are performed through added program instructions, thereby decreasing the efficiency of the data processing task. In many algorithms where the block-based data flow programming is used, it is necessary to explicitly fetch the boundary pixels in addition to the bulk data. Handling such irregular data flows in the program tends to decrease the programmer's overall productivity and also make the source code harder to upgrade and maintain. Moreover, the degree of concurrency between computation and data flow is diminished. Accordingly, there is a need for a more efficient and versatile manner of performing block data transfers.
A special concern for mediaprocessors with a wide datapath is the overhead of packing multiple small-precision operands from different memory locations into a wide word. This is implemented conventionally by explicitly executing instructions in the mediaprocessors, thus consuming valuable CPU cycles and lowering overall performance. Accordingly, there is a need for a more efficient manner of packing multiple operands.
According to the invention, a template data transfer coprocessor is implemented to offload block data transfer operations from a mediaprocessor. The template data transfer coprocessor provides a versatile protocol for specifying various types of block data transfer operations.
According to one aspect of the invention, several types of data transfer templates are introduced: a uni-block template, a program-guided template, an indirect template and a queue-based template. These templates facilitate various types of data transfer flows in a mediaprocessor. According to an advantage of the invention, rather than computing and setting up lower-level data transfer parameters for each block transfer, parameterized templates provide an efficient and versatile mechanism for the programmer to easily design the needed data flow for a specific algorithm. For example, in 2D convolution and affine warping, the number of source code lines related to data flow programming was substantially reduced (e.g., to approximately ⅙ of the number required to achieve data transfer without the template).
According to another aspect of the invention, the uni-block and program-guided templates are used in block-based data transfers between the on-chip and off-chip memory of the mediaprocessor. According to an advantage of such templates, the input and output data blocks in the on-chip memory are able to be double buffered, so that the mediaprocessor functional unit(s) only need to access the data in on-chip memory for fast computation, while the data transfers from/to the slow external memory are hidden from the functional unit. For each block, the address and size are computed from the template by the template data transfer coprocessor. Therefore, data flow programming is simplified, and the computing burden on the CPU is reduced.
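By way of illustration only, the per-block address and size computation described above can be sketched in software as follows. The sketch is not the coprocessor's actual logic; the name `UniBlockTemplate`, its fields, and the function `blocks` are hypothetical and chosen for this example. It assumes a raster-order traversal, clips edge blocks to the image size, and alternates a buffer index to suggest double buffering.

```python
from dataclasses import dataclass

@dataclass
class UniBlockTemplate:
    # All field names are assumptions made for this sketch.
    base: int      # byte address of the first pixel of the image
    pitch: int     # bytes between the starts of consecutive rows
    width: int     # image width in bytes
    height: int    # image height in rows
    block_w: int   # nominal block width in bytes
    block_h: int   # nominal block height in rows

def blocks(t):
    """Yield (address, width, height, buffer_index) per block, in raster order."""
    n = 0
    for y in range(0, t.height, t.block_h):
        for x in range(0, t.width, t.block_w):
            w = min(t.block_w, t.width - x)    # clip the last column of blocks
            h = min(t.block_h, t.height - y)   # clip the last row of blocks
            yield (t.base + y * t.pitch + x, w, h, n % 2)
            n += 1
```

In this sketch the programmer supplies only the template parameters once; the irregular edge partitions fall out of the clipping and need no extra control flow in the application program.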
The uni-block template specifies blocks from the source memory space which can be overlapped. Consequently, the boundary blocks are padded or wrapped-around to prepare the necessary overlapping areas along the boundary.
The padding and wrapping are performed during actual data transfers. Therefore, the mediaprocessor computation is further reduced, and the programmer's burden is lowered due to the fact that these irregular data copy operations are handled automatically by the template data transfer coprocessor.
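The effect of padding and wrapping on a boundary block can be illustrated with the following minimal sketch. It assumes a constant pad value (the text above does not fix the padding rule) and a hypothetical function `fetch_block` with a `halo` parameter giving the overlap width; none of these names come from the invention itself.

```python
def fetch_block(img, x0, y0, w, h, halo, mode="pad", pad_value=0):
    """Copy a w-by-h block plus a halo of overlapping pixels from a 2D list.

    Pixels outside the image are filled with pad_value (mode="pad")
    or taken from the opposite edge (mode="wrap").
    """
    rows, cols = len(img), len(img[0])
    out = []
    for y in range(y0 - halo, y0 + h + halo):
        row = []
        for x in range(x0 - halo, x0 + w + halo):
            if 0 <= y < rows and 0 <= x < cols:
                row.append(img[y][x])                 # ordinary in-image pixel
            elif mode == "wrap":
                row.append(img[y % rows][x % cols])   # wrap around the edges
            else:
                row.append(pad_value)                 # constant padding
        out.append(row)
    return out
```

Because the fill happens while the block is being copied, the computation that later consumes the block can treat boundary and interior blocks identically.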
According to another aspect of the invention, the indirect data flow template allows for an efficient way to assemble data from random locations without creating any significant burden to the processor's computation engine.
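As a rough software analogy of such an indirect transfer, the sketch below gathers bytes from arbitrary addresses given by an index list and packs them into one wide word. The functions `gather` and `pack_word`, and the little-endian packing order, are assumptions made for illustration only.

```python
def gather(memory, addresses):
    """Collect one byte from each listed address, in list order."""
    return bytes(memory[a] for a in addresses)

def pack_word(operands):
    """Pack the gathered bytes into a single integer, first byte lowest."""
    return int.from_bytes(operands, "little")

memory = bytes(range(100))                 # stand-in for a memory space
operands = gather(memory, [5, 50, 7, 99])  # four scattered 8-bit operands
word = pack_word(operands)                 # one 32-bit packed word
```

In the arrangement described above this assembly is carried out by the data transfer path, so the computation engine receives already-packed operands instead of executing pack instructions.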
According to another aspect of the invention, the queue-based data flow template facilitates a data transfer between a virtual queue and a physical queue. The virtual queue may be a circularly-accessed memory space and have an arbitrary size. The physical queue buffers the data and provides an operand stream for the computation engine. The template data transfer coprocessor streamlines the data from a 2D memory space into the operand buffer or dispatches the mediaprocessor results stored in the destination operand buffer to a 2D memory space. According to an advantage of such a template, the queue-based data flow eliminates many load/store instructions and relieves the pressure on the register file size. Moreover, the code scheduling becomes simpler without the long latency load/store instructions, and the amount of loop unrolling needed for high performance can be reduced, resulting in more compact code.
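The relationship between the virtual queue and the physical queue can be sketched as follows. This is a simplified, one-dimensional illustration: the function `refill`, its parameters, and the use of a `deque` as the physical queue are assumptions for this example and do not reflect the actual buffer hardware.

```python
from collections import deque

def refill(memory, base, size, read_pos, fifo, capacity):
    """Top up the physical queue from a circularly accessed memory region.

    memory[base : base + size] is the virtual queue; read_pos is the
    current offset into it. Returns the updated offset.
    """
    while len(fifo) < capacity:
        fifo.append(memory[base + read_pos])
        read_pos = (read_pos + 1) % size   # wrap at the end of the region
    return read_pos

memory = list(range(10))
fifo = deque()                              # stand-in for the operand buffer
pos = refill(memory, base=2, size=5, read_pos=3, fifo=fifo, capacity=4)
```

The consumer simply pops operands from `fifo` in order, which is the sense in which explicit load instructions are eliminated from the computation loop.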
According to another aspect of the invention, the template data transfer coprocessor includes a 2D address generation unit (AGU) which has a built-in logic unit to allow a sequence of addresses to be generated uninterruptedly. These addresses correspond to a 2D block, which serves as the smallest data transfer unit. In that way, the central controller in the template data transfer coprocessor does not need to get involved in each address generation event, thus leaving more time for other template computation, e.g., locating the next transfer block, arbitrating among data flows, etc.
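The address sequence for one 2D block can be illustrated with a short sketch. It assumes row-major storage with a fixed row pitch and unit-sized elements; the function name `block_addresses` and its parameters are hypothetical.

```python
def block_addresses(base, rows, cols, pitch):
    """Yield every element address of a rows-by-cols block, row by row."""
    for r in range(rows):
        row_start = base + r * pitch   # first address of this row
        for c in range(cols):
            yield row_start + c

addrs = list(block_addresses(base=100, rows=2, cols=3, pitch=16))
```

Once the four parameters are set, the whole sequence is produced without further intervention, which mirrors how the AGU relieves the central controller of per-address work.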
According to another aspect of the invention, the template data transfer coprocessor includes a template interpreter which serves as the main controller of the coprocessor. In one embodiment the interpreter employs an event-driven control mechanism to set up a template and compute block information for each template. According to an advantage of the interpreter the computation on each type of template is modularized. Therefore, a new template type may be easily added.
These and other aspects and advantages of the invention will be better understood by reference to the following detailed description taken in conjunction with the accompanying drawings.