This invention relates in general to signal processing and more specifically to Single Instruction Multiple Data (SIMD) coprocessor architectures providing for faster image and video signal processing, including one- and two-dimensional filtering, transforms, and other common tasks.
A problem which has arisen in image processing technology is that two-dimensional (2-D) filtering has a different addressing pattern than one-dimensional (1-D) filtering. Earlier DSPs and coprocessors, designed for 1-D filtering, may have to be modified to process 2-D video signals. The desired end goal is to enable a digital signal processor (DSP) or coprocessor to perform image and video processing expediently. In image processing, the most useful operations are 1-D and 2-D filtering, which require addressing the 2-D data and the 1-D or 2-D convolution coefficients. When the convolution coefficients are symmetrical, an architecture that makes use of the symmetry can cut computation time roughly in half.
The primary bottleneck identified for most video encoding algorithms is motion estimation. The problem of motion estimation may be addressed by first convolving an image with a kernel to reduce it to lower resolution images. These images are then reconvolved with the same kernel to produce even lower resolution images. The sum of absolute differences may then be computed within a search window at each level to determine the best matching subimage for a subimage in the previous frame. Once the best match is found at lower resolution, the search is repeated within the corresponding neighborhood at higher resolutions. In view of the above, a need has arisen for an architecture capable of performing 1-D/2-D filtering, preferably symmetrical filtering as well, and the sum of absolute differences with equal efficiency.
Previously, specialized hardware or general-purpose DSPs were used to perform the sum of absolute differences and symmetric filtering operations in SIMD coprocessor architectures. Intel's MMX technology is similar in concept, although much more general purpose. Copending applications filed on Feb. 4, 1998, titled “Reconfigurable Multiply-accumulate Hardware Co-processor Unit”, Provisional Application No. 60/073,668, now U.S. Pat. No. 6,298,366, and “DSP with Efficiently Connected Hardware Coprocessor”, Provisional Application No. 60/073,641, now U.S. Pat. No. 6,256,724, embody a host processor/coprocessor interface and efficient Finite Impulse Response/Fast Fourier Transform (FIR/FFT) filtering implementations, which this invention extends to several other functions.
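The sum-of-absolute-differences search used in the motion estimation described above can be sketched in software as follows. This is a minimal illustration only, not the coprocessor's implementation: the function names `sad`, `get_block`, and `best_match`, the list-of-lists frame layout, and the exhaustive scan order are all assumptions made for the example.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, top, left, size):
    """Extract a size x size subimage whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_match(ref_block, frame, center, radius):
    """Exhaustively search a window of +/- radius around center (top, left).

    Returns (cost, top, left) for the candidate with the smallest SAD.
    Hypothetical helper: the scan order and tie-breaking (first minimum wins)
    are choices made for this sketch.
    """
    size = len(ref_block)
    height, width = len(frame), len(frame[0])
    best = None
    for top in range(max(0, center[0] - radius),
                     min(height - size, center[0] + radius) + 1):
        for left in range(max(0, center[1] - radius),
                          min(width - size, center[1] + radius) + 1):
            cost = sad(ref_block, get_block(frame, top, left, size))
            if best is None or cost < best[0]:
                best = (cost, top, left)
    return best
```

In a coarse-to-fine scheme of the kind described, `best_match` would first be run on the lowest resolution images, and the position it returns would then be scaled up and used as `center` for a smaller search at the next higher resolution.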
The proposed architecture is integrated onto a Digital Signal Processor (DSP) as a coprocessor to assist in the computation of the sum of absolute differences, symmetrical row/column Finite Impulse Response (FIR) filtering with a downsampling (or upsampling) option, row/column Discrete Cosine Transform (DCT)/Inverse Discrete Cosine Transform (IDCT), and generic algebraic functions. The architecture is called IPP, which stands for image processing peripheral, and consists of 8 multiply-accumulate hardware units connected in parallel and routed and multiplexed together. The architecture can rely on a Direct Memory Access (DMA) controller to retrieve data from, and write data back to, DSP memory without intervention from the DSP core. The DSP can set up the DMA transfer and IPP/DMA synchronization in advance, then go on with its own processing task. Alternatively, the DSP can perform the data transfers itself, synchronizing with the IPP on these transfers. This architecture implements 2-D filtering, symmetrical filtering, short filters, sum of absolute differences, and mosaic decoding more efficiently than the previously disclosed Multi-MAC coprocessor architecture (U.S. Pat. No. 6,298,366 titled “Reconfigurable Multiply-Accumulate Hardware Co-Processor Unit”, filed on Jan. 4, 1998 and incorporated herein by reference). This coprocessor will greatly increase the DSP's capacity to perform common 2-D signal processing tasks. The architecture is also scalable, providing an integer speedup in performance for each additional Single Instruction Multiple Data (SIMD) block added to the architecture (provided the DMA can handle data transfers among the DSP and the coprocessors at a rapid enough rate). This technology could greatly accelerate video encoding. This architecture may be integrated onto existing DSPs such as the Texas Instruments TMS320C54x and TMS320C6x. Each of these processors already contains a DMA controller for data transfers.
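The symmetrical row/column FIR filtering with a downsampling option can be sketched in software as follows, to show where the saving from coefficient symmetry comes from. This is a sketch under stated assumptions, not the IPP datapath: the names `symmetric_fir` and `separable_filter`, the odd kernel length, the centre-first coefficient layout, and the handling of borders (outputs are produced only where the kernel fully overlaps the data) are all hypothetical choices for the example.

```python
def symmetric_fir(samples, half_coeffs, step=1):
    """Filter a 1-D sequence with an odd-length symmetric kernel.

    half_coeffs holds [c0, c1, ..., cm], where c0 is the centre tap and ck
    applies to both samples[i - k] and samples[i + k]. A step greater than 1
    downsamples the output by computing only every step-th result.
    """
    m = len(half_coeffs) - 1
    out = []
    for i in range(m, len(samples) - m, step):
        acc = half_coeffs[0] * samples[i]
        for k in range(1, m + 1):
            # Pre-add the mirrored pair, then multiply once: m + 1 multiplies
            # per output instead of the 2m + 1 a direct convolution needs.
            acc += half_coeffs[k] * (samples[i - k] + samples[i + k])
        out.append(acc)
    return out


def separable_filter(image, half_coeffs, step=1):
    """Apply the same symmetric kernel along rows, then along columns."""
    rows = [symmetric_fir(row, half_coeffs, step) for row in image]
    cols = [symmetric_fir(list(col), half_coeffs, step) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]
```

For example, with the kernel [1, 2, 1] written as `half_coeffs=[2, 1]`, `symmetric_fir([1, 2, 3, 4, 5, 6, 7], [2, 1])` returns `[8, 12, 16, 20, 24]`, and the same call with `step=2` returns `[8, 16, 24]`.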