The present invention relates to digital signal processing, more particularly to a processor element, processing unit, and processor adapted for efficient execution of both multiplication and other operations such as finding a cumulative absolute difference.
Digital signal processing being multiplication-intensive, the prior art abounds in processor elements that combine a hardware multiplier with other arithmetic and logic facilities such as an adder. Such processor elements (PEs) have often been assembled into array processors in which the individual PEs can operate in parallel for high-speed vector and matrix arithmetic, or can be pipelined to carry out more complex operations that a single PE cannot perform alone.
An example is a prior-art array processor developed for use in telephone apparatus that transmits compressed video images. The processor is configured as a four-by-four array of PEs, each comprising a multiplier and an adder. Operating concurrently and independently, the sixteen PEs perform 4×4 matrix operations. In addition, the four PEs in a row of the array can be interconnected to operate as a pipeline.
One operation for which the PEs must be pipelined is that of finding the cumulative absolute difference between two series of inputs, an operation necessary in image compression by the motion compensation method. In each pipeline the first PE finds the difference between two inputs, the second PE takes the absolute value of the difference, and the third PE adds the absolute value to the cumulative total. The fourth PE has no function.
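For illustration only, the three pipeline stages described above can be sketched in software as follows. The function name `cumulative_absolute_difference` and the use of Python lists for the two input series are assumptions made for this sketch; they do not describe the prior-art hardware itself, in which each stage is a separate PE.

```python
def cumulative_absolute_difference(a, b):
    """Return the sum of |a[i] - b[i]| over two equal-length input series.

    Hypothetical software model of the pipeline: each statement in the
    loop body stands in for the work assigned to one PE.
    """
    if len(a) != len(b):
        raise ValueError("input series must have equal length")
    total = 0
    for x, y in zip(a, b):
        diff = x - y           # first PE: difference between two inputs
        magnitude = abs(diff)  # second PE: absolute value of the difference
        total += magnitude     # third PE: add to the cumulative total
    return total               # the fourth PE would have no work to do


if __name__ == "__main__":
    print(cumulative_absolute_difference([3, 7, 2], [5, 4, 2]))
```

No multiplication appears anywhere in the loop, which is why the multiplier in each PE goes unused during this operation.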
One problem with this prior-art array processor is that only four pipelines can operate in parallel. In the standard motion-compensation method, detection of a single motion vector requires the determination of a large number of cumulative absolute differences, so it would be useful if cumulative absolute difference operations could be performed more than four at a time.
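To indicate why so many cumulative absolute differences are needed, the following is a minimal sketch of an exhaustive block-matching search, assuming a square block and a square search window. The names `find_motion_vector`, `size`, and `radius` are hypothetical, and the search strategy shown is one common form of motion compensation rather than the specific method used by the prior-art apparatus.

```python
def cumulative_absolute_difference(a, b):
    """Sum of |a[i] - b[i]| over two equal-length series."""
    return sum(abs(x - y) for x, y in zip(a, b))


def find_motion_vector(cur, ref, top, left, size, radius):
    """Exhaustive search for the displacement (dy, dx) that minimizes the
    cumulative absolute difference between a block of the current frame
    and a displaced block of the reference frame.

    Returns (best_vector, best_cost, evaluations). All parameter names
    are assumptions made for this illustration.
    """
    cur_pixels = [cur[top + i][left + j]
                  for i in range(size) for j in range(size)]
    best_vector, best_cost, evaluations = None, None, 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r0, c0 = top + dy, left + dx
            # Skip candidate blocks that fall outside the reference frame.
            if r0 < 0 or c0 < 0:
                continue
            if r0 + size > len(ref) or c0 + size > len(ref[0]):
                continue
            ref_pixels = [ref[r0 + i][c0 + j]
                          for i in range(size) for j in range(size)]
            cost = cumulative_absolute_difference(cur_pixels, ref_pixels)
            evaluations += 1  # one cumulative absolute difference per candidate
            if best_cost is None or cost < best_cost:
                best_vector, best_cost = (dy, dx), cost
    return best_vector, best_cost, evaluations
```

Under these assumptions a search window of radius r requires up to (2r + 1)² cumulative absolute differences per motion vector, for example 225 when r is 7, so hardware limited to four at a time must make many passes.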
Another problem is that since each PE has its own hardware multiplier, the PEs are large in size. This limits the number of PEs that can be included in an array, especially when the array is implemented on a single semiconductor chip (as in the example above).
A further problem is that an array processor of the above type performs efficiently only in computations such as matrix multiplication that benefit from parallel multiply-add operations. In the cumulative absolute difference operation no use is made of the multiplier in each PE, even though the multiplier accounts for a large part of the PE's circuitry.
Still another problem is that pipelining itself tends to be inefficient. In the example above, the fourth PE in each pipeline is left idle.