1. Field of the Invention
This invention relates to parallel processing. Specifically, it relates to folded memory array structures that are used in conjunction with execution units to perform fast and efficient cosine transform computations. Such computations are needed for real-time signal processing operations such as video processing.
2. Definition of the Problem
Parallel processors based on folded processor arrays are used in application-specific systems for high-performance computation. A folded processor array is essentially a parallel processor with multiple processing elements interconnected as if the processor array were physically folded and new connections were made between the newly adjacent processing elements. A folded array processor is faster and more efficient than an unfolded parallel processor because folding achieves a higher degree of interconnectivity among the processing elements. A folded linear systolic array used as a CORDIC processor is described in U.S. Pat. No. 4,972,361 to Rader, which is incorporated herein by reference. A systolic folded array used as a Reed-Solomon decoder is described in U.S. Pat. No. 4,958,348 to Berlekamp et al., which is incorporated herein by reference.
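The folding concept can be illustrated with a minimal sketch. It assumes, purely for illustration, a square n-by-n array folded once along its main diagonal, so that element (i, j) comes to rest on element (j, i); the fold geometry and the name `diagonal_fold` are hypothetical and are not taken from the cited patents or applications.

```python
def diagonal_fold(n):
    """Return a map from each cell of the folded triangle to the
    original (row, col) elements that share that cell after the fold.

    Assumption: an n x n array folded once along its main diagonal.
    """
    folded = {}
    for i in range(n):
        for j in range(i, n):
            if i == j:
                # Diagonal elements lie on the fold line and stay alone.
                folded[(i, j)] = [(i, j)]
            else:
                # (i, j) and (j, i) become newly adjacent after the fold.
                folded[(i, j)] = [(i, j), (j, i)]
    return folded


if __name__ == "__main__":
    for cell, members in sorted(diagonal_fold(3).items()):
        print(cell, members)
```

Under this assumption an n-by-n array collapses to n(n+1)/2 cells, and each off-diagonal cell holds an element together with its transposed counterpart, which is one way the added interconnectivity can arise.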
More general-purpose folded processor arrays are described in pending U.S. patent applications Ser. No. 864,112, filed Apr. 6, 1992, and Ser. No. 904,916, filed Jun. 26, 1992, both of which are assigned to the assignee of the present application and are incorporated herein by reference. The '112 application describes an array of processing elements folded once. The '916 application describes an array of processing elements folded three times.
Parallel processor arrays are often used to perform complex operations such as cosine transforms and inverse cosine transforms. Cosine transform operations are required by algorithms used in encoding and decoding video data for digital video standards such as the commonly known MPEG and JPEG standards. Many of these algorithms, for example the two-dimensional discrete cosine transform, require both the data and its transpose in different sections of the algorithm. In addition, these signal processing algorithms require butterfly operations. The transposition operations in particular can cause a significant bottleneck in a processing apparatus: because matrix data is needed in both its original and its transposed form, significant data movement operations are required to prepare the data prior to processing.
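The role of the transpose can be seen in a minimal sketch of a two-dimensional discrete cosine transform computed by the familiar row-column decomposition. The sketch assumes the orthonormal DCT-II and a direct (not fast) one-dimensional transform; the function names are hypothetical, and the code is not the method of any cited reference. A basic two-input butterfly is included for comparison.

```python
import math


def dct_1d(x):
    """Direct orthonormal DCT-II of a sequence (O(n^2), not a fast algorithm)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n))
        out.append(scale * acc)
    return out


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def dct_2d(block):
    """2-D DCT: transform rows, transpose, transform rows again, transpose back."""
    rows_done = [dct_1d(row) for row in block]
    # The second pass needs the intermediate data in transposed form.
    cols_done = [dct_1d(row) for row in transpose(rows_done)]
    return transpose(cols_done)


def butterfly(a, b):
    """Basic two-input butterfly: sum and difference of the inputs."""
    return a + b, a - b


if __name__ == "__main__":
    flat = [[1.0] * 8 for _ in range(8)]
    print(round(dct_2d(flat)[0][0], 6))
```

In this sketch each call to `transpose` copies the whole block, which corresponds to the data movement that the text identifies as a bottleneck.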
What is needed is a transposed, folded memory architecture that is scalable and based on folded parallel processor array concepts. A memory based on such an architecture can be used to store the pel data required to perform fast cosine transform operations for video processing. In this case, the memory is interconnected with a parallel processor as previously described. By adding appropriate execution units to the memory structure, butterfly operations and one- and two-dimensional fast cosine transform computations can be performed within the memory structure itself, preprocessing video data for later stages of a processing system.