The DCT is considered to be the most effective of the various transform coding methods for image compression or video bandwidth compression. A DCT is similar to a Discrete Fourier Transform (DFT) but includes only cosine terms. To achieve bandwidth compression in this way, a square block of digitally encoded picture elements, or pixels, can be transformed into the frequency domain by a two-dimensional (N×N) DCT processor to which the N×N block of pixel data is applied. The input data matrix is multiplied by an N×N discrete cosine matrix to yield an intermediate matrix, and the transpose of the intermediate matrix is then multiplied by the same discrete cosine matrix to yield the desired two-dimensional transformed matrix. The elements of the transformed matrix can then be quantized, and only the most energetic terms need be transmitted. At the receiver, an inverse transformation is performed to reconstruct the original video signal in the spatial domain. For the N×N DCT, a larger N achieves a better compression ratio but requires more computation.
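The two-step matrix procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the processor of the invention: the function names (`dct_matrix`, `matmul`, `transpose`, `dct2`, `idct2`) are hypothetical, and the orthonormal DCT-II scaling of the cosine matrix is an assumption, since the text does not fix a normalization.

```python
import math

def dct_matrix(n):
    """Return an n x n cosine matrix (assumed orthonormal DCT-II scaling)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows

def matmul(a, b):
    """Plain matrix product: each output element is one inner product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def dct2(block):
    """Two-dimensional DCT by two one-dimensional passes and a transpose."""
    n = len(block)
    c_t = transpose(dct_matrix(n))
    # First pass: one-dimensional DCT of every row gives the intermediate matrix.
    intermediate = matmul(block, c_t)
    # Second pass: transpose, then apply the same cosine matrix again.
    result_t = matmul(transpose(intermediate), c_t)
    # Transpose back so the result equals C * block * C^T.
    return transpose(result_t)

def idct2(coeffs):
    """Inverse transform, valid because the assumed matrix is orthonormal."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)

flat = [[100.0] * 16 for _ in range(16)]
coeffs = dct2(flat)
print(round(coeffs[0][0], 6))
```

Under these assumptions, a constant 16×16 block of value 100 concentrates all of its energy in the single top-left coefficient (approximately 1600), which illustrates why only the most energetic terms need be transmitted.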
Matrix multiplication involves forming the inner product of two N×1 vectors to yield a single element of the product matrix. Thus each element of a row of the input matrix must be multiplied by the corresponding element of a column of the cosine matrix, and the products summed, to yield a single element of the product matrix. For the transformation of a 16×16 block of pixels, 16 products must therefore be summed to yield a single element or coefficient of the intermediate and the transformed matrices, each of which has 256 elements. Many fast algorithms have been derived to reduce the number of computations required. For example, the DCT matrix has been decomposed into several sparse matrices, which result in butterfly structures. These butterfly structures reduce the computation significantly but still require many high-speed multipliers, which occupy a large silicon area in an IC implementation and result in messy interconnections, poor on-chip routing, and an irregular shape. All of these factors make VLSI (Very Large Scale Integration) implementation of butterfly structures very inefficient. An example of a VLSI implementation of the DCT using such a structure is described in an article entitled "A Discrete Fourier-Cosine Transform Chip" in the IEEE Journal on Selected Areas in Communications, Jan. 1986, pp. 49-61. The resulting chip, shown in FIG. 17 therein, includes many multipliers, does not efficiently utilize the silicon area, and can implement only an 8×1, one-dimensional transform. The two-dimensional transform contains two one-dimensional transforms and needs temporary storage for intermediate results and matrix transposition; it is thus much more complex than the one-dimensional transform.
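The operation count implied by the direct matrix method can be worked out with a short calculation. The sketch below only restates the arithmetic of the paragraph above; the function name `direct_multiplications` is hypothetical, and the count ignores any savings from fast algorithms.

```python
def direct_multiplications(n):
    """Multiplications for an n x n 2-D DCT done as two plain matrix products."""
    per_element = n        # one inner product of two n x 1 vectors
    elements = n * n       # elements in one n x n product matrix
    matrix_products = 2    # the intermediate matrix, then the transformed matrix
    return per_element * elements * matrix_products

print(direct_multiplications(16))  # 8192
```

For a 16×16 block this gives 16 products for each of 256 elements in each of the two matrices, or 8192 multiplications per block.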
Our invention is a response to a need for real-time processing of the two-dimensional DCT which can be efficiently implemented by state-of-the-art VLSI technology. Our invention provides real-time processing of the 16×16 DCT on a single chip. This means that the processor must provide transformed 16×16 matrices for application to a quantizer at the same rate that the 16×16 input matrices are being generated by the video camera. Implemented in present-day MOS technology, the processor should be able to handle an input sample or pixel rate of 14.3 MHz, a rate commonly used in digital video systems. Due to the large amount of computation required, real-time processing at this rate can be achieved only by exploiting inherent concurrency and parallelism in the architecture. Also, since the silicon area and the design effort needed for implementing an algorithm depend heavily on the degree of regularity of the architecture, the challenge of efficiently implementing the DCT in VLSI is to develop an architecture which can realize the enormous number of multiplications required with a regular structure.
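The scale of the real-time requirement can be estimated with a rough calculation. The sketch below assumes the direct matrix method (two N×N matrix products per block, with no fast algorithm) purely to give an order of magnitude; the function names are hypothetical, and only the 14.3 MHz pixel rate and the 16×16 block size come from the text.

```python
PIXEL_RATE_HZ = 14.3e6  # input sample rate stated in the text
BLOCK_SIZE = 16         # N for the 16 x 16 transform

def block_period_seconds(n, pixel_rate_hz):
    """Time for one n x n block of pixels to arrive at the given pixel rate."""
    return n * n / pixel_rate_hz

def required_mult_rate(n, pixel_rate_hz):
    """Multiplications per second if each block costs 2 * n**3 multiplications."""
    mults_per_block = 2 * n ** 3  # assumption: direct matrix products
    return mults_per_block / block_period_seconds(n, pixel_rate_hz)

period_us = block_period_seconds(BLOCK_SIZE, PIXEL_RATE_HZ) * 1e6
rate_millions = required_mult_rate(BLOCK_SIZE, PIXEL_RATE_HZ) / 1e6
print(f"block period: {period_us:.1f} microseconds")
print(f"multiplication rate: {rate_millions:.1f} million per second")
```

Under these assumptions a new block arrives roughly every 17.9 microseconds, and the direct method would need 32 multiplications per input pixel, or about 457.6 million multiplications per second, which is why concurrency and parallelism are needed.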