The Discrete Cosine Transform (DCT) is a frequently used operation in several applications. For example, it is advantageous in the area of video image data compression to apply a two dimensional (2-D) DCT to inputted video data.
The discrete cosine transform is defined as: ##EQU3## where x.sub.ij are input elements in an N.times.N matrix, with i,j=0,1,2, . . . , N-1, and where z.sub.mn are output elements in an N.times.N matrix, with m,n=0,1,2, . . . , N-1, and where: ##EQU4## The Inverse Discrete Cosine Transform (IDCT) is defined as: ##EQU5## The matrix expression for equation (1) is Z=CXC.sup.t (3)
where X is the input data matrix, C is the cosine transform matrix and C.sup.t is the transpose of the matrix C. If the input matrix X is an N.times.N matrix, then equation (3) may be expressed as: ##EQU6## The computation of the IDCT is similar to that of the DCT except that the matrix C.sup.t is substituted for the matrix C and vice versa. Therefore, the discussion may continue with only the DCT without loss of generality. Equation (3) may be divided into two multiplications as follows: Y=CX (5) and Z=YC.sup.t (6)
where Y is an intermediate product matrix and C.sup.t is the transpose of C.
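The split of equation (3) into equations (5) and (6) can be sketched in a few lines of Python. This is only an illustration: the exact scaling of the matrix C is given by the equations not reproduced here, so the sketch assumes the common orthonormal DCT-II scaling, and the function names (dct_matrix, matmul, transpose, dct_2d, idct_2d) are hypothetical.

```python
import math

def dct_matrix(n):
    """Cosine transform matrix C (ASSUMED orthonormal DCT-II scaling)."""
    c = []
    for m in range(n):
        scale = math.sqrt(1.0 / n) if m == 0 else math.sqrt(2.0 / n)
        c.append([scale * math.cos((2 * i + 1) * m * math.pi / (2 * n))
                  for i in range(n)])
    return c

def matmul(a, b):
    """Plain row-column product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(a):
    return [list(row) for row in zip(*a)]

def dct_2d(x):
    """Z = C X C^t computed as two 1-D passes."""
    c = dct_matrix(len(x))
    y = matmul(c, x)                 # equation (5): Y = C X
    return matmul(y, transpose(c))   # equation (6): Z = Y C^t

def idct_2d(z):
    """Inverse: C^t and C exchange roles, X = C^t Z C."""
    c = dct_matrix(len(z))
    return matmul(matmul(transpose(c), z), c)
```

Under the assumed scaling, applying idct_2d to the output of dct_2d returns the original matrix up to floating-point rounding, which illustrates why the discussion can proceed with the DCT alone.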
There are several conventional techniques and circuits for performing a DCT of an inputted data matrix. N. Ahmed, T. Natarajan & K. Rao, "Discrete Cosine Transform," IEEE Trans. on Computers, vol. C-23, January 1974, pp. 90-93 teaches a general DCT technique. N. Cho & S. Lee, "Fast Algorithm and Implementation of 2-D Discrete Cosine Transform," IEEE Trans. on Cir. & Sys., vol. 38, no. 3, March 1991, pp. 297-305 discloses a circuit that computes a 2-D N.times.N DCT using N 1-D DCT circuits, or using multiplexers and N/2 1-D DCT circuits. M. Sun, T. Chen & A. Gottlieb, "VLSI Implementation of a 16.times.16 Discrete Cosine Transform," IEEE Trans. on Cir. & Sys., vol. 36, no. 4, April 1989, pp. 610-17 teaches a 2-D DCT circuit with DCT stages having a memory for storing partial results in the matrix calculations of 1-D DCTs. H. Hou, "A Fast Recursive Algorithm For Computing the Discrete Cosine Transform," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-35, no. 10, October 1987, pp. 1455-61 teaches decimation techniques for reducing an N.times.N DCT to an N/2.times.N/2 DCT.
FIG. 1 shows an efficient circuit 10 for computing the product of three matrices, disclosed in U.S. Pat. No. 5,204,830, which may also be used to compute the DCT of equation (4). The circuit computes a 2.times.2 DCT using a first 1-D DCT stage 12 for computing Y=XC.sup.t and a second 1-D DCT stage 14 for computing the matrix Z=CY. The first stage 12 has a central column register circuit 20 for receiving one element of the matrix X per clock cycle and one column multiplication circuit 16 or 18 for each column of the matrix C.sup.t. Each column multiplication circuit 16 or 18 receives one element of its corresponding column of the matrix C.sup.t per clock cycle and computes the elements of the corresponding column of the matrix Y using row-column multiplication, i.e., by multiplying the elements of the inputted column by corresponding elements of an appropriate row of the matrix X and adding together the products. To compute all of the elements of the column of the matrix Y, each column multiplication circuit 16 or 18 receives the elements of the corresponding column of the matrix C.sup.t N times, while the central column register circuit 20 receives each element of the matrix X in order from the first row to the last row. The computed elements of the matrix Y are then inputted to the second stage 14, which operates in a fashion similar to the stage 12 to compute the elements of the matrix Z.
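The data flow of the first stage can be modeled in software as follows. This is a behavioral sketch only, not a description of the actual circuit of U.S. Pat. No. 5,204,830: it assumes one element of X arrives per simulated clock cycle in row-major order, that each column unit holds one running sum, and that a finished element of Y is read out after every N cycles. The name stream_stage and its return values are hypothetical.

```python
def stream_stage(x, ct):
    """Model of one 1-D stage computing Y = X * Ct.

    x  : N x N input matrix, streamed one element per cycle, row by row
    ct : N x N matrix whose column j feeds column unit j
    Returns (Y, number of simulated clock cycles).
    """
    n = len(x)
    y = [[0] * n for _ in range(n)]
    acc = [0] * n          # one accumulator per column multiplication unit
    cycles = 0
    for i in range(n):                 # rows of X, first to last
        for k in range(n):             # one element of X per cycle
            element = x[i][k]
            for j in range(n):         # column units operate in parallel
                # unit j sees element k of column j of Ct on this cycle
                acc[j] += element * ct[k][j]
            cycles += 1
        for j in range(n):             # row i of Y is complete
            y[i][j] = acc[j]
            acc[j] = 0
    return y, cycles
```

In this model each column unit sees its column of C.sup.t once per row of X, i.e., N times in total, and the stage finishes in N.times.N cycles.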
The column multiplication circuits 16, 18 have a plurality of processing elements PE1, PE2, and PE3 which compute part of a product according to Booth's algorithm. According to Booth's algorithm, a multiplicand is multiplied by an M-bit multiplier by examining each m.sup.th bit of the multiplier (0.ltoreq.m.ltoreq.M-1) one at a time. If the m.sup.th bit of the multiplier is a logic `1`, the multiplicand shifted m bits to the left is accumulated in a running total which, after the (M-1).sup.th bit has been examined, is the product. Thus, the multiplicand 6 (`110` in binary) is multiplied with the multiplier 11 (`1011` in binary) by adding together `110` (for the 0.sup.th bit of `1011`), `1100` (for the 1.sup.st bit of `1011`) and `110000` (for the 3.sup.rd bit of `1011`) to produce `1000010`. Each processing element PE1 16-1, PE2 16-2, . . . , PE3 16-4 computes a partial result for a particular m.sup.th bit, with the processing element PE3 16-4 also adding together several multiplicand-multiplier products to compute an element of the matrix Y. For example, the processing element 16-2 corresponds to the m=1.sup.st bit and utilizes the m=1.sup.st bit of the matrix element stored in the register 21-1 as a selector control bit to a multiplexer 36-2. If the bit is a logic `1`, the multiplexer selects the multiplicand (inputted from the register 30-1) shifted left m=1 bits. If the bit is a logic `0`, then the value `0` is selected. The selected value outputted from the multiplexer 36-2 is added to an accumulating result inputted from the register 24-1 using an addition register 24-2. The circuits 16 and 18 are pipelined in that they sequentially receive one pair of arguments per cycle, which trickle down from processing element PE1 to processing element PE2, . . . , to processing element PE3 in a fashion analogous to an assembly line.
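The bit-by-bit multiplication described above can be sketched as follows. The sketch follows the procedure exactly as stated in this passage (examine one multiplier bit at a time, add the shifted multiplicand when the bit is `1`, otherwise add `0`), assumes unsigned operands, and does not model the pipeline registers. The function name shift_add_multiply is hypothetical.

```python
def shift_add_multiply(multiplicand, multiplier, bits):
    """Multiply by examining the multiplier one bit at a time.

    Each loop iteration plays the role of one processing element:
    bit m of the multiplier selects either the multiplicand shifted
    left by m positions or the value 0, and the selected value is
    added to the running total. Unsigned operands are ASSUMED.
    """
    total = 0
    for m in range(bits):
        selected = (multiplicand << m) if (multiplier >> m) & 1 else 0
        total += selected
    return total

# Worked example from the text: 6 ('110') times 11 ('1011')
product = shift_add_multiply(6, 11, 4)
print(product, bin(product))   # 66 0b1000010
```

The three nonzero partial terms are 6, 12 and 48 (`110`, `1100` and `110000`), which sum to 66, i.e., `1000010`, in agreement with the example above.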
The circuit 10 is modular and efficient. However, it is desirable to further reduce the chip area occupied by each 1-D DCT or IDCT stage.
It is therefore an object of the present invention to overcome the disadvantages of the prior art.