The Discrete Cosine Transform (DCT) is a frequently used operation in many applications. For example, in the area of video image data compression it is advantageous to apply a two-dimensional (2-D) DCT to inputted video data.
The 2-D DCT is defined as: ##EQU1## where x.sub.ij are input elements in an N.times.N matrix, with i,j=0,1,2, . . ., N-1 and where z.sub.mn are output elements in an N.times.N matrix with m,n=0,1,2, . . ., N-1 and where: ##EQU2## The Inverse Discrete Cosine Transform (IDCT) is defined as: ##EQU3## The matrix expression for equation (1) is EQU Z=CXC.sup.t (3)
where X is the input data matrix, C is the cosine transform matrix and C.sup.t is the transpose of the matrix C. If the input matrix X is an N.times.N matrix, then equation (3) may be expressed as: ##EQU4## The computation of the IDCT is similar to that of the DCT except that the matrix C.sup.t is substituted for the matrix C and vice versa. Therefore, the discussion may continue with only the DCT without loss of generality. Equation (3) may be divided into one dimensional transforms as follows: EQU Y=CX (5) EQU Z=YC.sup.t (6)
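The separability expressed in equations (5) and (6) can be illustrated in software. The following sketch is illustrative only and is not part of any circuit disclosed herein; it assumes the common orthonormal normalization of the coefficient matrix C (so that C.sup.-1 =C.sup.t), and the function names are arbitrary.

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT coefficient matrix C:
    C[m, j] = c(m) * cos((2j+1) * m * pi / (2N)),
    with c(0) = sqrt(1/N) and c(m) = sqrt(2/N) for m > 0."""
    C = np.zeros((N, N))
    for m in range(N):
        scale = np.sqrt(1.0 / N) if m == 0 else np.sqrt(2.0 / N)
        for j in range(N):
            C[m, j] = scale * np.cos((2 * j + 1) * m * np.pi / (2 * N))
    return C

def dct_2d(X):
    """2-D DCT as two cascaded 1-D transforms:
    Y = C X (equation (5)), then Z = Y C^t (equation (6))."""
    C = dct_matrix(X.shape[0])
    Y = C @ X        # first 1-D transform
    return Y @ C.T   # second 1-D transform

def idct_2d(Z):
    """IDCT: X = C^t Z C, using C^-1 = C^t for the orthonormal C."""
    C = dct_matrix(Z.shape[0])
    return C.T @ Z @ C
```

Computing Z as two successive matrix products in this way is exactly what the cascaded 1-D transform circuits discussed below implement in hardware.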
where Y is an intermediate product matrix or intermediate matrix and C.sup.t is the transpose of the cosine coefficient matrix C. Likewise from equation (6) the following is also true: EQU Z=CY.sup.t ( 6a)
where Y.sup.t is the transpose of the intermediate matrix Y. (Likewise, for an IDCT, we have: EQU X=C.sup.t Y or X=Y.sup.t C (6b))
There are several conventional techniques and circuits for performing a DCT of an inputted data matrix. N. Ahmed, T. Natarajan & K. Rao, "Discrete Cosine Transform," IEEE Trans. on Computers, vol. C-23, Jan. 1974, p. 90-93 teaches a general DCT technique. N. Cho & S. Lee, "Fast Algorithm and Implementation of 2-D Discrete Cosine Transform," IEEE Trans. on Cir. & Sys., vol. 38, no. 3, Mar. 1991, p. 297-305 discloses a circuit that computes a 2-D N.times.N DCT using N 1-D DCT circuits or with multiplexers and N/2 1-D DCT circuits. M. Sun, T. Chen & A. Gottlieb, "VLSI Implementation of a 16.times.16 Discrete Cosine Transform," IEEE Trans. on Cir. & Sys., vol. 36, no. 4, Apr. 1989, p. 610-17 teaches a 2-D DCT circuit with DCT stages having a memory for storing partial results in the matrix calculations of 1-D DCT's. H. Hou, "A Fast Recursive Algorithm For Computing the Discrete Cosine Transform," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-35, no. 10, Oct. 1987, p. 1455-61 teaches a decimation technique which exploits symmetry properties of the DCT coefficient matrix in order to reduce the number of multiplications of the transformation.

More specifically, equation (5) can be rewritten as follows: ##EQU5## where Y.sub.e is a matrix containing the even rows of the matrix Y, Y.sub.o is a matrix containing the odd rows of the matrix Y, X.sub.f is a matrix containing the front rows (in the upper half) of the matrix X, X.sub.r is a matrix containing the rear rows, in reverse order, (in the lower half) of the matrix X and C.sub.1 and C.sub.2 are N/2.times.N/2 DCT coefficient matrices. For example, in an 8.times.8 DCT, equation (7) may be rewritten as: ##EQU6## Y.sub.0, Y.sub.1, . . ., Y.sub.7 are rows of the intermediate matrix Y and x.sub.0, x.sub.1, . . ., x.sub.7 are rows of the input data matrix X. Likewise, the IDCT may be written in a corresponding decimated form, equation (8), where C.sub.1.sup.t is the transpose of C.sub.1 and C.sub.2.sup.t is the transpose of C.sub.2.
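The decimation of equation (7) can likewise be sketched in software. This is an illustrative sketch, not the circuitry of any reference cited above; it assumes the same orthonormal coefficient matrix as before and exploits the symmetry C[m, N-1-j]=(-1).sup.m C[m, j] of the DCT coefficient matrix.

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT coefficient matrix (illustrative normalization)."""
    C = np.zeros((N, N))
    for m in range(N):
        s = np.sqrt(1.0 / N) if m == 0 else np.sqrt(2.0 / N)
        for j in range(N):
            C[m, j] = s * np.cos((2 * j + 1) * m * np.pi / (2 * N))
    return C

def decimated_1d_dct(X):
    """1-D DCT Y = C X by even/odd decimation (the form of equation (7)).

    Uses the symmetry C[m, N-1-j] = (-1)^m * C[m, j]: even rows of C
    see the sum of mirrored inputs, odd rows see the difference, so
    only two N/2 x N/2 matrix products are needed."""
    N = X.shape[0]
    C = dct_matrix(N)
    C1 = C[0::2, :N // 2]        # even rows of C, left half
    C2 = C[1::2, :N // 2]        # odd rows of C, left half
    Xf = X[:N // 2]              # front (upper-half) rows of X
    Xr = X[N // 2:][::-1]        # rear (lower-half) rows, in reverse order
    Y = np.empty((N, X.shape[1]))
    Y[0::2] = C1 @ (Xf + Xr)     # Y_e: even rows of Y (equation (7a) form)
    Y[1::2] = C2 @ (Xf - Xr)     # Y_o: odd rows of Y  (equation (7b) form)
    return Y
```

Each half-size product performs N/2 multiplications per output element instead of N, which is the 50% saving noted below.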
For example, in an 8.times.8 IDCT, equation (8) may be rewritten as: ##EQU7##
Equations (7), (7a), (7b) and (8), (8a), (8b) are also applicable in determining Z=CY.sup.t as in equation (6a) or its IDCT analog Y=ZC.
Equations (7a), (7b), (8a) and (8b) provide an advantage in that the number of multiplications necessary to determine each 1-D DCT is reduced by 50%.
FIG. 1 shows one 2-D DCT/IDCT architecture. See U.S. Pat. No. 4,791,598. As shown, the circuit has a first 1-D DCT (or IDCT) circuit 3 which receives an input data matrix X (or 2-D transformed matrix Z) on lines n.sub.1. The 1-D DCT circuit 3 transforms the inputted data matrix according to equation (6a) or (7a)-(7b) (or (6b) or (8a)-(8b)) and outputs an intermediate product matrix Y on lines n.sub.2 to a transpose memory 5. The transpose memory 5 transposes the intermediate product matrix Y and outputs the transpose of the intermediate product matrix Y.sup.t on lines n.sub.3 to a second 1-D DCT circuit 7. The second 1-D DCT transform circuit 7 outputs a 2-D transformed output matrix Z according to equation (6a) or (7a)-(7b) (or (6b) or (8a)-(8b)) on lines n.sub.4.
In the case of a DCT circuit, the 1-D DCT circuit outputs a row-column ordered sequence i.e., y.sub.0, y.sub.1, y.sub.2, . . ., y.sub.63. (Herein, as a matter of convenience, an 8.times.8 matrix is assumed wherein matrix elements y.sub.ij, for i,j=0,1,2, . . ., 7 will be referred to using single number subscripts as follows:) ##EQU8## The transpose memory 5 writes out the data in transposed order, in this case column-row order, i.e., y.sub.0, y.sub.8, y.sub.16, y.sub.24, y.sub.32, . . ., y.sub.1, y.sub.9, y.sub.17, y.sub.25, y.sub.33, . . ., y.sub.63. However, an inspection of equations (7a)-(7b) reveals that a 1-D decimation technique DCT circuit 7 requires two elements of the intermediate matrix Y at a time in the order (y.sub.0, y.sub.56), (y.sub.8, y.sub.48), (y.sub.16, y.sub.40), . . ., (y.sub.1, y.sub.57), (y.sub.9, y.sub.49), . . ., (y.sub.31, y.sub.39) (referred to herein as "shuffled column-row order"). Furthermore, it is often desirable to separately apply equations (7a) and (7b) (or (8a) and (8b), in the case of an IDCT) to each data pair successively, rather than simultaneously. This allows using a smaller matrix multiplier in the circuit 7 without significantly affecting the throughput. In such a case, each column of element pairs of the outputted sequence is outputted twice, i.e., the following sequence is outputted: (y.sub.0, y.sub.56), (y.sub.8, y.sub.48), . . ., (y.sub.24, y.sub.32), (y.sub.0, y.sub.56), (y.sub.8, y.sub.48), . . ., (y.sub.24, y.sub.32), (y.sub.1, y.sub.57), (y.sub.9, y.sub.49), . . ., (y.sub.25, y.sub.33), (y.sub.1, y.sub.57), (y.sub.9, y.sub.49), . . ., (y.sub.25, y.sub.33), . . ., (y.sub.31, y.sub.39).
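The shuffled column-row order described above can be generated by a short index computation. This sketch is illustrative only (the function name is arbitrary) and simply enumerates, for each column, each element paired with its mirror element in the same column.

```python
def shuffled_column_row_pairs(N=8):
    """Single-subscript index pairs in "shuffled column-row order":
    for each column c, pair the element in row r with the element in
    row N-1-r of the same column, for r = 0 .. N/2 - 1."""
    pairs = []
    for c in range(N):
        for r in range(N // 2):
            pairs.append((r * N + c, (N - 1 - r) * N + c))
    return pairs
```

For N=8 the sequence begins (0, 56), (8, 48), (16, 40), (24, 32), (1, 57), . . . and ends with (31, 39), matching the order above; repeating each column's group of four pairs twice yields the doubled sequence used when equations (7a) and (7b) are applied successively.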
The sequence outputted from the transpose memory is received at a pre-processing circuit 7a containing an ALU 7b and shift-in parallel-out registers 7c. The shift-in parallel-out registers 7c receive each column outputted from the transpose memory 5 and output two elements, in parallel, in shuffled column-row order.
In the case of an IDCT, the 1-D IDCT circuit 3 outputs the elements of the matrix Y in the order y.sub.0, y.sub.7, y.sub.1, y.sub.6, y.sub.2, y.sub.5, . . ., y.sub.8, y.sub.15, y.sub.9, y.sub.14, . . ., y.sub.59, y.sub.60 (referred to herein as "shuffled row-column order"). The transpose memory thus outputs the elements in transposed order, i.e., shuffled column-row order y.sub.0, y.sub.56, y.sub.8, y.sub.48, y.sub.16, y.sub.40, . . ., y.sub.1, y.sub.57, y.sub.9, y.sub.49, . . ., y.sub.31, y.sub.39. However, an inspection of equations (8a)-(8b) reveals that a 1-D IDCT decimation technique circuit requires two elements per cycle in column-row order, i.e., (y.sub.0, y.sub.8), (y.sub.16, y.sub.24), . . ., (y.sub.1, y.sub.9), (y.sub.17, y.sub.25), . . ., (y.sub.55, y.sub.63) (or, advantageously, repeated column-row order: (y.sub.0, y.sub.8), (y.sub.16, y.sub.24), . . ., (y.sub.48, y.sub.56), (y.sub.0, y.sub.8), (y.sub.16, y.sub.24), . . ., (y.sub.48, y.sub.56), (y.sub.1, y.sub.9), (y.sub.17, y.sub.25), . . ., (y.sub.49, y.sub.57), (y.sub.1, y.sub.9), (y.sub.17, y.sub.25), . . ., (y.sub.49, y.sub.57), . . ., (y.sub.55, y.sub.63)). Again, the shift-in parallel-out registers 7c of the pre-processor circuit 7a receive each column in shuffled column-row order outputted from the transpose memory 5 and output two elements per cycle in column-row order.
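The shuffled row-column order produced by a decimated 1-D IDCT stage can be enumerated the same way. Again this is an illustrative sketch with an arbitrary function name: within each row, the front elements are interleaved with the rear elements taken in reverse.

```python
def shuffled_row_column_order(N=8):
    """Single-subscript indices in "shuffled row-column order":
    within each row, interleave element k with element N-1-k,
    for k = 0 .. N/2 - 1."""
    order = []
    for row in range(N):
        base = row * N
        for k in range(N // 2):
            order.append(base + k)            # front element of the row
            order.append(base + (N - 1 - k))  # mirrored rear element
    return order
```

For N=8 this yields 0, 7, 1, 6, 2, 5, 3, 4, then 8, 15, 9, 14, . . ., ending with 59, 60, matching the sequence above.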
It is desirable to implement the architecture shown in FIG. 1 in a fashion which is fully pipelined, i.e., such that the flow of intermediate matrices Y from the first 1-D DCT 3, to the transpose memory 5, to the second 1-D DCT 7 is continuous without stoppages. However, the transpose memory 5 is a potential bottleneck. This is because of the actual transpose operation performed by the transpose memory 5. For example, data may be written in the transpose memory 5 in row-column order. The data is then read out in column-row or shuffled column-row order. The time to write the data in the transpose memory 5 is T and the time to read out the data from the transpose memory 5 is also T. Thus, a conventional transpose memory utilizes a time 2T to process each matrix Y.
In order to reduce this processing time, several conventional transpose memories have been proposed. In a first conventional architecture, two transpose memories are provided which are used to alternately process each inputted intermediate matrix Y. That is, during a first phase, a first transpose memory writes a first intermediate matrix Y1 therein according to, e.g., row-column ordering. During a second phase, the second transpose memory writes a second intermediate matrix Y2 therein according to row-column ordering. While the second intermediate matrix Y2 is being written, the first transpose memory reads out the intermediate matrix Y1 according to column-row ordering. Likewise, during a third phase, the first transpose memory writes a third intermediate matrix Y3 therein in row-column order and the second transpose memory reads out the second intermediate matrix Y2 in column-row order. This transpose memory is disadvantageous because two memories are required.
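The two-memory alternation can be modeled behaviorally as follows. This is an illustrative sketch (class and method names are arbitrary), not any patented circuit: while one buffer is written in row-column order, the other is read out in column-row order, so once the pipeline fills, one matrix is consumed per period T at the cost of two full matrix memories.

```python
class PingPongTranspose:
    """Behavioral model of the two-transpose-memory architecture."""

    def __init__(self, N=8):
        self.N = N
        self.bufs = [[0] * (N * N), [0] * (N * N)]
        self.write_sel = 0  # which buffer is currently being written

    def process(self, matrix_elems):
        """Accept one N*N matrix in row-column (row-major) order and
        return the matrix written on the previous call in column-row
        (transposed) order.  The first call returns the initial
        (zero) buffer contents."""
        N = self.N
        wbuf = self.bufs[self.write_sel]
        rbuf = self.bufs[1 - self.write_sel]
        # read the previously written matrix out in column-row order
        out = [rbuf[r * N + c] for c in range(N) for r in range(N)]
        # in hardware this write proceeds concurrently with the read
        for i, v in enumerate(matrix_elems):
            wbuf[i] = v
        self.write_sel = 1 - self.write_sel  # swap roles for the next matrix
        return out
```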
FIG. 2 shows a different conventional approach which is disclosed in U.S. Pat. No. 4,791,598. As shown, the transpose memory 5 has a column of shift registers 13 and a column of parallel registers 15 connected at its output. The matrix Y is shifted into the shift register column 13 in column-row order so that each shift register 13-1, 13-2, . . ., 13-N receives a corresponding row of matrix elements of Y. When the entire matrix is loaded into the shift register column 13, the individual registers 13-1, 13-2, . . ., 13-N transfer their contents in parallel to corresponding parallel registers 15-1, 15-2, . . ., 15-N connected thereto. From there, the data in the parallel registers may be fed to a combined core processor-ALU 17. In the meantime, the shift register column 13 is immediately available to receive the next intermediate product matrix Y. This architecture is disadvantageous because many space wasting registers are required.
U.S. Pat. No. 5,053,985 teaches another architecture which is shown in FIG. 3. An input data matrix X is processed by a 1-D DCT circuit 20 which can process data at twice the rate at which the matrix X is inputted. As shown, the 1-D DCT circuit 20 includes a first circuit containing pre-registers and an ALU 30 for preprocessing the data, multiplier and accumulators 35 for performing row-column matrix multiplication and a second circuit containing post-registers and an ALU 40 for post-processing the data. The intermediate matrix Y produced by the 1-D DCT circuit 20 is stored in a RAM 60, e.g., in row-column order. The data is then read out of the RAM 60 in column-row order and processed again by the 1-D DCT circuit 20 at twice the rate at which the matrix X is inputted. However, while a 100-MHz DCT processor is known from S. Uramoto, Y. Inoue, A. Takabatake, J. Takeda, Y. Yamashita, H. Terane and M. Yoshimoto, "A 100-MHz 2-D Discrete Cosine Transform Core Processor," IEEE J. of Solid State Cir., vol. 27, no. 4, Apr. 1992, p. 429-99, the input data rate is limited to 1/2 of the processor speed.
U.S. Pat. Nos. 5,042,007 and 5,177,704 show transpose memory architectures which use shift registers connected to form FIFO (first-in first-out) memories. In both of these patents, the transpose memories themselves are formed using (N.sup.2 +1) n-bit shift registers and (2N.sup.2 -2) 2-to-1 multiplexers. The multiplexers and shift registers are connected in a configuration for transposing an inputted intermediate matrix. Likewise, U.S. Pat. No. 4,903,231 teaches another transpose memory formed by interconnecting N.sup.2 shift registers and approximately N.sup.2 multiplexers. U.S. Pat. No. 4,903,231 also teaches that elements may be shifted in and out in row-column or column-row order. These architectures are disadvantageous because many space consuming registers and multiplexers are required.
U.S. Pat. No. 4,769,790 teaches a transpose circuit in which the columns (rows) of an inputted matrix are inputted in parallel to a delay circuit which delays each column a different number of cycles. The delayed columns are sequentially inputted to a distribution circuit which rotates the delayed elements received each cycle a number of columns depending on the cycle in which the elements are received. The rotated columns are outputted to a second delay circuit which delays each of the rotated columns. Like the other architectures mentioned above, this architecture requires many space consuming shift registers in each delay circuit and thus occupies a great deal of space.
U.S. Pat. No. 4,918,527 teaches a transpose memory in which a memory is partitioned into two separately accessible halves or planes. Data is written in one half plane while data is read out of the other half plane. Furthermore, the data is stored in the two half-planes such that data of odd rows are permuted two-by-two, i.e., in each odd row, the intermediate matrix is stored such that the elements of each pair of columns are swapped. The disadvantage of this architecture is that a pre-processing unit, having many registers, is required to convert the outputted sequence of individual matrix elements into a sequence of matrix element pairs. Furthermore, the architecture requires extensive space consuming pre- and post-decoders including at least nine logic gates for each stored matrix element.
The primary disadvantage of the prior art architectures is that they occupy a great deal of space. This is an important consideration if the 2-D DCT or IDCT circuit is to be formed on a single integrated circuit (IC) chip. Furthermore, none of the above architectures provides for fully flexible write and read orderings. That is, none of the above architectures provides for writing a matrix into, and subsequently reading the matrix from, the transpose memory according to any combination of read and write orderings (i.e., row-column, column-row, shuffled row-column, shuffled column-row). At the very least, the pre- and post-processing registers and ALUs, transpose memory and 1-D transform circuits must be completely emptied before the order in which the input matrices are received, or in which the output matrices are outputted, can be changed.
It is therefore an object of the present invention to overcome the disadvantages of the prior art.