The inverse discrete cosine transform (IDCT) is an algorithm commonly used in digital video and image processing. Currently, IDCT operations require substantial computational overhead. For example, consider a two-dimensional IDCT operation for an 8×8 macroblock of pixels. An input 8×8 matrix is organized into eight rows of eight data elements. With reference to FIG. 1A, a prior art IDCT processor 100 is shown. An input matrix is received, wherein data elements from each row 101 are fed into a one-dimensional (1D) IDCT engine 110, and the result is then stored in a transpose memory 120. As shown in FIG. 1B, the IDCT operations on each input matrix row 101 are performed within 1D IDCT engine 110 by multiplier 140 and accumulator 150, under the control of controller 130 in conjunction with registers 160.
After the operations on all eight rows are completed, the contents of transpose memory 120 are reorganized such that row data becomes column data, as shown in FIG. 1C. The transposed data elements are then input into 1D IDCT engine 110 a second time to obtain output matrix rows 102 of a two-dimensional matrix as the final results of a two-dimensional inverse discrete cosine transform (IDCT) operation, as shown in FIG. 1A.
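The row pass, transpose, and second pass described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`idct_1d`, `transpose`, `idct2d_row_column`) are hypothetical, the 1D transform assumes the common orthonormal 8-point IDCT scaling, and the final transpose that restores the natural orientation of the result is an assumption not stated in the text.

```python
import math

N = 8  # block size assumed from the 8x8 example in the text


def idct_1d(x):
    """Direct 8-point 1D IDCT (assumed orthonormal scaling)."""
    out = []
    for n in range(N):
        acc = 0.0
        for k in range(N):
            c = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
            acc += c * x[k] * math.cos((2 * n + 1) * k * math.pi / 16.0)
        out.append(0.5 * acc)
    return out


def transpose(m):
    """Swap rows and columns, standing in for the transpose memory."""
    return [list(col) for col in zip(*m)]


def idct2d_row_column(block):
    # First pass: 1D IDCT on each of the eight input rows.
    first_pass = [idct_1d(row) for row in block]
    # Transpose so that column data is presented as rows.
    transposed = transpose(first_pass)
    # Second pass: the same 1D engine processes the transposed rows.
    second_pass = [idct_1d(row) for row in transposed]
    # Assumption: transpose once more to restore the natural orientation.
    return transpose(second_pass)
```

For instance, a block whose only nonzero coefficient is a DC value of 8 in the top-left position yields a flat output block in which every element equals 1 under this scaling.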
The operations can be illustrated by examining an input row of an 8×8 matrix supplied to a 1D IDCT engine, e.g., input row=[x0 x1 x2 x3 x4 x5 x6 x7]. The input row is separated into two rows, e.g., a first row Xi=[x0 x2 x4 x6] and a second row Xe=[x1 x3 x5 x7]. Likewise, an output row of the IDCT operation, e.g., output row=[o0 o1 o2 o3 o4 o5 o6 o7], is separated into two rows, e.g., a first row Oi=[o0 o1 o2 o3] and a second row Oe=[o7 o6 o5 o4]. The correlations between the input rows and output rows can be expressed as shown in Equations 1 and 2:
Oi=(0.5)(P*Xi+Q*Xe)  (1)
Oe=(0.5)(P*Xi−Q*Xe)  (2)
where P and Q are 4×4 matrices of IDCT conversion constants. These operations are generally referred to as very large-scale integration (VLSI) IDCT operations. An example of a VLSI IDCT operation is described in “VLSI Design and Implementation of Different DCT Architectures for Image Compression,” by Sherif T. Eid. Because the major computations involved in such an operation are multiplications, the complexity can be readily measured by simply determining the number of multiplications required to complete the calculations described above.
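As an illustration of Equations 1 and 2, the following Python sketch builds one possible pair of constant matrices P and Q and checks the split computation against a direct 8-point IDCT. The specific entries of P and Q are an assumption (they follow from the common orthonormal IDCT definition and are not given in the text), and the names `idct_constants`, `idct8_split`, and `idct8_direct` are hypothetical.

```python
import math


def idct_constants():
    """Assumed 4x4 constant matrices for the orthonormal 8-point IDCT."""
    # P multiplies Xi = [x0 x2 x4 x6]; the x0 column carries the 1/sqrt(2) factor.
    P = [[(1.0 / math.sqrt(2.0) if j == 0 else 1.0)
          * math.cos((2 * n + 1) * (2 * j) * math.pi / 16.0)
          for j in range(4)] for n in range(4)]
    # Q multiplies Xe = [x1 x3 x5 x7].
    Q = [[math.cos((2 * n + 1) * (2 * j + 1) * math.pi / 16.0)
          for j in range(4)] for n in range(4)]
    return P, Q


def matvec(m, v):
    """4x4 matrix times 4-element vector (16 multiplications)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def idct8_split(x):
    P, Q = idct_constants()
    xi = x[0::2]          # Xi = [x0 x2 x4 x6]
    xe = x[1::2]          # Xe = [x1 x3 x5 x7]
    p_xi = matvec(P, xi)  # computed once, shared by both equations
    q_xe = matvec(Q, xe)
    oi = [0.5 * (a + b) for a, b in zip(p_xi, q_xe)]  # Equation 1: [o0 o1 o2 o3]
    oe = [0.5 * (a - b) for a, b in zip(p_xi, q_xe)]  # Equation 2: [o7 o6 o5 o4]
    return oi + oe[::-1]  # reassemble [o0 ... o7]


def idct8_direct(x):
    """Reference 8-point IDCT used only to check the split form."""
    return [0.5 * sum((1.0 / math.sqrt(2.0) if k == 0 else 1.0) * x[k]
                      * math.cos((2 * n + 1) * k * math.pi / 16.0)
                      for k in range(8)) for n in range(8)]
```

The split form works because the cosine terms for output o(7−n) equal those for o(n) at even input indices and are negated at odd input indices, which is why Oe is listed in reverse order.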
The computations of Oi and Oe according to Equations 1 and 2 can be carried out simultaneously, because both use the same calculated products P*Xi and Q*Xe. Each of these products of a 4×4 matrix and a four-element row requires 16 multiplications, so generating the first set of computational data elements for the transpose memory takes (16+16)*8=256 multiplication operations. To generate the complete (e.g., two-dimensional) computational results, the total number of multiplication operations is 2*256=512. Assuming IDCT engine 110 has only one multiplier processing circuit, one adder, and one accumulator, it will take at least 512 clock cycles to generate a complete 8×8 result matrix.
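The multiplication count above can be reproduced with a short calculation. This is a minimal sketch under the text's own counting convention, in which only the P*Xi and Q*Xe products are counted and the scaling by 0.5 is ignored; the function name `multiplication_counts` and the assumption of one multiplication per clock cycle per multiplier are illustrative.

```python
def multiplication_counts(n=8, multipliers=1):
    """Count multiplications for a row-column 2D IDCT on an n x n block."""
    half = n // 2
    per_row = 2 * half * half    # P*Xi and Q*Xe: 16 + 16 for n = 8
    per_pass = per_row * n       # one pass over all n rows
    total = 2 * per_pass         # row pass plus column pass
    # Assumption: each multiplier completes one multiplication per cycle.
    min_cycles = -(-total // multipliers)  # ceiling division
    return per_row, per_pass, total, min_cycles


print(multiplication_counts())  # (32, 256, 512, 512)
```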
Furthermore, the configuration and method of writing data to and retrieving data from a conventional random access memory (RAM) to carry out a matrix transpose operation also require a large number of additional clock cycles. Specifically, the matrix transpose operation in a conventional IDCT operation is implemented with a RAM that requires multiple clock cycles, first to write the data elements arranged as an array of row data and then to retrieve the data sequentially as an array of column data, in order to transpose the matrix. These clock cycle requirements adversely affect the speed of data processing and the display of image data.
For example, consider a conventional 4×4 RAM consisting of four sets of 4×1 RAM, where the first set of 4×1 RAM is employed to store a data array represented by [d11 d12 d13 d14] and the last set of 4×1 RAM stores a data array represented by [d41 d42 d43 d44]. In a matrix transpose operation, each data array is first stored as a row and then read out as a column. The operations are illustrated in matrix transpose operation 180 of FIG. 1D. Matrix transpose operation 180 requires four clock cycles to write the four rows of data arrays into the 4×4 matrix. Similarly, using a conventional 8×8 RAM, an 8×8 matrix transpose operation requires eight clock cycles.
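The write-rows-then-read-columns behavior described above can be modeled with a small Python sketch. The class name `TransposeRam`, the cycle counters, and the assumption that one row is written (and one column read) per clock cycle are illustrative; the text itself states only the number of write cycles.

```python
class TransposeRam:
    """Toy model of an n x n transpose memory built from n row-wide banks."""

    def __init__(self, n):
        self.n = n
        self.banks = [[None] * n for _ in range(n)]
        self.write_cycles = 0
        self.read_cycles = 0

    def write_row(self, r, row):
        # Assumption: one full row is written per clock cycle.
        self.banks[r] = list(row)
        self.write_cycles += 1

    def read_column(self, c):
        # Assumption: one full column is read per clock cycle.
        self.read_cycles += 1
        return [self.banks[r][c] for r in range(self.n)]


def transpose_via_ram(matrix):
    n = len(matrix)
    ram = TransposeRam(n)
    for r, row in enumerate(matrix):
        ram.write_row(r, row)                        # store data as rows
    out = [ram.read_column(c) for c in range(n)]     # read back as columns
    return out, ram.write_cycles, ram.read_cycles


d = [["d%d%d" % (r, c) for c in range(1, 5)] for r in range(1, 5)]
transposed, writes, reads = transpose_via_ram(d)
print(transposed[0], writes)  # ['d11', 'd21', 'd31', 'd41'] 4
```

Under this model the number of write cycles grows with the matrix dimension: four for a 4×4 matrix and eight for an 8×8 matrix, matching the figures given in the text.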
Accordingly, a need exists for a method or system for performing a more efficient inverse discrete cosine transform (IDCT) operation that applies an adaptive matrix element-trimming algorithm to eliminate unnecessary multiplications and thereby reduce the waste of computational resources. Furthermore, a need exists for a method or system for performing a more efficient IDCT operation that improves the configuration of the memory (e.g., RAM), and the manner in which data is written to and retrieved from it to restructure the transposed matrix, such that the clock cycle requirement can be reduced.