The first implementation of the Discrete Cosine Transform (DCT) and the Inverse Discrete Cosine Transform (IDCT) was introduced by N. Ahmed, T. Natarajan and K. R. Rao (N. Ahmed, T. Natarajan, and K. R. Rao; Discrete Cosine Transform; IEEE Transactions on Computers, 90–93, 1974). The algorithm introduced by the Ahmed reference requires a large number of calculations to achieve an accurate result. This first implementation was advanced by the DCT and IDCT algorithm developed by W. Chen, C. H. Smith and S. C. Fralick (W. Chen, C. H. Smith, and S. C. Fralick; A Fast Computational Algorithm for the Discrete Cosine Transform; IEEE Transactions on Communications, COM-25(9):1004–1009, 1977). The Chen algorithm improved upon the Ahmed algorithm but still requires numerous calculations.
An increasing number of microprocessors provide instructions and associated hardware to accelerate the execution of multimedia applications. The multimedia extensions implemented in such microprocessors can be based on the Single Instruction Multiple Data (SIMD) mode of computing. Hitachi has produced such a microprocessor, named the SH5. The SH5 utilizes the SIMD mode, which allows the SH5 to execute the same instruction simultaneously on up to four different data values.
The two-dimensional 8×8 IDCT is a commonly used function in various video decompression applications. Some multimedia standards, like MPEG-2, require a certain level of IDCT accuracy as specified in the IEEE 1180 compliance test (IEEE Standard Specifications for the Implementation of 8×8 Inverse Discrete Cosine Transform, IEEE Std. 1180-1990). The brute-force IDCT solution for an 8×8 matrix, as is well known in the art, requires 4096 multiplications and 3584 additions.
For a given 2D DCT sequence [X(m,n), 0≦m,n≦N−1], the 2D IDCT sequence [x(i,j), 0≦i,j≦N−1] is determined as:

$$x(i,j) = \frac{4}{N^2} \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} c(m)\,c(n)\,X(m,n)\,\cos\left\{\frac{(2i+1)m\pi}{2N}\right\} \cos\left\{\frac{(2j+1)n\pi}{2N}\right\}$$

where

$$c(k) = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{for } k = 0 \\ 1 & \text{otherwise} \end{cases}$$

Generally, the separability property of IDCT can be exploited while computing 2D IDCT by performing 1D IDCT on the input matrix in one direction (for example, by row) and then doing another 1D IDCT on the output of the first in an opposite direction (by column).
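The separability property can be checked numerically. The following is a minimal sketch in plain Python, assuming the 4/N² normalization of the formula above and c(0) = 1/√2; the function names `idct2_direct`, `idct1_scaled`, and `idct2_separable` are illustrative and do not come from the cited references. It is meant to show the row-column idea, not an optimized or IEEE 1180 compliant implementation.

```python
import math

def c(k):
    # Normalization factor: 1/sqrt(2) for the DC term, 1 otherwise (assumed).
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

def idct2_direct(X):
    """Evaluate the 2D IDCT double sum directly: N^4 terms (4096 for N = 8)."""
    N = len(X)
    scale = 4.0 / (N * N)
    out = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            acc = 0.0
            for m in range(N):
                for n in range(N):
                    acc += (c(m) * c(n) * X[m][n]
                            * math.cos((2 * i + 1) * m * math.pi / (2 * N))
                            * math.cos((2 * j + 1) * n * math.pi / (2 * N)))
            out[i][j] = scale * acc
    return out

def idct1_scaled(v):
    """1D IDCT of one vector, carrying half of the 2D scale factor (2/N)."""
    N = len(v)
    return [(2.0 / N) * sum(c(k) * v[k]
                            * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                            for k in range(N))
            for n in range(N)]

def transpose(M):
    return [list(row) for row in zip(*M)]

def idct2_separable(X):
    """Row-column method: 1D IDCT along one direction, then along the other."""
    # Pass 1: transform every row X[m][:] over n, giving T[m][j].
    T = [idct1_scaled(row) for row in X]
    # Pass 2: transform every column of T over m, giving cols[j][i].
    cols = [idct1_scaled(col) for col in transpose(T)]
    return transpose(cols)
```

Under these assumptions, an input whose only nonzero coefficient is X(0,0) = 16 yields a flat block with every sample equal to 0.5 (up to floating-point rounding), and both functions agree on arbitrary inputs. The separable version performs 2N one-dimensional transforms instead of evaluating N⁴ terms.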
For a given DCT sequence [X(k), 0≦k≦N−1], the 1D IDCT sequence [x(n), 0≦n≦N−1] is defined as

$$x(n) = \sum_{k=0}^{N-1} X(k)\,\cos\left\{\frac{(2n+1)k\pi}{2N}\right\}$$

where the multiplying constant has been neglected and X(0) has been manipulated (pre-scaled so that the factor c(0) no longer appears explicitly). Thus, for N=8, this can be viewed as an 8×8 matrix times an 8×1 vector.
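The matrix-vector view can be written out directly. The sketch below, in plain Python, builds the 8×8 cosine matrix and applies it to an 8×1 input; the names `C` and `idct1_matvec` are illustrative, and the input is assumed to already have X(0) pre-scaled as described above.

```python
import math

N = 8

# Coefficient matrix of the 1D IDCT with the constant factor dropped:
# C[n][k] = cos((2n + 1) * k * pi / (2N)).
C = [[math.cos((2 * n + 1) * k * math.pi / (2 * N)) for k in range(N)]
     for n in range(N)]

def idct1_matvec(X):
    """Compute x = C * X as a plain 8x8 matrix times 8x1 vector product."""
    return [sum(C[n][k] * X[k] for k in range(N)) for n in range(N)]
```

Evaluated this way, one 1D transform costs N² = 64 multiplications and N(N−1) = 56 additions, which is the baseline that fast algorithms aim to reduce. A DC-only input such as [1, 0, 0, 0, 0, 0, 0, 0] produces a constant output of ones, since the first column of the matrix is cos(0) = 1.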
Chen's algorithms assume floating-point data types (referred to as real in the Chen reference). Further, the Chen reference discusses neither the implementation of the algorithms nor the limitations of the algorithms that result from implementation.
Chen's DCT algorithm involves only floating-point operations and is applicable for any N where N is a power of 2. The generalization consists of alternating sine/cosine butterfly matrices with binary matrices to reorder matrix elements in a form that preserves a recognizable bit-reversed pattern at every other node. The computational complexity of Chen's algorithm is

$$\frac{3N}{2}\left(\log_2 N - 1\right) + 2$$

floating-point additions and

$$N \log_2 N - \frac{3N}{2} + 4$$

floating-point multiplications for N inputs.
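As a quick check of these formulas, the operation counts can be evaluated for specific transform sizes. The sketch below is plain Python; the function name `chen_op_counts` is illustrative, and the logarithm is assumed to be base 2, which is the reading that reproduces the figures quoted for the 8-point case.

```python
def chen_op_counts(N):
    """Return (additions, multiplications) from the complexity formulas, N a power of 2."""
    if N < 2 or N & (N - 1):
        raise ValueError("N must be a power of 2 (and at least 2)")
    log_n = N.bit_length() - 1            # exact log2 for powers of two
    additions = (3 * N // 2) * (log_n - 1) + 2
    multiplications = N * log_n - 3 * N // 2 + 4
    return additions, multiplications

print(chen_op_counts(8))   # (26, 16)
```

For N = 8 this gives 26 additions and 16 multiplications per 1D transform, compared with 56 additions and 64 multiplications for a direct 8×8 matrix-vector product.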
Chen's algorithm requires 16 multiplications and 26 additions per 1D 8×1 IDCT. This raw complexity, although much better than brute force, is inferior to that of many other IDCT algorithms. A complexity estimate of a simple implementation of Chen's IDCT algorithm on a parallel processor or microprocessor, such as an SH5, is shown below. This implementation assumes the inputs to be 16 bits wide, and expands every intermediate product of two 16-bit inputs to 32 bits in order to maintain an accuracy that meets the IEEE 1180 standard.

Brute force non-optimized cycle count analysis:
In one direction:
Initialization (load constants, setup pointers): 20
    Load inputs: 8
    Shifting of inputs: 8
    Stage 1:
        Multiplications: 16
        Additions: 8
        Rounding Additions: 8
        Shifts: 8
        Conversions: 4
        Subtotal: 44
    Stage 2:
        Multiplications: 2 + 2 + 4 + 4 = 12
        Additions: 1 + 1 + 2 + 2 + 1 + 1 + 1 + 1 = 10
        Rounding Additions: 2 + 2 + 2 + 2 = 8
        Conversions: 1 + 1 + 1 + 1 = 4
        Subtotal: 42
    Stage 3:
        Multiplications: 2 + 2 = 4
        Additions: 1 + 1 + 1 + 1 + 1 + 1 = 6
        Rounding Additions: 2 + 2 = 4
        Shifts: 2 + 2 = 4
        Conversions: 1 + 1 = 2
        Subtotal: 20
    Stage 4:
        Additions: 8
    Total in one iteration: 130
Total in one direction: 130*2=260
Transpose: 32
Total in the other direction: 2*(44+42+20+8)=2*114=228
Transpose: 32
Clipping: 32
Store output: 16
Total cycle count for 2D (8×8) IDCT: 20+260+32+228+32+32+16=620 cycles
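The arithmetic of the cycle estimate above can be retraced in a few lines. The following Python sketch only re-adds the subtotals as listed; the constant and function names are illustrative, and the comment on the two iterations per direction reflects an assumed reading (four data values processed per SIMD instruction, eight rows or columns per direction).

```python
# Per-stage subtotals as listed in the cycle count analysis.
STAGE_SUBTOTALS = {"stage 1": 44, "stage 2": 42, "stage 3": 20, "stage 4": 8}

INIT = 20            # load constants, set up pointers (counted once)
LOAD_INPUTS = 8      # per iteration, first direction only
SHIFT_INPUTS = 8     # per iteration, first direction only
ITERATIONS = 2       # iterations per direction (assumed: 8 vectors, 4 per SIMD op)
TRANSPOSE = 32       # performed once after each direction
CLIPPING = 32
STORE_OUTPUT = 16

def total_cycles():
    """Re-add the listed figures to obtain the 2D (8x8) IDCT cycle estimate."""
    stages = sum(STAGE_SUBTOTALS.values())                                # 114
    first_direction = ITERATIONS * (LOAD_INPUTS + SHIFT_INPUTS + stages)  # 260
    second_direction = ITERATIONS * stages                                # 228
    return (INIT + first_direction + TRANSPOSE
            + second_direction + TRANSPOSE + CLIPPING + STORE_OUTPUT)

print(total_cycles())  # 620
```

Of the 620 cycles, 488 are spent in the two 1D passes and 64 in the two transposes, so the transposes alone account for roughly a tenth of the estimate.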
A number of algorithms exist that reduce the computational complexity of the 8×8 IDCT. However, the irregular memory access patterns of most of these algorithms make them poorly suited to efficient implementation. In addition, there is no efficient and effective method for computing an IDCT that meets the IEEE 1180 accuracy constraints. Intel Corporation has published an implementation of IDCT using MMX instructions in an application note (Using MMX Instructions in a Fast IDCT Algorithm for MPEG Decoding; Application Note, http://developer.intel.com/drg/mmx/appnotes/ap528.htm). However, this implementation is not compliant with the IEEE 1180 standard.