1. Field of the Invention
The invention relates generally to the field of processor chips and specifically to the field of single-instruction multiple-data (SIMD) processors. More particularly, the present invention relates to efficient calculation of Discrete-Cosine Transform (DCT) operations in a SIMD processor.
2. Description of the Background Art
DCT is used by the major image and video compression standards, including JPEG, MPEG-2, MPEG-4.2, MPEG-4.10 (also known as H.264), VC-1, RealVideo by Real Media, DivX, etc. As such, it is used by all DVDs, and in all terrestrial, cable and satellite TV broadcast systems. DCT processing is also used in Personal Video Recorders (PVRs), mobile audio-video streaming, security applications, video phone and video conferencing applications. DCT is usually implemented as a dedicated hardware block on a System-on-Chip (SoC) that performs video compression and other functions for TVs, set top boxes, DVD players, etc. However, as further video standards are developed, different types and variations of DCT are required. For example, MPEG-2 uses an 8×8 DCT with fractional arithmetic, but the newer video compression standard H.264 requires 4×4 or 8×8 integer DCTs. This requires new hardware blocks to be added, because existing programmable methods are too slow and existing DCT blocks are difficult to change: the operations of the 4×4 integer DCT and the 8×8 fractional DCT are significantly different. Dedicated hardware blocks also have the disadvantage that data must be sent from a programmable processor to the dedicated function block, the processor must wait out the latency of that block, and the results must then be transferred back to processor memory. Such operations are usually dominated by transfer and latency clock cycles. Transferring the 64 elements of an 8×8 block would require 64 clock cycles by itself, not counting the latency of the hardware pipeline calculations and the transfer of output data. Furthermore, as we transition from standard definition to full high definition with 1080P resolution, the performance requirements for video compression data processing go up by a factor of 6×.
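The difference between the two transform types can be illustrated with a minimal sketch. It assumes the commonly published 4×4 forward core transform matrix of H.264 and an orthonormal 8×8 DCT-II for the fractional case. The function names are hypothetical, the post-scaling that H.264 folds into quantization is omitted, and none of this is taken from any particular hardware block.

```python
import math

# Assumed 4x4 forward core transform matrix of H.264 (integers only).
CF4 = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def integer_dct4x4(block):
    """Y = Cf * X * Cf^T using integer arithmetic only (scaling omitted)."""
    return matmul(matmul(CF4, block), transpose(CF4))

def dct_matrix(n):
    """Orthonormal DCT-II basis; its entries are fractional cosine values."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows

def fractional_dct8x8(block):
    """Y = C * X * C^T using fractional (floating-point) coefficients."""
    c = dct_matrix(8)
    return matmul(matmul(c, block), transpose(c))

if __name__ == "__main__":
    flat4 = [[10] * 4 for _ in range(4)]
    flat8 = [[10] * 8 for _ in range(8)]
    print(integer_dct4x4(flat4)[0])            # integer coefficients
    print(round(fractional_dct8x8(flat8)[0][0], 6))  # fractional arithmetic
```

The integer transform needs only additions, subtractions and multiplications by 2, whereas the fractional transform multiplies by cosine constants, which is one reason a block built for one cannot easily be reused for the other.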
Existing SIMD processor architectures do not support efficient implementation of DCT by the processor. For example, the Pentium processor supports the SIMD extensions MMX (Multi-Media Extension), SSE (Streaming SIMD Extension), and SSE2 to accelerate data-intensive applications such as DCT. SSE provides parallelism by a factor of four (64 bits wide), and SSE2 provides parallelism by a factor of eight (128 bits wide). Video decoders perform only the inverse DCT (also referred to as iDCT), whereas video compression encoders perform both forward and inverse DCT operations.
Intel shows that an 8×8 iDCT requires 320 MMX clock cycles and 290 SSE clock cycles (AP-922 Streaming SIMD Extensions - A Fast Precise 8x8 DCT, 4/99, Version 1). However, Intel also shows (AP-945 Using SSE2 to Implement an Inverse Discrete Cosine Transform - Performance Data) that the SSE2 implementation is only 1.31 times faster than the SSE implementation when both are executed on a Pentium 4 processor. This shows diminishing returns on increased parallelism due to architectural limitations, since SSE2 should be twice as fast as SSE given its 2× parallelism.
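The diminishing return can be put in numbers with a short calculation. This is a sketch that uses only the two figures quoted above (a measured speedup of 1.31 and a nominal width ratio of 2); the function name is hypothetical.

```python
def parallel_efficiency(measured_speedup, width_ratio):
    """Fraction of the ideal speedup that was actually achieved."""
    return measured_speedup / width_ratio

SSE2_OVER_SSE_SPEEDUP = 1.31  # measured figure quoted in the text
SSE2_OVER_SSE_WIDTH = 2.0     # nominal parallelism ratio quoted in the text

efficiency = parallel_efficiency(SSE2_OVER_SSE_SPEEDUP, SSE2_OVER_SSE_WIDTH)

if __name__ == "__main__":
    print(f"achieved fraction of ideal speedup: {efficiency:.3f}")
```

In other words, doubling the SIMD width delivered roughly two thirds of the speedup that the added parallelism would suggest.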
Implementing the 4×4 integer DCT puts further strain on the Intel processor. Performance profiling by Kerry Widder (Efficient Implementation of H.264 Transform Operations Using Sub word Parallel Architecture) shows that, for the reference video sequence Girl.264, IDCT requires 4.95% of total processing time and the 4×4 IDCT requires 17% of total processing time. The more complex processing of H.264 (by about a factor of 3-5×), combined with the additional performance requirements of full HD 1080P displays (about a factor of 6×), means that video encode or decode using H.264 cannot be performed even if the whole Pentium processor is dedicated to this purpose. This would also be an expensive solution for consumer TV, set top box and other applications.
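The combined effect of the two factors can be worked out as follows. This is only an illustrative sketch: the standard-definition frame size of 720×480 is an assumption that does not appear in the text, and the 3-5× complexity range is taken directly from the paragraph above.

```python
# Assumption (not from the text): standard definition is a 720x480 frame.
SD_PIXELS = 720 * 480
# Full HD 1080P frame.
HD_PIXELS = 1920 * 1080

# Pixel-count ratio between full HD and the assumed SD frame size.
resolution_factor = HD_PIXELS / SD_PIXELS

# H.264 complexity range quoted in the text (about 3x to 5x).
COMPLEXITY_LOW, COMPLEXITY_HIGH = 3, 5

# Overall increase in processing load relative to the SD baseline.
combined_low = COMPLEXITY_LOW * resolution_factor
combined_high = COMPLEXITY_HIGH * resolution_factor

if __name__ == "__main__":
    print(resolution_factor, combined_low, combined_high)
```

Under these assumptions the processing load grows by roughly 18× to 30× over standard-definition decoding with the older standards.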
AltiVec, the SIMD extension of PowerPC processors, provides 8-wide SIMD and requires about 102 clock cycles for a 2D iDCT, including the function call overhead (Freescale AltiVec Application Note, AltiVec 2D Inverse Discrete Cosine Transform Application Note and Code Examples, 2002).
TriMedia-CPU64 is a 64-bit, 5 issue-slot VLIW core that launches a long instruction every clock cycle (An 8-Point IDCT Computing Resource Implemented on a TriMedia/CPU64 Reconfigurable Functional Unit, Proceedings of PROGRESS 2001, Veldhoven, The Netherlands, Oct. 18, 2001, pp. 211-218). This paper discusses augmenting a general purpose processor with a reconfigurable core, which exploits both the general purpose processor capability and FPGA flexibility to implement application-specific computations. The conclusion of this work is that an 8-point IDCT can be computed in 16 TriMedia cycles.
Texas Instruments TMS320C64x DSPs are the high-performance fixed-point DSP generation in the TMS320C6000 DSP platform, and feature an 8-issue very-long-instruction-word (VLIW) architecture. The C64x DSP core processor has eight independent functional units: 2 multipliers and 6 arithmetic logic units. The C64x can produce four 32-bit multiply-accumulates (MACs) per cycle. An 8×8 IDCT is performed in 135 clock cycles.
Today's SIMD processors perform vector operations between respective elements of two source vectors. For example, a vector-add instruction for a 4-wide SIMD adds respective elements of source #1 and source #2 together, i.e., element #0 of both sources are paired together and added, element #1 of both sources are paired together and added, and so forth. Alternatively, one element of a first source vector is paired with all elements of a second source vector. This is referred to as the broadcast mode. DCT operations, however, require arbitrary pairing of elements from one or two source vectors. Also, some DCT operations require a different operation to be performed for each vector element position.
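The three pairing modes described above can be sketched in scalar code. The function names and the index-mapping representation are hypothetical and are not the instruction set of any processor. The last function models the well-known first butterfly stage of an 8-point fast DCT, in which element i is paired with element 7-i and the lower lanes add while the upper lanes subtract.

```python
import operator

def vector_add(a, b):
    """Standard SIMD mode: lane i of a is paired with lane i of b."""
    return [x + y for x, y in zip(a, b)]

def broadcast_add(a, b, lane):
    """Broadcast mode: one element of b is paired with every lane of a."""
    return [x + b[lane] for x in a]

def mapped_op(src, idx_a, idx_b, ops):
    """Arbitrary pairing: output lane n applies ops[n] to src[idx_a[n]]
    and src[idx_b[n]], so both the pairing and the operation vary per lane."""
    return [op(src[i], src[j]) for i, j, op in zip(idx_a, idx_b, ops)]

def dct8_first_stage(x):
    """First butterfly stage of an 8-point fast DCT:
    lanes 0..3 hold x[i] + x[7-i], lanes 4..7 hold x[i] - x[7-i] for i = 3..0."""
    idx_a = [0, 1, 2, 3, 3, 2, 1, 0]
    idx_b = [7, 6, 5, 4, 4, 5, 6, 7]
    ops = [operator.add] * 4 + [operator.sub] * 4
    return mapped_op(x, idx_a, idx_b, ops)

if __name__ == "__main__":
    print(vector_add([1, 2, 3, 4], [10, 20, 30, 40]))
    print(broadcast_add([1, 2, 3, 4], [10, 20, 30, 40], 2))
    print(dct8_first_stage([1, 2, 3, 4, 5, 6, 7, 8]))
```

With only the first two modes available, the butterfly stage needs extra permute instructions to line up the operands, plus separate add and subtract passes, which is the overhead the paragraph above points to.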