The documents listed below are incorporated herein by reference.
[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC Video Coding Standard,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, pp. 560-576, July 2003.
[2] Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264|ISO/IEC 14496-10 AVC), Geneva, Switzerland, May 2003.
[3] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete Cosine Transform,” IEEE Transactions on Computers, Vol. C-23, pp. 90-93, January 1974.
[4] H. S. Malvar, “Low Complexity Length-4 Transform and Quantization with 16-bit Arithmetic,” ITU-T SG16, Doc. VCEG-N43, September 2001.
[5] H. S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, “Low Complexity Transform and Quantization in H.264/AVC,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, pp. 598-603, July 2003.
[6] H. Malvar, “Fast Computation of the Discrete Cosine Transform and the Discrete Hartley Transform,” IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-35, No. 10, pp. 1484-1485, October 1987.
[7] Y. Arai, T. Agui, and M. Nakajima, “A Fast DCT-SQ Scheme for Images,” Transactions of the IEICE, Vol. E71, No. 9, pp. 1095-1097, November 1988.
[8] W. B. Pennebaker and J. L. Mitchell, JPEG: Still Image Data Compression Standard, Van Nostrand Reinhold, New York, 1992.
[9] T. D. Tran, “The BinDCT: Fast Multiplierless Approximation of the DCT,” IEEE Signal Processing Letters, Vol. 7, pp. 141-145, June 2000.
[10] J. Liang and T. D. Tran, “Fast Multiplierless Approximations of the DCT with the Lifting Scheme,” IEEE Transactions on Signal Processing, Vol. 49, pp. 3032-3044, December 2001.
[11] S. Furber, ARM System-on-Chip Architecture, Addison-Wesley, 2000.
[12] ARM Limited, ARM946E-S Technical Reference Manual, 2001.
H.264/AVC is the latest video compression standard. It was developed by the Joint Video Team (JVT), which includes experts from the Video Coding Experts Group (VCEG) of the International Telecommunication Union (ITU-T) and the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). In ITU-T documents, the formal name of the new video compression standard is ITU-T Recommendation H.264. ISO/IEC calls it ISO/IEC 14496-10 Advanced Video Coding. For short, the new video compression standard is commonly referred to as H.264/AVC.
H.264/AVC has many applications, including: video broadcasting over cable, satellite and DSL; video-on-demand or multimedia streaming services; conversational services over ISDN, Ethernet, LAN, wireless and mobile networks; and interactive or serial storage on optical devices such as DVD.
H.264/AVC was designed for higher coding efficiency. To obtain better compression, the H.264/AVC standard adopted many advanced video coding techniques. For intra coding, H.264/AVC uses a directional spatial prediction scheme to exploit more of the redundancy among pixels within a video frame. For inter coding, H.264/AVC implements multiple reference frames, weighted prediction, a de-blocking filter, variable block sizes, and quarter-sample-accurate motion compensation. For transformation, H.264/AVC uses a small, block-based, integer, hierarchical transform. For entropy coding, H.264/AVC adopts two different coding techniques. Context-adaptive binary arithmetic coding (CABAC) is selected for the main profile, whereas context-adaptive variable length coding (CAVLC) is used for the baseline, main, and extended profiles. The three profiles of H.264/AVC support 15 levels. These levels specify sets of algorithms and parameters for a wide range of video applications.
Note that the integer transform of H.264/AVC has lower complexity than the Discrete Cosine Transform (DCT) used in previous video compression standards. The fifteen levels of H.264/AVC, however, cover a wide range of video formats, from SQCIF (128×96) to 16:9 (4096×2304). For real-time video processing, the number of macroblocks that must be processed per second is very high, which makes a software implementation inefficient. For instance, to process CIF (352×288) video at 30 frames per second in real time, an embedded processor must process 11,880 macroblocks per second, which requires 36,495,360 shift and add instructions per second. Even without counting load, store, and transposition operations, this already amounts to more than 36 million instructions per second (MIPS). This computational complexity is high for most embedded applications. The 16:9 format has a far greater computational complexity than CIF.
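The arithmetic behind the CIF figures can be checked with a short sketch. It assumes 16×16 macroblocks; the per-macroblock instruction count is not stated in the text and is obtained here simply by dividing the two quoted figures. The function name is illustrative only.

```python
MB_SIZE = 16  # assumption: a macroblock covers 16x16 luma samples


def macroblocks_per_second(width, height, fps):
    """Number of 16x16 macroblocks that must be processed each second."""
    return (width // MB_SIZE) * (height // MB_SIZE) * fps


# CIF at 30 frames per second, as in the example above.
cif_rate = macroblocks_per_second(352, 288, 30)

# Shift/add instructions per second quoted in the text.
total_ops = 36_495_360

# Implied cost per macroblock (derived by division, not stated in the text).
ops_per_macroblock = total_ops // cif_rate

# Macroblocks in a single 4096x2304 frame, for comparison with CIF's 396.
large_frame_mbs = macroblocks_per_second(4096, 2304, 1)

print(cif_rate)            # 11880
print(ops_per_macroblock)  # 3072
print(large_frame_mbs)     # 36864
```

A single 4096×2304 frame therefore holds roughly 93 times as many macroblocks as a CIF frame, which is why the larger formats are so much more demanding.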
The Discrete Cosine Transform (DCT) is one of the most important transformations in image and video processing. It has been used in many compression standards, including JPEG, H.261, H.263, MPEG-1, MPEG-2, and MPEG-4. The DCT was first proposed by Ahmed, Natarajan, and Rao in 1974 (see document [3] above). Their landmark paper presents an N-point DCT that can be computed with a 2N-point FFT and some additional post-processing. The one-dimensional DCT maps a vector x of length N into a new vector z of transform coefficients by a linear transformation z=Hx, where H is an N×N matrix such as shown in FIG. 1, and where C1=√(½) cos(π/8), C2=√(½) cos(2π/8), and C3=√(½) cos(3π/8). Let X be a 4×4 input matrix, Y be a 4×4 output matrix, and Ht be the transpose matrix of H. The two-dimensional (2-D) 4×4 forward DCT is then defined as Y=HXHt.
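The definition Y=HXHt can be illustrated with a minimal floating-point sketch. FIG. 1 is not reproduced here, so the layout of H below is an assumption: it is the standard orthonormal 4-point DCT matrix built from C1, C2, and C3 as defined above. The helper names are illustrative.

```python
import math

C1 = math.sqrt(0.5) * math.cos(math.pi / 8)
C2 = math.sqrt(0.5) * math.cos(2 * math.pi / 8)  # equals 1/2
C3 = math.sqrt(0.5) * math.cos(3 * math.pi / 8)

# Assumed layout of H (standard orthonormal 4-point DCT matrix).
H = [
    [C2,  C2,  C2,  C2],
    [C1,  C3, -C3, -C1],
    [C2, -C2, -C2,  C2],
    [C3, -C1,  C1, -C3],
]


def matmul(a, b):
    """Product of two 4x4 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    """Transpose of a 4x4 matrix."""
    return [[m[j][i] for j in range(4)] for i in range(4)]


def dct4x4(x):
    """Two-dimensional 4x4 forward DCT, Y = H X Ht."""
    return matmul(matmul(H, x), transpose(H))


# A constant block has all of its energy in the DC coefficient Y[0][0].
flat = [[1.0] * 4 for _ in range(4)]
Y = dct4x4(flat)
```

Because the rows of this H are orthonormal, the inverse transform is X=HtYH, and every multiplication involves the non-integer constants C1 and C3.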
A basic disadvantage of the 4×4 DCT is that H (FIG. 1) contains irrational entries, namely C1 and C3. Hence, both the forward 4×4 DCT and the inverse 4×4 DCT require floating-point execution units. A floating-point implementation increases the hardware complexity of the coding system.
To resolve this problem, Malvar (see document [4] above) suggested a method that scales the entries of the 4×4 DCT matrix so that the transform can be computed with integer operations. The output results are then rescaled to obtain an approximation of the 4×4 DCT. Malvar used the scaling factor α=2.5 (see documents [4] and [5] above). The resulting scaled matrix K is shown at 21 in FIG. 2. FIG. 3 illustrates the use of the integer transform matrix 21 to perform a 4-point one-dimensional integer transform on a vector x.
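The scaling step can be sketched as follows. FIG. 2 and FIG. 3 are not reproduced here, so two things are assumptions: that the scaled entries α·H are rounded to the nearest integer to form K, and that the one-dimensional transform uses the usual add/shift butterfly. The layout of H is likewise the assumed standard 4-point DCT matrix, and the function names are illustrative.

```python
import math

ALPHA = 2.5  # scaling factor quoted in the text

C1 = math.sqrt(0.5) * math.cos(math.pi / 8)
C2 = math.sqrt(0.5) * math.cos(2 * math.pi / 8)
C3 = math.sqrt(0.5) * math.cos(3 * math.pi / 8)

# Assumed layout of the 4-point DCT matrix H.
H = [
    [C2,  C2,  C2,  C2],
    [C1,  C3, -C3, -C1],
    [C2, -C2, -C2,  C2],
    [C3, -C1,  C1, -C3],
]

# Assumption: K is obtained by rounding each scaled entry to an integer.
K = [[round(ALPHA * h) for h in row] for row in H]


def transform_1d_matrix(x):
    """Reference 1-D transform w = K x using plain multiplications."""
    return [sum(K[i][j] * x[j] for j in range(4)) for i in range(4)]


def transform_1d_butterfly(x):
    """Same 1-D transform using only additions, subtractions, and shifts."""
    s0, s1 = x[0] + x[3], x[1] + x[2]   # sums feed the even outputs
    d0, d1 = x[0] - x[3], x[1] - x[2]   # differences feed the odd outputs
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]


w = transform_1d_butterfly([1, 2, 3, 4])
```

Under these assumptions the butterfly needs eight additions or subtractions and two shifts per 4-point transform, with no multiplications.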
The approximation of a two-dimensional 4×4 DCT is Z=(KXKt)⊗S, where Z is a 4×4 matrix, S is a 4×4 rescaling matrix, and ⊗ denotes element-by-element multiplication. The matrix S is typically incorporated into the quantization stage, which is usually implemented by lookup tables. Therefore, the approximation of a two-dimensional 4×4 DCT can be implemented completely by integer operations. With the matrix S incorporated into the quantization stage, the two-dimensional 4×4 integer transform is W=KXKt.
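The split between the integer core W=KXKt and the rescaling by S can be sketched as below. The text does not give the entries of K or S, so both are assumptions: K is the integer matrix commonly associated with α=2.5, and each entry of S is taken as the reciprocal of the product of the norms of the corresponding rows of K, which normalizes the rows. The function names are illustrative.

```python
import math

# Assumed integer transform matrix K.
K = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def matmul(a, b):
    """Product of two 4x4 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    """Transpose of a 4x4 matrix."""
    return [[m[j][i] for j in range(4)] for i in range(4)]


def integer_transform_4x4(x):
    """Integer core of the transform, W = K X Kt (integers in, integers out)."""
    return matmul(matmul(K, x), transpose(K))


# Assumed rescaling matrix: S[i][j] = 1 / (|row i of K| * |row j of K|).
ROW_NORMS = [math.sqrt(sum(v * v for v in row)) for row in K]
S = [[1.0 / (ri * rj) for rj in ROW_NORMS] for ri in ROW_NORMS]


def rescale(w):
    """Element-by-element product Z = W (x) S."""
    return [[w[i][j] * S[i][j] for j in range(4)] for i in range(4)]


X = [[5, 11, 8, 10], [9, 8, 4, 12], [1, 10, 11, 4], [19, 6, 15, 7]]
W = integer_transform_4x4(X)  # integer-only stage
Z = rescale(W)                # stage that would be folded into quantization
```

With this choice of S, the combined transform is orthonormal, so the sum of squares of Z equals that of X; only the `rescale` step involves non-integer values, which is why it is a natural candidate for folding into quantization lookup tables.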
In view of the computational complexities associated even with integer transform processing of the various video formats supported by H.264/AVC, it is desirable to provide for integer transform processing with reduced computational complexity.