Discrete Cosine Transform
A discrete cosine transform (DCT) is a transform with an orthonormal, separable frequency basis, similar to a Fourier transform. The introduction of the DCT was an important advance for image compression. The DCT can be regarded as a discrete-time version of the Fourier-cosine transform and is closely related to the discrete Fourier transform (DFT). Unlike the DFT, the DCT is real-valued and provides a better approximation of a signal with fewer coefficients. With the DCT, each block of an image is transformed into a block of coefficients.
The DCT is widely used in image compression applications. For example, a 2D DCT is used for still image compression, moving image compression, and video-telephony coding techniques. Image compression standards are described in greater detail below. The energy compaction property of the DCT is well suited for image compression because in most images the energy is concentrated in the low to middle frequencies to which the human visual system is more sensitive.
The DCT helps separate the image into spectral sub-bands of differing importance with respect to the image's visual quality. The DCT is similar to the DFT in that the DCT transforms a signal or image from the spatial domain to the frequency domain.
For an M×N input image y, the two-dimensional DCT coefficients of the output Y are defined as:

$$Y(u,v) = \frac{2\,c_u c_v}{\sqrt{MN}} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} y(m,n)\, F_{2M}^{u}(m)\, F_{2N}^{v}(n),$$

where the multipliers are

$$F_{\beta}^{\alpha}(\lambda) = \cos\!\left(\frac{2\lambda+1}{\beta}\,\alpha\pi\right) \quad\text{and}\quad c_k = \begin{cases} \dfrac{1}{\sqrt{2}}, & k = 0 \\ 1, & k \neq 0, \end{cases}$$

and where:
M = number of rows in the input data set,
N = number of columns in the input data set,
m = row index in the spatial domain, 0 ≤ m ≤ M−1,
n = column index in the spatial domain, 0 ≤ n ≤ N−1,
y(m,n) = spatial domain data,
u = row index in the frequency domain,
v = column index in the frequency domain, and
Y(u,v) = frequency domain coefficient.
In the above equation, y(m, n) is the input signal or original image in the spatial domain, and Y(u, v) is the output signal or converted image in the frequency domain. The function F(.) is used to simplify the notation.
The inverse of the function in the above equation can be used to reconstruct a signal in the spatial domain. Thus, the two-dimensional inverse DCT (2D IDCT) is defined as follows:

$$y(m,n) = \frac{2}{\sqrt{MN}} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} c_u c_v\, Y(u,v)\, F_{2M}^{u}(m)\, F_{2N}^{v}(n).$$
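The forward and inverse definitions above can be checked numerically. The following is a minimal sketch, not an optimized implementation: it evaluates both double sums directly, and the function names `c`, `F`, `dct2`, and `idct2` are chosen here for illustration only. Because the basis is orthonormal, applying `idct2` to the output of `dct2` should return the input up to floating-point error.

```python
import math

def c(k):
    """Normalization factor c_k: 1/sqrt(2) for k = 0, otherwise 1."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

def F(alpha, beta, lam):
    """Cosine multiplier F_beta^alpha(lambda) = cos((2*lambda + 1) * alpha * pi / beta)."""
    return math.cos((2 * lam + 1) * alpha * math.pi / beta)

def dct2(y):
    """Forward 2D DCT of an M x N list of lists, straight from the definition."""
    M, N = len(y), len(y[0])
    scale = 2.0 / math.sqrt(M * N)
    return [[scale * c(u) * c(v) *
             sum(y[m][n] * F(u, 2 * M, m) * F(v, 2 * N, n)
                 for m in range(M) for n in range(N))
             for v in range(N)]
            for u in range(M)]

def idct2(Y):
    """Inverse 2D DCT of an M x N list of lists, straight from the definition."""
    M, N = len(Y), len(Y[0])
    scale = 2.0 / math.sqrt(M * N)
    return [[scale *
             sum(c(u) * c(v) * Y[u][v] * F(u, 2 * M, m) * F(v, 2 * N, n)
                 for u in range(M) for v in range(N))
             for n in range(N)]
            for m in range(M)]

if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6]]          # a 2 x 3 example, so M != N
    restored = idct2(dct2(image))
    error = max(abs(restored[m][n] - image[m][n])
                for m in range(2) for n in range(3))
    print("largest round-trip error:", error)
```

The direct evaluation costs on the order of (MN)² multiplications, which is acceptable for checking the formulas but not for production use.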
For many applications of interest, M and N have the same value. Substituting M=N in the first equation gives:

$$Y(u,v) = \frac{2\,c_u c_v}{N} \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} y(m,n)\, F_{2N}^{u}(m)\, F_{2N}^{v}(n).$$

This equation defines an N×N point DCT function. In many applications, such as video compression, the input video signal is usually partitioned into basic square blocks, and the DCT is performed on these blocks of data. The 8×8 block DCT is most commonly used in image compression applications because it offers a reasonable trade-off between computational complexity and compression efficiency.
Substituting N=8 in the preceding equation results in the following function:

$$Y(u,v) = \frac{c_u c_v}{4} \sum_{m=0}^{7} \sum_{n=0}^{7} y(m,n)\, F_{16}^{u}(m)\, F_{16}^{v}(n).$$
Note that the above cosine functions F(.) are not data dependent, although they depend on the dimension of the input data. In case of a fixed input block size of 8×8, the cosine values can be pre-calculated, depending on the DCT process used for a particular implementation.
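As an illustration of the pre-calculation described above, the sketch below builds the 64 cosine values F_16^u(m) once and reuses them for every 8×8 block. It also exploits the separability of the transform by running a row pass followed by a column pass. The names `COS`, `C`, and `dct8x8` are hypothetical, and the row-column ordering is one possible design choice rather than a method taken from any particular implementation.

```python
import math

N = 8

# COS[u][m] = F_16^u(m) = cos((2m + 1) * u * pi / 16).
# The table depends only on the block size, so it is computed once.
COS = [[math.cos((2 * m + 1) * u * math.pi / 16) for m in range(N)]
       for u in range(N)]

# C[k] = c_k: 1/sqrt(2) for k = 0, otherwise 1.
C = [1.0 / math.sqrt(2.0)] + [1.0] * (N - 1)

def dct8x8(block):
    """8x8 forward DCT using the precomputed table and two 1D passes."""
    # Row pass: for every row m, sum over the column index n.
    rows = [[sum(block[m][n] * COS[v][n] for n in range(N)) for v in range(N)]
            for m in range(N)]
    # Column pass: for every frequency column v, sum over the row index m,
    # then apply the c_u * c_v / 4 scale from the N = 8 formula.
    return [[C[u] * C[v] / 4.0 * sum(rows[m][v] * COS[u][m] for m in range(N))
             for v in range(N)]
            for u in range(N)]

if __name__ == "__main__":
    flat = [[128] * N for _ in range(N)]
    coeffs = dct8x8(flat)
    print("DC coefficient of a flat block of 128s:", coeffs[0][0])
```

The two-pass form needs 2 × 8 × 64 multiply-accumulate operations per block instead of the 64 × 64 of the direct double sum, which is one reason table-driven, separable implementations are common.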
Video Compression Standards
Video compression enables the storing, transmitting, and processing of visual information with fewer storage, network, and processor resources. The Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO) generates standards for the compression of digital video, i.e., a temporal sequence of images or “frames,” and of audio. In particular, MPEG defines a standard compressed bitstream, which implicitly defines encoders, decoders, and transcoders.
For video signals, MPEG compression removes spatial redundancy within a video frame and temporal redundancy between video frames. As in the JPEG standard for still image compression, DCT-based compression, as described above, is used to reduce spatial redundancy. Motion-compensation is used to exploit temporal redundancy. The images in a video stream usually do not change much during small time intervals. The idea of motion-compensation is to base the encoding of a video frame on other temporally adjacent frames.
The most widely used video compression standards include MPEG-1 for storage and retrieval of moving pictures. MPEG-1 allows analog video and audio signals to be compressed by ratios in the range of 50:1 to 100:1, depending on the complexity of the input signal and the desired quality. MPEG-2 is used for digital television and enables high-quality video compression at higher data rates than MPEG-1. The H.263 standard is used for video conferencing. See ISO/IEC 11172-2:1993, “Information Technology—Coding of Moving Pictures and Associated Audio for Digital Storage Media up to about 1.5 Mbit/s—Part 2: Video”; D. LeGall, “MPEG: A Video Compression Standard for Multimedia Applications,” Communications of the ACM, Vol. 34, No. 4, pp. 46–58, 1991; ISO/IEC 13818-2:1996, “Information Technology—Generic Coding of Moving Pictures and Associated Audio Information—Part 2: Video,” 1994; ITU-T SG XV, DRAFT H.263, “Video Coding for Low Bitrate Communication,” 1996; and ITU-T SG XVI, DRAFT13 H.263+Q15-A-60 rev.0, “Video Coding for Low Bitrate Communication,” 1997.
These standards are relatively low-level specifications that primarily deal with the spatial compression of images or frames, and the spatial and temporal compression of sequences of frames. As a common feature, these standards perform compression on a per-frame basis. With these standards, one can achieve high compression ratios for a wide range of applications.
Newer video coding standards, such as MPEG-4 for multimedia applications, see ISO/IEC 14496-2:1999, “Information technology—coding of audio/visual objects, Part 2: Visual,” allow arbitrary-shaped objects to be encoded and decoded as separate video object planes (VOP). The objects can be visual, audio, natural, synthetic, primitive, compound, or combinations thereof. Also, a significant number of error-resilience features are built into this standard to allow for robust transmission across error-prone channels, such as wireless channels.
The emerging MPEG-4 standard is intended to enable multimedia applications, such as interactive video, where natural and synthetic materials are integrated, and where access is universal. In the context of video transmission, these compression standards are needed to minimize the resources used, such as network bandwidths, memories, and processors.
A great deal of effort has been placed on systems and methods that enable devices to transmit the content robustly and to adapt the quality of the content to the available resources. After the content is encoded, it is sometimes necessary to further transcode the bitstream to a lower bit rate or a lower spatial resolution before it is transmitted through the network.
Compressed Video Bitstreams
A video bitstream is a sequence of video frames. Each frame is a still image. A video player displays one frame after another, usually at a rate close to thirty frames per second (fps), e.g., 23.976, 24, 25, 29.97, or 30 fps.
Pixels of frames are digitized in a standard RGB format, 24 bits per pixel, i.e., 8 bits each for the red, green, and blue channels. MPEG-1 is designed to produce bit rates of 1.5 Mbps or less, and is intended to be used with images of size 352×288 at 24–30 fps. Uncompressed, this source format results in data rates of 55.7–69.6 Mbps.
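The uncompressed data rates quoted above follow from simple arithmetic, sketched below. One assumption is needed to reproduce the figures exactly: the "M" is taken as 2^20 bits (with 10^6 the range would instead be roughly 58.4 to 73.0). The helper name `raw_rate_mbps` is hypothetical.

```python
WIDTH, HEIGHT = 352, 288      # frame size from the text
BITS_PER_PIXEL = 24           # 8 bits each for R, G, and B
MEGA = 2 ** 20                # assumption: binary "mega", matches the quoted range
TARGET_MBPS = 1.5             # MPEG-1 design target from the text

def raw_rate_mbps(fps):
    """Uncompressed data rate in Mbps for the given frame rate."""
    return WIDTH * HEIGHT * BITS_PER_PIXEL * fps / MEGA

low, high = raw_rate_mbps(24), raw_rate_mbps(30)
print(f"{low:.1f} to {high:.1f} Mbps uncompressed")   # 55.7 to 69.6 Mbps uncompressed
print(f"reduction needed: about {low / TARGET_MBPS:.0f}:1 to {high / TARGET_MBPS:.0f}:1")
```

Under the same assumption, reaching the 1.5 Mbps target from raw RGB requires a reduction of roughly 37:1 to 46:1.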
MPEG-1 operates on images represented in YUV color space (Y, Cr, Cb). If an image is represented in RGB format, then the image must first be converted to YUV format. In YUV format, images are also represented in 24 bits per pixel, with 8 bits for the luminance information (Y) and 8 bits each for the two chrominance components (U and V). The YUV format is then subsampled. All luminance information is retained. However, chrominance information is subsampled 2:1 in both the horizontal and vertical directions. Thus, there are on average two bits per pixel each of U and V information. This subsampling does not drastically affect quality because the human visual system is more sensitive to luminance than to chrominance information. Subsampling is a lossy compression step. The 24 bits of RGB information are reduced to 12 bits of YUV information, which automatically gives 2:1 compression. Technically speaking, the MPEG-1 format is 4:2:0 YCrCb.
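The 12 bits per pixel figure can be verified with a short sketch. The text does not say how the chrominance samples are combined, so averaging each 2×2 neighborhood is an assumption made here for illustration; the function names are hypothetical, and even frame dimensions are assumed.

```python
def subsample_420(plane):
    """Halve a chroma plane horizontally and vertically.

    Assumption: each output sample is the rounded mean of a 2x2 neighborhood.
    The plane is a list of rows with even width and height.
    """
    h, w = len(plane), len(plane[0])
    return [[(plane[r][c] + plane[r][c + 1] +
              plane[r + 1][c] + plane[r + 1][c + 1] + 2) // 4
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]

def bits_per_pixel_420(width, height, bits=8):
    """Average bits per pixel after 2:1 chroma subsampling in both directions."""
    luma = width * height * bits                        # Y kept at full resolution
    chroma = 2 * (width // 2) * (height // 2) * bits    # U and V, one quarter each
    return (luma + chroma) / (width * height)

if __name__ == "__main__":
    u_plane = [[10, 20, 30, 40],
               [10, 20, 30, 40]]
    print(subsample_420(u_plane))              # [[15, 35]]
    print(bits_per_pixel_420(352, 288))        # 12.0
```

Each chroma plane keeps one sample per four pixels, so U and V contribute 2 bits per pixel each on top of the 8 bits of Y.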
Frames can be partitioned into 16×16 pixel macroblocks, and each macroblock has four 8×8 luminance blocks and two 8×8 chrominance blocks, i.e., one for U and one for V. Macroblocks are the units for motion-compensated compression. Blocks are used for DCT compression.
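The partitioning can be sketched as plain index arithmetic. The example below is illustrative only: the names `macroblock_grid` and `luma_blocks` are hypothetical, the frame is a list of rows of luminance samples, and frame dimensions are assumed to be multiples of 16.

```python
MB = 16    # macroblock size in luminance samples
BLK = 8    # DCT block size

def macroblock_grid(width, height):
    """Number of macroblock rows and columns (dimensions assumed multiples of 16)."""
    return height // MB, width // MB

def luma_blocks(frame, mb_row, mb_col):
    """Return the four 8x8 luminance blocks of one macroblock.

    Order: top-left, top-right, bottom-left, bottom-right.
    """
    top, left = MB * mb_row, MB * mb_col
    return [[row[left + dx:left + dx + BLK]
             for row in frame[top + dy:top + dy + BLK]]
            for dy in (0, BLK) for dx in (0, BLK)]

if __name__ == "__main__":
    rows, cols = macroblock_grid(352, 288)
    print(rows, cols, rows * cols)     # 18 22 396
```

For a 352×288 frame this gives 396 macroblocks. With 4:2:0 subsampling, the 16×16 luminance area of a macroblock corresponds to a single 8×8 block in each chrominance plane, which is why each macroblock carries six blocks in total.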
Frame Types
In the compressed domain, frames can be encoded according to three types: intra-frames (I-frames), forward predicted frames (P-frames), and bi-directional predicted frames (B-frames).
I-Frame
An I-frame is encoded as a single image, with no reference to any past or future frames. The encoding scheme used is similar to JPEG compression. Each 8×8 block is encoded independently, with one exception described below. The block is first transformed from the spatial domain into the frequency domain using the DCT, which separates the signal into independent frequency bands. Most of the signal energy is concentrated in the low-frequency coefficients in the upper left corner of the resulting 8×8 block. After this, the data are quantized according to a quantizing parameter (QP).
Quantization
Quantization can be thought of as ignoring lower-order bits, though in reality this process is more complicated. Quantization is the only lossy part of the whole compression process, other than subsampling. The resulting data are then run-length encoded in a zig-zag order to optimize compression. This zig-zag ordering produces longer runs of 0's by taking advantage of the fact that there should be little high-frequency information, i.e., a greater number of 0's, as one zig-zags from the upper left corner towards the lower right corner of the 8×8 block. The exception to independence is that the coefficient in the upper left corner of the block, which is called the DC coefficient, is encoded relative to the DC coefficient of the previous block, as in differential pulse code modulation (DPCM) coding.
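The steps described above (quantization, zig-zag scan, run-length coding of the AC coefficients, and differential coding of the DC coefficient) can be sketched as follows. This is a deliberate simplification and not the quantizer of any standard: dividing every coefficient by a single step of 2·QP and truncating toward zero is an assumption made for illustration, as are all function names and the `"EOB"` end-of-block marker.

```python
def zigzag_order(n=8):
    """(row, col) positions of an n x n block in zig-zag scan order."""
    # Walk the anti-diagonals; odd diagonals run downward, even ones upward.
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

def quantize(coeffs, qp):
    """Assumed uniform quantizer: divide by 2*qp and truncate toward zero."""
    step = 2 * qp
    return [[int(value / step) for value in row] for row in coeffs]

def run_length(levels):
    """Encode a list as (zero_run, level) pairs followed by an end-of-block marker."""
    pairs, run = [], 0
    for level in levels:
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    pairs.append("EOB")     # trailing zeros are implied by the marker
    return pairs

def encode_block(coeffs, qp, prev_dc):
    """Return (dc_difference, ac_pairs, dc_level) for one 8x8 coefficient block."""
    levels = quantize(coeffs, qp)
    scan = [levels[r][c] for r, c in zigzag_order(len(coeffs))]
    dc = scan[0]
    return dc - prev_dc, run_length(scan[1:]), dc

if __name__ == "__main__":
    block = [[0.0] * 8 for _ in range(8)]
    block[0][0], block[0][1], block[2][0] = 1024.0, -33.0, 17.0
    print(encode_block(block, qp=8, prev_dc=60))
```

In this example only three of the 64 coefficients survive quantization, so the whole block is represented by one DC difference, two (run, level) pairs, and the end-of-block marker.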
P-Frame
A P-frame is encoded relative to the past reference frame. A reference frame is a P- or I-frame. The past reference frame is the closest preceding reference frame in time. Each macroblock in a P-frame can be encoded either as an I-macroblock or as a P-macroblock. An I-macroblock is encoded just like a macroblock in an I-frame. A P-macroblock is encoded as a 16×16 area of the past reference frame, plus an error term. To specify the 16×16 area of the reference frame, a motion vector is included.
Motion Vector
A motion vector (0, 0) means that a 16×16 area is in the same position as the macroblock being encoded. Other motion vectors are relative to the position of that block. Motion vectors may include half-pixel values, in which case pixels are averaged. The error term is encoded using the DCT, quantization, and run-length encoding. A macroblock may also be skipped, which is equivalent to a (0, 0) vector and an all-zero error term. A search for a good motion vector, i.e., one that gives a small error term and good compression, is at the heart of any MPEG video encoder. This search is the primary factor that impacts the performance of encoders.
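The standards define how a motion vector is interpreted, not how an encoder finds it. As one common possibility, the sketch below performs an exhaustive search over integer displacements and scores each candidate by its sum of absolute differences (SAD). The cost measure, the search range, and the names `sad` and `full_search` are assumptions for illustration; half-pixel refinement is omitted.

```python
def sad(cur, ref, cy, cx, ry, rx, size):
    """Sum of absolute differences between a block in cur and one in ref."""
    return sum(abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
               for i in range(size) for j in range(size))

def full_search(cur, ref, top, left, size=16, search=7):
    """Exhaustive integer-pixel motion search.

    Returns ((dy, dx), cost) for the displacement within +/- search samples
    whose reference area best matches the block at (top, left) in cur.
    Candidates that fall outside the reference frame are skipped.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = top + dy, left + dx
            if ry < 0 or rx < 0 or ry + size > height or rx + size > width:
                continue
            cost = sad(cur, ref, top, left, ry, rx, size)
            if best is None or cost < best[1]:
                best = ((dy, dx), cost)
    return best

if __name__ == "__main__":
    ref = [[100 * r + c for c in range(12)] for r in range(12)]
    # Current frame: the reference content moved by one row and two columns.
    cur = [[ref[min(r + 1, 11)][max(c - 2, 0)] for c in range(12)] for r in range(12)]
    print(full_search(cur, ref, top=4, left=4, size=4, search=2))
```

The cost of this search grows with the square of the search range, which illustrates why motion estimation dominates encoder run time and why practical encoders often use faster, non-exhaustive strategies.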
B-Frame
A B-frame is encoded relative to the past reference frame, the future reference frame, or both frames. The future reference frame is the closest following reference frame (I or P). The encoding for B-frames is similar to P-frames, except that motion vectors can refer to areas in the future reference frames. For macroblocks that use both past and future reference frames, the two 16×16 areas are averaged.
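The averaging of the two reference areas can be sketched as below. Rounding halves upward is an assumption made here for illustration, and the function names are hypothetical. The error term is then the difference between the actual macroblock and this prediction.

```python
def predict_bidirectional(past_area, future_area):
    """Average two equally sized areas sample by sample (halves rounded up)."""
    return [[(p + f + 1) // 2 for p, f in zip(past_row, future_row)]
            for past_row, future_row in zip(past_area, future_area)]

def error_term(block, prediction):
    """Residual that would be DCT-coded: actual samples minus the prediction."""
    return [[b - q for b, q in zip(block_row, pred_row)]
            for block_row, pred_row in zip(block, prediction)]

if __name__ == "__main__":
    past = [[10, 20], [30, 40]]
    future = [[20, 21], [30, 44]]
    prediction = predict_bidirectional(past, future)
    print(prediction)                                        # [[15, 21], [30, 42]]
    print(error_term([[16, 20], [30, 42]], prediction))      # [[1, -1], [0, 0]]
```

When the content changes smoothly between the two reference frames, the average is often closer to the current frame than either reference alone, which leaves a smaller error term to encode.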
A number of techniques are known in the prior art for reducing the spatial resolution of video signals, see U.S. Pat. No. 5,737,019 “Method and apparatus for changing resolution by direct DCT mapping,” U.S. Pat. No. 5,926,573 “MPEG bit-stream format converter for changing resolution,” U.S. Pat. No. 6,025,878 “Method and apparatus for decoding both high and standard definition video signals using a single video decoder,” U.S. Pat. No. 6,005,623 “Image conversion apparatus for transforming compressed image data of different resolutions wherein information is scaled,” and U.S. Pat. No. 6,104,434 “Video coding apparatus and decoding apparatus.”
For MPEG-2, Sun et al., in “Architectures for MPEG compressed bitstream scaling,” IEEE Transactions on Circuits and Systems for Video Technology, April 1996, described four methods of rate reduction, with varying complexity and architecture. A number of transcoders are described by Vetro et al., in U.S. patent application Ser. No. 09/853,394, “Video Transcoder with Spatial Resolution Reduction,” filed on May 5, 2001. Also see Assuncao et al., “A frequency domain video transcoder for dynamic bit-rate reduction of MPEG-2 bitstreams,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 953–957, 1998.
Many of these methods produce bitstreams at a reduced spatial resolution that lack quality, or do so with high complexity. Also, proper consideration has not been given to the means by which reconstructed macroblocks are formed. This can impact both quality and complexity, and is especially important when considering reduction factors other than two. Moreover, some of these methods do not specify any architectural details. Most of the attention is spent on various means of scaling motion vectors by a factor of two.
Therefore, it is desired to provide a method for decoding video bitstreams that overcomes the problems of the prior art methods for spatial resolution reduction. Furthermore, it is desired to provide a balance between complexity and quality in the decoder.