Video compression enables the storing, transmitting, and processing of visual information with fewer storage, network, and processor resources. The most widely used video compression standards include MPEG-1 for storage and retrieval of moving pictures, MPEG-2 for digital television, and H.263 for video conferencing, see ISO/IEC 11172-2:1993, “Information Technology—Coding of Moving Pictures and Associated Audio for Digital Storage Media up to about 1.5 Mbit/s—Part 2: Video,” D. LeGall, “MPEG: A Video Compression Standard for Multimedia Applications,” Communications of the ACM, Vol. 34, No. 4, pp. 46-58, 1991, ISO/IEC 13818-2:1996, “Information Technology—Generic Coding of Moving Pictures and Associated Audio Information—Part 2: Video,” 1994, ITU-T SG XV, DRAFT H.263, “Video Coding for Low Bitrate Communication,” 1996, ITU-T SG XVI, DRAFT13 H.263+Q15-A-60 rev.0, “Video Coding for Low Bitrate Communication,” 1997.
These standards are relatively low-level specifications that primarily deal with a spatial compression of images or frames, and the spatial and temporal compression of sequences of frames. As a common feature, these standards perform compression on a per frame basis. With these standards, one can achieve high compression ratios for a wide range of applications.
Newer video coding standards, such as MPEG-4 for multimedia applications, see ISO/IEC 14496-2:1999, “Information technology—coding of audio/visual objects, Part 2: Visual,” allow arbitrary-shaped objects to be encoded and decoded as separate video object planes (VOP). The objects can be visual, audio, natural, synthetic, primitive, compound, or combinations thereof. Also, there is a significant amount of error resilience features built into this standard to allow for robust transmission across error-prone channels, such as wireless channels.
The emerging MPEG-4 standard is intended to enable multimedia applications, such as interactive video, where natural and synthetic materials are integrated, and where access is universal. In the context of video transmission, these compression standards are needed to reduce the amount of bandwidth consumed on networks. The network can be a wireless network or the Internet. In either case, the network has limited capacity, and contention for scarce resources should be minimized.
A great deal of effort has been placed on systems and methods that enable devices to transmit the content robustly and to adapt the quality of the content to the available network resources. When the content is encoded, it is sometimes necessary to further decode the bitstream before it can be transmitted through the network at a lower bit-rate or resolution.
As shown in FIG. 1, this can be accomplished by a transcoder 100. In the simplest implementation, the transcoder 100 includes a cascaded decoder 110 and encoder 120. A compressed input bitstream 101 is fully decoded at an input bit-rate Rin, then encoded at an output bit-rate Rout 102 to produce the output bitstream 103. Usually, the output rate is lower than the input rate. In practice, full decoding and full encoding in a transcoder are not done due to the high complexity of encoding the decoded bitstream.
Earlier work on MPEG-2 transcoding has been published by Sun et al., in “Architectures for MPEG compressed bitstream scaling,” IEEE Transactions on Circuits and Systems for Video Technology, April 1996. There, four methods of rate reduction, with varying complexity and architecture, were described.
FIG. 2 shows a first example method 200, which is referred to as an open-loop architecture. In this architecture, the input bitstream 201 is only partially decoded. More specifically, macroblocks of the input bitstream are variable-length decoded (VLD) 210 and inverse quantized 220 with a fine quantizer Q1, to yield discrete cosine transform (DCT) coefficients. Given the desired output bit-rate 202, the DCT blocks are re-quantized by a coarser level quantizer Q2 of the quantizer 230. These re-quantized blocks are then variable-length coded (VLC) 240, and a new output bitstream 203 at a lower rate is formed. This scheme is much simpler than the scheme shown in FIG. 1 because the motion vectors are re-used and an inverse DCT operation is not needed. Note that here the choice of Q1 and Q2 depends strictly on the rate characteristics of the bitstream. Other factors, such as the spatial characteristics of the bitstream, are not considered.
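The re-quantization step of the open-loop architecture can be illustrated with a minimal Python sketch. It assumes a plain uniform scalar quantizer with step sizes q1 and q2; the function names are hypothetical, and the quantization matrices and variable-length coding of an actual MPEG-2 bitstream are omitted.

```python
def dequantize(levels, step):
    """Inverse quantization: map integer levels back to DCT coefficient values."""
    return [level * step for level in levels]


def quantize(coeffs, step):
    """Uniform quantization, rounding half away from zero (an assumed rule)."""
    return [int(c / step + (0.5 if c >= 0 else -0.5)) for c in coeffs]


def open_loop_requantize(levels_q1, q1, q2):
    """Inverse quantize with the fine step q1, then re-quantize with the coarser q2."""
    coeffs = dequantize(levels_q1, q1)
    return quantize(coeffs, q2)


# Levels coded with step 4 are re-coded with step 8.
print(open_loop_requantize([10, -3, 1, 0], q1=4, q2=8))  # [5, -2, 1, 0]
```

Nothing in this sketch feeds the re-quantization error back into later frames, which is the property that distinguishes the open-loop architecture from the closed-loop one.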
FIG. 3 shows a second example method 300. This method is referred to as a closed-loop architecture. In this method, the input video bitstream is again partially decoded, i.e., macroblocks of the input bitstream are variable-length decoded (VLD) 310, and inverse quantized 320 with Q1 to yield discrete cosine transform (DCT) coefficients 321. In contrast to the first example method described above, correction DCT coefficients 332 are added 330 to the incoming DCT coefficients 321 to compensate for the mismatch produced by re-quantization. This correction improves the quality of the reference frames that will eventually be used for decoding. After the correction has been added, the newly formed blocks are re-quantized 340 with Q2 to satisfy a new rate, and variable-length coded 350, as before. Note, again Q1 and Q2 are rate based.
To obtain the correction component 332, the re-quantized DCT coefficients are inverse quantized 360 and subtracted 370 from the original partially decoded DCT coefficients. This difference is transformed to the spatial domain via an inverse DCT (IDCT) 365 and stored into a frame memory 380. The motion vectors 381 associated with each incoming block are then used to recall the corresponding difference blocks, such as in motion compensation 290. The corresponding blocks are then transformed via the DCT 332 to yield the correction component. A derivation of the method shown in FIG. 3 is described in “A frequency domain video transcoder for dynamic bit-rate reduction of MPEG-2 bitstreams,” by Assuncao et al., IEEE Transactions on Circuits and Systems for Video Technology, pp. 953-957, 1998.
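The feedback of the re-quantization error can be sketched in Python under strong simplifying assumptions: all motion vectors are zero, so the IDCT, frame memory lookup, and DCT reduce to carrying the error forward unchanged in the DCT domain; the quantizer is a plain uniform one; and the error is measured on the corrected coefficients. The function names are hypothetical, and this is an illustration of the idea rather than the architecture of FIG. 3.

```python
def quantize(coeffs, step):
    """Uniform quantization, rounding half away from zero (an assumed rule)."""
    return [int(c / step + (0.5 if c >= 0 else -0.5)) for c in coeffs]


def closed_loop_requantize(frames, q2):
    """Re-quantize a sequence of coefficient blocks while feeding the error back.

    frames: list of blocks, each a list of inverse-quantized DCT coefficients.
    Assumes zero motion, so the stored error applies to the co-located block.
    """
    error = [0.0] * len(frames[0])  # stands in for the frame memory
    output = []
    for coeffs in frames:
        corrected = [c + e for c, e in zip(coeffs, error)]   # add correction
        levels = quantize(corrected, q2)                     # re-quantize with Q2
        reconstructed = [level * q2 for level in levels]     # inverse quantize
        error = [c - r for c, r in zip(corrected, reconstructed)]  # new mismatch
        output.append(levels)
    return output


print(closed_loop_requantize([[3.0], [3.0], [3.0]], q2=8))  # [[0], [1], [0]]
```

In this example an open-loop re-quantizer would output zero for every block, so the lost energy would accumulate; with the feedback, the accumulated error is eventually coded and stays bounded.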
Assuncao et al. also described an alternate method for the same task. In the alternative method, they used a motion compensation (MC) loop operating in the frequency domain for drift compensation. Approximate matrices were derived for fast computation of the MC blocks in the frequency domain. A Lagrangian optimization was used to calculate the best quantizer scales for transcoding. That alternative method removed the need for the IDCT/DCT components.
According to prior art compression standards, the number of bits allocated for encoding texture information is controlled by a quantization parameter (QP). The above methods are similar in that changing the QP based on information that is contained in the original bitstream reduces the rate of texture bits. For an efficient implementation, the information is usually extracted directly from the compressed domain and can include measures that relate to the motion of macroblocks or residual energy of DCT blocks. The methods described above are only applicable for bit-rate reduction.
Besides bit-rate reduction, other types of transformation of the bitstream can also be performed. For example, object-based transformations have been described in U.S. patent application Ser. No. 09/504,323, “Object-Based Bitstream Transcoder,” filed on Feb. 14, 2000 by Vetro et al. Transformations on the spatial resolution have been described in “Heterogeneous video transcoding to lower spatio-temporal resolutions, and different encoding formats,” IEEE Transaction on Multimedia, June 2000, by Shanableh and Ghanbari.
It should be noted that these methods produce bitstreams at a reduced spatial resolution that lack quality, or are accomplished with high complexity. Also, proper consideration has not been given to the means by which reconstructed macroblocks are formed. This can impact both the quality and complexity, and is especially important when considering reduction factors different than two. Moreover, these methods do not specify any architectural details. Most of the attention is spent on various means of scaling motion vectors by a factor of two.
FIG. 4 shows the details of a method 400 for transcoding an input bitstream to an output bitstream 402 at a lower spatial resolution. This method is an extension of the method shown in FIG. 1, but with the details of the decoder 110 and encoder 120 shown, and a down-sampling block 410 between the decoding and encoding processes. The decoder 110 performs a partial decoding of the bitstream. The down-sampler reduces the spatial resolution of groups of partially decoded macroblocks. Motion compensation 420 in the decoder uses the full-resolution motion vectors mvf 421, while motion compensation 430 in the encoder uses low-resolution motion vectors mvr 431. The low-resolution motion vectors are either estimated from the down-sampled spatial domain frames yn1 403, or mapped from the full-resolution motion vectors. Further details of the transcoder 400 are described below.
FIG. 5 shows the details of an open-loop method 500 for transcoding an input bitstream 501 to an output bitstream 502 at a lower spatial resolution. In this method, the video bitstream is again partially decoded, i.e., macroblocks of the input bitstream are variable-length decoded (VLD) 510 and inverse quantized 520 to yield discrete cosine transform (DCT) coefficients. These steps are well known.
The DCT macroblocks are then down-sampled 530 by a factor of two by masking the high frequency coefficients of each 8×8 (2^3×2^3) luminance block in the 16×16 (2^4×2^4) macroblock to yield four 4×4 DCT blocks, see U.S. Pat. No. 5,262,854, “Low-resolution HDTV receivers,” issued to Ng on Nov. 16, 1993. In other words, down-sampling turns a group of blocks, for example four, into a group of four blocks of a smaller size.
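The masking operation can be sketched in Python as follows. The sketch assumes each 8×8 DCT block is stored as a list of rows with the low-frequency coefficients in the top-left corner; the function names are hypothetical, and any rescaling that a particular DCT normalization may require is left out.

```python
def mask_high_frequencies(block8, keep=4):
    """Keep only the keep x keep low-frequency corner of a DCT block."""
    return [row[:keep] for row in block8[:keep]]


def downsample_macroblock(luma_blocks):
    """Turn the four 8x8 luminance DCT blocks of a macroblock into four 4x4 blocks."""
    return [mask_high_frequencies(block) for block in luma_blocks]


# Example: a block whose coefficient at (row, col) is row * 8 + col.
block = [[r * 8 + c for c in range(8)] for r in range(8)]
small = downsample_macroblock([block] * 4)
print(small[0])  # [[0, 1, 2, 3], [8, 9, 10, 11], [16, 17, 18, 19], [24, 25, 26, 27]]
```

The output is still four blocks per macroblock, only smaller, which is why a further step is needed to re-form a compliant 16×16 macroblock.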
Because down-sampling is performed in the transcoder, the transcoder must take additional steps to re-form a compliant 16×16 macroblock, which involves transformation back to the spatial domain, then again to the DCT domain. After the down-sampling, blocks are re-quantized 540 using the same quantization level, and then variable length coded 550. No methods have been described to perform rate control on the reduced resolution blocks.
To perform motion vector mapping 560 from full 559 to reduced 561 motion vectors, several methods suitable for frame-based motion vectors have been described in the prior art. To map from four frame-based motion vectors, i.e., one for each macroblock in a group, to one motion vector for the newly formed 16×16 macroblock, simple averaging or median filters can be applied. This is referred to as a 4:1 mapping.
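A 4:1 mapping can be sketched in Python as shown here. Several choices are assumptions made for illustration: motion vectors are (dx, dy) tuples, the median is taken per component, the result is divided by the resolution reduction factor of two, and rounding to the motion vector precision of the target standard is omitted. The function name is hypothetical.

```python
from statistics import mean, median


def map_4_to_1(mvs, method="mean", factor=2):
    """Map four full-resolution motion vectors to one reduced-resolution vector."""
    reducer = mean if method == "mean" else median
    dx = reducer([mv[0] for mv in mvs])  # combine horizontal components
    dy = reducer([mv[1] for mv in mvs])  # combine vertical components
    return (dx / factor, dy / factor)    # scale to the reduced resolution


# One macroblock in the group has an outlying horizontal component.
group = [(4, 2), (8, 2), (4, 6), (40, -2)]
print(map_4_to_1(group, "mean"))    # (7.0, 1.0)
print(map_4_to_1(group, "median"))  # (3.0, 1.0)
```

The example shows why a median may be preferred when one vector in the group differs strongly from the others: the average is pulled toward the outlier, while the median is not.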
However, certain compression standards, such as MPEG-4 and H.263, support advanced prediction modes that allow one motion vector per 8×8 block. In this case, each motion vector is mapped from a 16×16 macroblock in the original resolution to an 8×8 block in the reduced resolution macroblock. This is referred to as a 1:1 mapping.
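For comparison, a 1:1 mapping can be sketched as below. The sketch assumes (dx, dy) tuples and a reduction factor of two, and it omits rounding to the motion vector precision of the target standard; the function name is hypothetical.

```python
def map_1_to_1(macroblock_mvs, factor=2):
    """Scale each 16x16 macroblock vector into the vector of one 8x8 block."""
    return [(dx / factor, dy / factor) for dx, dy in macroblock_mvs]


# Four macroblock vectors become four 8x8 block vectors of one new macroblock.
print(map_1_to_1([(4, 2), (8, -2), (0, 6), (-4, 0)]))
# [(2.0, 1.0), (4.0, -1.0), (0.0, 3.0), (-2.0, 0.0)]
```

Each input vector is preserved individually, so the new macroblock carries four motion vectors instead of one.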
FIG. 6 shows possible mappings 600 of motion vectors from a group of four 16×16 macroblocks 601 to either one 16×16 macroblock 602 or four 8×8 blocks 603. It is inefficient to always use the 1:1 mapping because more bits are used to code four motion vectors. Also, in general, the extension to field-based motion vectors for interlaced images is non-trivial. Given the down-sampled DCT coefficients and mapped motion vectors, the data are subject to variable length coding and the reduced resolution bitstream can be formed as is well known.
It is desired to provide a method for transcoding bitstreams that overcomes the problems of the prior art methods for spatial resolution reduction. Furthermore, it is desired to provide a balance between complexity and quality in the transcoder. It is also desired to compensate for drift, and to provide better up-sampling techniques during the transcoding.