Digital video systems have become increasingly important in the communication and broadcasting industries. The International Organization for Standardization (ISO) has established a series of standards to facilitate the standardisation of compression and transmission of digital video signals. One of these standards, ISO/IEC 13818-2, entitled “Generic Coding of Moving Pictures and Associated Audio Information” (or MPEG-2 for short, where “MPEG” stands for “Moving Picture Experts Group”), was developed in the mid-1990s. MPEG-2 has been used to encode digital video for a wide range of applications, including Standard Definition Television (SDTV) and High Definition Television (HDTV) systems.
A commonly used process in the ISO series of video coding standards is “motion estimation” (ME), whose objective is to exploit similarities between adjacent pictures, thus reducing the amount of information that needs to be encoded. Prior to performing ME, an encoder first sub-divides the current picture into a discrete set of non-overlapping regions known as coding units. In ME, the encoder examines each coding unit in turn and searches for a region in a previously encoded picture that best matches the current coding unit. Such a region forms the prediction block for the current coding unit. The encoder then computes the pixel-wise difference (which represents the prediction error) between the current coding unit and its prediction block. The encoder also generates a motion vector describing the spatial displacement between the current coding unit and its prediction block. Since a decoder typically reads the motion vector before the prediction error, the prediction error is commonly referred to as the motion residue associated with the motion vector.
The MPEG-2 standard defines three types of pictures that are used in a typical MPEG-2 encoding process. These picture types are referred to as the “I-picture”, the “P-picture”, and the “B-picture”. Digitized luminance and chroma components of video pixels are first input to the video encoder and stored in macroblock (MB) structures. Then, according to the selected picture type, Discrete Cosine Transform (DCT) and/or ME techniques are used at the MB level to exploit the spatial and temporal redundancy of the video signal, thereby achieving compression. The processes for encoding each of the three picture types are described in detail as follows.
The I-picture represents an intra-coded picture that can be reconstructed without referring to the data in other pictures. Luminance and chroma data of each MB unit in an I-picture are first transformed to the frequency domain using a block-based DCT, to exploit spatial redundancy that may be present in the I-picture. Then the high frequency coefficients of each DCT block in the MB unit are coarsely quantized according to the characteristics of the human visual system. The quantized DCT coefficients are further compressed using Run-Level Coding (RLC) and Variable Length Coding (VLC) before finally being output into the compressed video bit-stream.
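By way of illustration only, the intra-coding steps described above (block-based DCT, frequency-dependent quantization, and run-level coding) may be sketched as follows. This is a minimal sketch under stated assumptions and not the normative MPEG-2 procedure: the function names, the quantizer step rule (`base_step + slope * (u + v)`), and the uniform treatment of the DC coefficient are all hypothetical simplifications, and the final VLC stage is omitted.

```python
import math

N = 8  # DCT block size (8 x 8 pixels)


def dct_2d(block):
    """Naive orthonormal 2-D DCT-II of an N x N block (list of rows)."""
    def c(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency index
        for v in range(N):      # horizontal frequency index
            s = 0.0
            for y in range(N):
                for x in range(N):
                    s += (block[y][x]
                          * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * s
    return out


def quantize(coeffs, base_step=8, slope=2):
    """Quantize with a step that grows with frequency (illustrative rule only,
    not the MPEG-2 default quantization matrix)."""
    return [[int(round(coeffs[u][v] / (base_step + slope * (u + v))))
             for v in range(N)] for u in range(N)]


def zigzag_order():
    """(row, column) positions of an N x N block in zigzag scan order."""
    return sorted(((u, v) for u in range(N) for v in range(N)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else p[1]))


def run_level(quantized):
    """Convert a quantized block into (run of zeros, non-zero level) pairs.
    Trailing zeros are dropped, as an end-of-block code would imply them."""
    pairs, run = [], 0
    for u, v in zigzag_order():
        level = quantized[u][v]
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    return pairs
```

Because the quantizer step increases with the frequency indices, small high frequency coefficients are more likely to be rounded to zero, which lengthens the zero runs and shortens the list of (run, level) pairs that remain to be entropy-coded.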
Both the P-picture and the B-picture represent inter-coded pictures that are coded using motion compensation data based upon other pictures.
FIG. 1 illustrates an example of inter-coding of a P-picture 101. For an MB 104 which is to be inter-coded in the current picture 101, the ME technique is used to discover the temporal redundancy with respect to reference pictures. The term “reference pictures” refers to the pictures adjoining the current picture in temporal order, such as the “previous picture” 102 and the “next picture” 103 in FIG. 1. The ME technique discovers the temporal redundancy by searching in a search area 105 in the reference picture 102 to find a block which minimizes a difference criterion (such as mean square error) between itself and the MB 104 in the current picture 101. The block 106 in the reference picture 102 that minimises the aforementioned difference criterion over the search area 105 is referred to as “the best match block”. After locating the best match block 106, the displacements between the MB 104 in the current picture 101 and the best match block 106 in the reference picture 102 along the horizontal direction (X) and the vertical direction (Y) are determined to form a motion vector (MV) 107 which is associated with the MB 104. Then the pixel-wise difference (also referred to as “motion residue”) between the current MB 104 and its best match block 106 is spatially compressed using block-based DCT and scalar quantization. Finally, the motion vector and quantized motion residues generated by the above process are entropy-encoded using VLC to form the compressed video bit-stream.
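By way of illustration only, the search for the best match block described above may be sketched as an exhaustive (“full search”) block matching using mean square error as the difference criterion. This is a minimal sketch under stated assumptions: the function names, the default block size and search range, the integer-pixel search grid, and the sign convention of the motion vector (displacement from the position of the current MB to the position of the best match block) are hypothetical choices made for the example and are not mandated by MPEG-2.

```python
def mse(cur, ref, bx, by, rx, ry, size):
    """Mean square error between the size x size block at (bx, by) in the
    current picture and the block at (rx, ry) in the reference picture."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            d = cur[by + dy][bx + dx] - ref[ry + dy][rx + dx]
            total += d * d
    return total / (size * size)


def full_search(cur, ref, bx, by, size=16, search_range=7):
    """Exhaustively search a +/- search_range window in the reference picture.

    Pictures are lists of rows of pixel values; (bx, by) is the top-left
    corner of the current MB. Returns (motion_vector, residue, cost), where
    motion_vector is (mx, my) in whole pixels.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for my in range(-search_range, search_range + 1):
        for mx in range(-search_range, search_range + 1):
            rx, ry = bx + mx, by + my
            # Skip candidate blocks that fall outside the reference picture.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = mse(cur, ref, bx, by, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, mx, my)
    cost, mx, my = best
    # Motion residue: pixel-wise difference between the current MB and
    # its best match block.
    residue = [[cur[by + dy][bx + dx] - ref[by + my + dy][bx + mx + dx]
                for dx in range(size)] for dy in range(size)]
    return (mx, my), residue, cost
```

The nested loops evaluate every candidate position in the search area, each requiring one difference computation per pixel of the MB, which indicates why ME tends to dominate the computational cost of an encoder.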
A primary difference between a P-picture and a B-picture is that a B-picture accommodates temporal prediction from future reference pictures whilst a P-picture does not. The MB 104 in the P-picture 101 has only one MV 107, which corresponds to the best match block 106 in the previous (reference) picture 102. In contrast, an MB in a B-picture (also referred to as a “bidirectional-coded MB”) may have two MV values: one “forward MV”, which corresponds to the best match block in the previous picture (similar to the vector 107 in FIG. 1), and one “backward MV”, which corresponds to another best match block in the next picture (i.e., the vector 109 pointing to the block 108 in the reference picture 103). The motion residue of a bidirectional-coded MB is determined as an average of the motion residues produced by the forward MV and by the backward MV.
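By way of illustration only, the averaging of the forward and backward motion residues of a bidirectional-coded MB may be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the blocks are assumed to have already been located by ME, and floating-point averaging is used for clarity where a practical codec would apply integer rounding.

```python
def residue(cur_block, pred_block):
    """Pixel-wise difference between the current MB and one prediction block."""
    return [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(cur_block, pred_block)]


def bidirectional_residue(cur_block, fwd_block, bwd_block):
    """Average of the residues obtained with the forward and backward MVs.

    fwd_block is the best match block in the previous picture and bwd_block
    is the best match block in the next picture (both assumed given).
    """
    fwd = residue(cur_block, fwd_block)
    bwd = residue(cur_block, bwd_block)
    return [[(f + b) / 2 for f, b in zip(fwd_row, bwd_row)]
            for fwd_row, bwd_row in zip(fwd, bwd)]
```

Averaging the two residues in this way gives the same result as subtracting the average of the two prediction blocks from the current MB, since (c - f)/2 + (c - b)/2 = c - (f + b)/2 for each pixel.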
With the diversity of digital video applications, it is often necessary to convert a compressed MPEG-2 bit-stream from one resolution to another. Examples of such applications include conversion from HDTV to SDTV, and conversion from one bit-rate to a different bit-rate for re-transmission. In this description the input (having a first resolution) to a resolution conversion module is referred to as the “input stream” (or input compressed stream if appropriate), and the output (having a second resolution) from the resolution conversion module is referred to as the “scaled output stream” (or scaled compressed output stream if appropriate).
A straightforward solution for implementing the aforementioned resolution conversion applications is a “tandem transcoder”, in which a standard MPEG-2 decoder and a standard MPEG-2 encoder are cascaded together to provide the required resolution and bit-rate conversions. However, the process of fully decoding and subsequently re-encoding MPEG-2 compressed bit-streams demands heavy computational resources, particularly due to the computationally intensive ME operations in the standard MPEG-2 encoder. Therefore, the tandem transcoding approach is not considered to be an efficient solution for resolution or bit-rate conversion of compressed bit-streams.