Multimedia, which combines various types of content including text, audio and video, provides an outstanding business and revenue opportunity for network operators. The ready availability of high-speed networks and the use of packet-switched Internet Protocol (IP) technology have made it possible to transmit richer content that includes various combinations of text, voice, still and animated graphics, photos, video clips, and music. To exploit this market, network operators must meet customers' expectations regarding quality and reliability. Transcoding of media at the server level is crucial for rendering multimedia applications in today's heterogeneous networks composed of mobile terminals, cell phones, computers and other electronic devices. The adaptation and transcoding of media must be performed at the service provider level because individual devices are often resource constrained and are therefore not capable of adapting the media on their own. This is an important problem for service providers, as they will have to face very steep traffic growth in the next few years, growth that far exceeds the speedup obtainable from new hardware alone.
As discussed by A. Vetro, C. Christopoulos, and H. Sun, in Video transcoding architectures and techniques: An overview, IEEE Signal Processing Magazine, vol. 20, pp. 18-29, 2003, multimedia bit-streams often need to be converted from one form to another. Transcoding is the operation of modifying and adapting the content of a pre-compressed video bit-stream into another video bit-stream. Each bit-stream is characterized by a group of properties: the bit rate, the spatial resolution, the frame rate, and the compression format used to encode the video bit-stream. Such a group of properties represents a video format. The role of the transcoder is becoming important in daily life for maintaining a high level of interoperability in multimedia systems where each component might have its own features and capabilities. For example, end-user appliances include a diversity of devices such as PDAs, mobile phones, PCs, laptops and TVs. Moreover, the networks that connect those devices are heterogeneous and can be wired or wireless with different channel characteristics. Finally, a huge number of video services have come into use over the past few years, such as broadcasting, video streaming, TV on demand, Blu-ray and DVD. Thus, transcoding is a key technology for providing universal multimedia access (UMA).
In the past few decades, many video standards have been developed, especially by the International Organization for Standardization (ISO) (MPEG-1/2 and MPEG-4) and the International Telecommunication Union (ITU-T) (H.261/H.263). ISO/IEC MPEG and ITU-T VCEG have jointly developed H.264/AVC, a more recent codec that aims to achieve much higher compression than its predecessor standards while preserving the same visual quality. This is described by T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, Rate-constrained coder control and comparison of video coding standards, IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 688-703, 2003. Even though a great number of compression formats have been standardized to target different video applications and services, video standards have a large number of properties in common, as described in the publication by Wiegand, Schwarz, Joch, Kossentini and Sullivan. They all use block-based motion estimation/motion compensation (ME/MC) where the basic unit is a macroblock (MB) of 16×16 pixels. In addition, they share approximately the same building blocks: variable length decoding (VLD), variable length encoding (VLE), quantization (Q), inverse quantization (IQ), transforms such as the discrete cosine transform (DCT) or its enhanced successor in H.264/AVC, the integer discrete cosine transform, the inverse transform (IDCT), motion estimation (ME) and motion compensation (MC). Moreover, they all employ the concept of profiles and levels to target different classes of applications.
It is possible to use the cascaded architecture available in the prior art to perform bit rate reduction, temporal resolution adaptation (upscaling or downscaling), spatial resolution adaptation with compression format changing, logo insertion with spatial resolution adaptation, etc. The cascaded architecture makes it possible to calculate the best motion vectors (MVs) and MB modes for the generated stream because it performs a new motion estimation (ME) on the modified sequence of images in the pixel domain. Unfortunately, since the computationally intensive ME must be redone, the cascaded architecture is quite expensive in terms of computational complexity and is thus undesirable for real-time applications and commercial software.
To address this problem, novel video transcoding architectures have been proposed in the pixel domain (spatial domain) or DCT domain (frequency domain) to reduce this computational complexity while maintaining the highest possible quality of the re-encoded video, as described for example in the paper by A. Vetro, C. Christopoulos, and H. Sun. These architectures exploit, at the encoding stage, the information obtained at the decoding stage (MB modes, MVs, etc.) to perform bit rate adaptation, spatial resolution adaptation, frame rate adaptation, logo insertion and compression format changing. Nevertheless, the majority of these transcoders are standalone and address the implementation of a single use case, which is characterized by a set of transcoding operations and specifies the type of adaptation to be performed on the input image. Some existing works, including the publication by H. Sun, W. Kwok, and J. W. Zdepski, Architectures for MPEG compressed bitstream scaling, IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, pp. 191-199, 1996, have implemented bit rate reduction. The mapping of motion vectors (MVs) and MB modes was the major problem addressed in these works. Spatial resolution adaptation was presented in the publication by Vetro, Christopoulos and Sun, where re-sampling and the mapping of MVs and modes were the main issues addressed. The process of deriving MVs for the newly generated frames, in order to adapt the frame rate to the desired temporal resolution, was discussed by Y. Jeongnam, S. Ming-Ting, and L. Chia-Wen, in Motion vector refinement for high-performance transcoding, IEEE Transactions on Multimedia, vol. 1, pp. 30-40, 1999. For watermark/logo insertion, works such as the one presented by K. Panusopone, X. Chen, and F. Ling, in Logo insertion in MPEG transcoder, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2: Institute of Electrical and Electronics Engineers Inc., 2001, pp. 981-984, have considered the issue of making minimal changes to the macroblocks (MBs) where the logo is inserted (logo-MBs) while reusing the same motion vectors for the non-modified MBs. Transcoding from one compression format to another has been implemented in a number of works, including the research by T. Shanableh and M. Ghanbari, in Heterogeneous video transcoding to lower spatio-temporal resolutions and different compression formats, IEEE Transactions on Multimedia, vol. 2, pp. 101-110, 2000. These works have tried to address the issues of mode and MV mapping as well as syntax conversion.
Although many video transcoding architectures and algorithms have been proposed, only a few of the existing works have investigated the possibility of integrating the majority of the transcoding use cases in the same transcoder. N. Feamster and S. Wee, in An MPEG-2 to H.263 transcoder, Multimedia Systems and Applications II. vol. 3845: SPIE-Int. Soc. Opt. Eng, 1999, pp. 164-75, have presented a transcoder for compression format changing from MPEG-2 to H.263 with the possibility of performing spatio-temporal resolution reduction. However, the authors have used a simple algorithm for temporal resolution reduction that drops the B-frames of the input MPEG-2 stream. In the paper by Shanableh and Ghanbari, the authors have implemented a heterogeneous transcoder for compression format changing from MPEG-1/2 to lower bit rate H.261/H.263. Even though the algorithm for each use case was fairly detailed, a procedure for performing a combination of multiple use cases was missing. In the paper by X. Jun, S. Ming-Ting, and C. Kangwook, Motion Re-estimation for MPEG-2 to MPEG-4 Simple Profile Transcoding, Int. Packet Video Workshop Pittsburgh, 2002, the authors have proposed an architecture to perform transcoding from interlaced MPEG-2 to MPEG-4 simple profile with spatio-temporal resolution reduction. However, the work has reported only a limited set of experimental results for validating the proposed transcoder. Unfortunately, these few transcoders available in the prior art are limited to performing a specific group of use cases (spatio-temporal resolution reduction with compression format conversion) where the compression format conversion is performed between two specific standards (MPEG-2 to H.263 or MPEG-2 to MPEG-4). In addition, they lack flexibility: they can only perform the proposed group of use cases.
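The B-frame dropping approach mentioned above can be illustrated with a minimal sketch. It assumes the frame types of the input stream are available as a list of the letters I, P and B in display order; the function name and the sample group of pictures are hypothetical and are not taken from the cited paper. Dropping B-frames is simple because, in MPEG-2, no other frame is predicted from a B-frame.

```python
def drop_b_frames(frame_types, input_fps):
    """Keep only I- and P-frames; return kept indices and the average output frame rate.

    frame_types: list of "I", "P" or "B" in display order (assumed input layout).
    input_fps: frame rate of the input sequence.
    """
    if not frame_types:
        return [], 0.0
    kept = [i for i, t in enumerate(frame_types) if t != "B"]
    # The average frame rate scales with the fraction of frames that are kept.
    output_fps = input_fps * len(kept) / len(frame_types)
    return kept, output_fps


# Hypothetical 12-frame group of pictures with two B-frames between anchors.
gop = list("IBBPBBPBBPBB")
kept_indices, fps = drop_b_frames(gop, 30.0)
```

With this pattern only every third frame survives, so the achievable output frame rates are dictated by the structure of the input stream rather than by the requested temporal resolution, which is the limitation noted above.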
These proposed architectures from the prior art typically comprise units for decoding the compressed video stream, performing manipulations in the pixel domain including scaling and logo insertion, and re-encoding to meet the output requirements. This is the most straightforward approach to performing any video transcoding use case or set of use cases. The operations of such an architecture are explained with the help of FIG. 1(a). A flowchart for a typical method deployed in such an architecture is presented in FIG. 1(b). The input sequence of images, coded in an input format characterized by a predefined standard bit rate BR1, spatial resolution SR1, temporal resolution TR1 and compression format CF1, is presented to the input of decoder 102, which performs a full decoding in the pixel domain. The MB modes and MVs are stored. The output of the decoder 102 comprises frames, MB modes and MVs. The MVs and MB modes are fed to the input of the motion vector mapping and refinement unit 104, whereas the frames and MB modes are applied to the input of the spatio-temporal resolution adaptation unit 106, which produces frames and MB modes adapted to the desired spatio-temporal resolution. The MV mapping is done for spatio-temporal resolution reduction with compression format adaptation. The MVs are mapped to the output format to match the new spatio-temporal resolution and are then refined. The spatio-temporal resolution adaptation unit 106 reduces the frame rate to the predefined temporal resolution of the output format that corresponds to the sequence of output images. Each frame is downscaled to the new spatial resolution (often by a factor of 2). The MVs produced as an output of the motion vector mapping and refinement unit 104, as well as the output of the spatio-temporal resolution adaptation unit 106, are presented to the input of the re-encoding unit 108, where the final MB modes are determined and a new residue is calculated.
The residue is the difference between the original (input) MB and the predicted MB. It is computed as: R(x, y)=I(x, y)−P(x, y), 0≦x, y≦15, where I(x, y) and P(x, y) represent the pixel value at position (x,y) within the original and predicted MBs respectively. Typically, an encoder will select a set of MB mode(s) and MV(s) leading to a predicted MB minimizing a certain rate-distortion cost function.
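The residue formula above can be sketched in a few lines. The block below assumes an MB is stored as a 16×16 list of rows, so that the pixel at position (x, y) is `mb[y][x]`; this layout and all function names are illustrative choices. The sum of absolute differences (SAD) is used here only as a simple stand-in for the rate-distortion cost function, which the text does not specify.

```python
MB_SIZE = 16  # a macroblock is 16x16 pixels


def residue(original, predicted):
    """R(x, y) = I(x, y) - P(x, y) for 0 <= x, y <= 15."""
    return [[original[y][x] - predicted[y][x] for x in range(MB_SIZE)]
            for y in range(MB_SIZE)]


def sad(block):
    """Sum of absolute values of a residue block (stand-in cost, an assumption)."""
    return sum(abs(v) for row in block for v in row)


def best_prediction(original, candidates):
    """Return the label of the candidate predicted MB with the lowest stand-in cost.

    candidates: dict mapping a label (e.g. a mode/MV choice) to a predicted MB.
    """
    return min(candidates, key=lambda k: sad(residue(original, candidates[k])))


def flat_mb(value):
    """Helper for the example: an MB in which every pixel has the same value."""
    return [[value] * MB_SIZE for _ in range(MB_SIZE)]


original_mb = flat_mb(100)
candidate_mbs = {"a": flat_mb(98), "b": flat_mb(90)}
res_a = residue(original_mb, candidate_mbs["a"])
choice = best_prediction(original_mb, candidate_mbs)
```

A real encoder would weigh the distortion against the number of bits needed to code the residue, the MB mode and the MVs; the SAD comparison only shows where that decision sits relative to the residue computation.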
The re-encoding unit 108 produces the output video encoded in an output format characterized by a predefined standard bit rate BR2, spatial resolution SR2, temporal resolution TR2 and compression format CF2.
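The relationship between the input format (BR1, SR1, TR1, CF1) and the output format (BR2, SR2, TR2, CF2) can be pictured with a small data structure. The sketch below is only illustrative: the class name, field names, units and sample values are assumptions, and the function simply lists which of the four properties differ between the two formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoFormat:
    """The group of properties that characterizes a bit-stream (names are hypothetical)."""
    bit_rate_kbps: int       # BR
    width: int               # SR, horizontal
    height: int              # SR, vertical
    frame_rate: float        # TR
    compression_format: str  # CF


def required_adaptations(src, dst):
    """List the adaptations needed to go from format src to format dst."""
    ops = []
    if src.compression_format != dst.compression_format:
        ops.append("compression format change")
    if (src.width, src.height) != (dst.width, dst.height):
        ops.append("spatial resolution adaptation")
    if src.frame_rate != dst.frame_rate:
        ops.append("temporal resolution adaptation")
    if src.bit_rate_kbps != dst.bit_rate_kbps:
        ops.append("bit rate adaptation")
    return ops


# Sample values chosen only for illustration.
input_format = VideoFormat(4000, 704, 576, 30.0, "MPEG-2")
output_format = VideoFormat(384, 352, 288, 15.0, "H.263")
needed = required_adaptations(input_format, output_format)
```

In this example all four properties change, which corresponds to spatio-temporal resolution reduction combined with compression format conversion and bit rate reduction.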
A typical method used in prior art for performing a transcoding use case is explained with the help of flowchart 150 presented in FIG. 1(b). Upon start (box 152), procedure 150 inputs the compressed video presented in the form of a sequence of input images (box 154). Each image is then decoded in the next step (box 156). The MVs and MB modes obtained after the decoding are stored (box 158). The procedure 150 then performs the mapping of the MVs and the MB modes for the predefined set of operations corresponding to a set of one or more use cases to be performed (box 160).
For each output MB, all candidate MVs are checked (box 162). In the next step, the procedure 150 re-encodes the image (box 164). After the re-encoding operation the procedure checks whether more images need to be processed (box 166). If so, the procedure exits ‘YES’ from box 166, gets the next image (box 168) and loops back to the input of box 154. Otherwise the procedure exits ‘NO’ from box 166, produces the sequence of output images that correspond to the output video (box 170) and exits (box 172). Note that this procedure used in prior art can perform only a single predefined set of operations and does not have the flexibility to handle any arbitrary combination of transcoding use cases.
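The control flow of flowchart 150 can be summarized in a short sketch. Every function passed in below is a hypothetical placeholder for the corresponding box; the toy stand-ins exist only to make the loop executable and do not reflect how a real decoder, mapper or encoder works. The halving of MVs in the toy mapping step assumes a downscale by a factor of 2.

```python
def transcode(input_images, decode, map_mvs_and_modes, check_candidates, re_encode):
    """Per-image loop of flowchart 150, with each stage supplied by the caller."""
    output_images = []
    for image in input_images:                        # boxes 154, 166, 168
        frame, mvs, modes = decode(image)             # box 156
        stored = (mvs, modes)                         # box 158
        candidates = map_mvs_and_modes(*stored)       # box 160
        chosen = check_candidates(frame, candidates)  # box 162
        output_images.append(re_encode(frame, chosen))  # box 164
    return output_images                              # box 170


# Toy stand-ins (assumptions for illustration only).
def toy_decode(image):
    return image["pixels"], image["mvs"], image["modes"]


def toy_map(mvs, modes):
    # Halve each MV, as one might for a downscale by a factor of 2.
    return [(mvx / 2, mvy / 2) for mvx, mvy in mvs], modes


def toy_check(frame, candidates):
    # A real transcoder would evaluate every candidate MV; here they are kept as-is.
    return candidates


def toy_re_encode(frame, chosen):
    mvs, modes = chosen
    return {"pixels": frame, "mvs": mvs, "modes": modes}


stream = [{"pixels": "frame0", "mvs": [(8, -4)], "modes": ["inter"]}]
result = transcode(stream, toy_decode, toy_map, toy_check, toy_re_encode)
```

Because the mapping step is fixed when the loop is set up, the sketch makes the limitation visible: the same predefined set of operations is applied to every image.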
Even though these transcoding architectures from the prior art exploit, at the encoding stage, the information obtained at the decoding stage (modes, MVs, etc.), most of them address a single use case. For real-life systems, however, it is highly undesirable to have a customized transcoding system for each transcoding use case. Such an approach would lead to prohibitively high software development and maintenance costs.
Therefore, there is a strong requirement for developing an architecture supporting several arbitrary transcoding use cases in the same computationally efficient transcoder.