This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
Multimedia applications include local playback, streaming or on-demand, conversational and broadcast/multicast services. Interoperability (IOP) among these services is important for the fast deployment and large-scale market formation of each multimedia application. To achieve high IOP, different standards are specified.
Typical audio and video coding standards specify profiles and levels. A profile is a subset of algorithmic features of the standard. A level is a set of limits on the coding parameters that imposes a set of constraints on decoder resource consumption. The profile and level can be used to signal properties of a media stream, as well as to signal the capability of a media decoder. Each pair of profile and level forms an “interoperability point.”
Through the combination of a profile and a level, a decoder can declare, without actually attempting the decoding process, whether it is capable of decoding a stream. If a decoder that is not capable of decoding the stream nevertheless attempts to decode it, the decoder may crash, operate slower than real-time, and/or discard data due to buffer overflows.
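The capability check described above can be illustrated with a minimal sketch. The names `InteropPoint` and `can_decode`, the string profile names and the integer level values are hypothetical and chosen only for illustration. The sketch assumes that a decoder supporting a given level of a profile also supports all lower levels of that profile, which is a simplification of how actual standards define conformance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteropPoint:
    """A (profile, level) pair; the level is held as an integer for comparison."""
    profile: str
    level: int


def can_decode(decoder_caps: dict, stream: InteropPoint) -> bool:
    """Return True if the declared capabilities cover the stream.

    decoder_caps maps each supported profile name to the highest level the
    decoder supports for that profile (an assumed representation).
    """
    max_level = decoder_caps.get(stream.profile)
    return max_level is not None and stream.level <= max_level


caps = {"Baseline": 30, "Main": 31}
print(can_decode(caps, InteropPoint("Baseline", 21)))
print(can_decode(caps, InteropPoint("High", 30)))
```

The decision is made from the signaled values alone, without touching the coded data, which is the point of signaling an interoperability point.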
Technologies involved in multimedia applications include, among others, media coding, storage and transmission. Media types include speech, audio, image, video, graphics, timed text, etc. Although the description contained herein is applicable to all media types, video is described herein as an example.
Different standards have been specified for different technologies. Video coding standards include ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual and ITU-T H.264 (also known as ISO/IEC MPEG-4 Advanced Video Coding (AVC) or, in short, H.264/AVC). In addition, efforts are currently underway to develop new video coding standards. One such standard under development is the scalable video coding (SVC) standard, which will become the scalable extension to the H.264/AVC standard. Another such standard under development is the multi-view video coding (MVC) standard, which will become another extension to the H.264/AVC standard. The latest draft of the SVC standard, the Joint Draft 8.0, is available in JVT-U201, “Joint Draft 8 of SVC Amendment”, 21st JVT meeting, Hangzhou, China, October 2006. The latest joint draft of the MVC standard is available in JVT-U209, “Joint Draft 1.0 on Multiview Video Coding”, 21st JVT meeting, Hangzhou, China, October 2006. The latest draft of the video model of the MVC standard is described in JVT-U207, “Joint Multiview Video Model (JMVM) 2.0”, 21st JVT meeting, Hangzhou, China, October 2006. The contents of all of these documents are incorporated herein by reference in their entireties.
Scalable media is typically ordered into hierarchical layers of data. A base layer contains an individual representation of a coded media stream such as a video sequence. Enhancement layers contain refinement data relative to previous layers in the layer hierarchy. The quality of the decoded media stream progressively improves as enhancement layers are added to the base layer. An enhancement layer enhances the temporal resolution (i.e., the frame rate), the spatial resolution, or simply the quality of the video content represented by another layer or part thereof. Each layer, together with all of its dependent layers, is one representation of the video signal at a certain spatial resolution, temporal resolution and quality level. Therefore, the term “scalable layer representation” is used herein to describe a scalable layer together with all of its dependent layers. The portion of a scalable bitstream corresponding to a scalable layer representation can be extracted and decoded to produce a representation of the original signal at a certain fidelity.
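As an illustration of the notion of a scalable layer representation, the following sketch collects a target layer together with every layer it depends on, directly or indirectly, and keeps only the data units belonging to those layers. The function names, the integer layer identifiers, the `depends_on` mapping and the `(layer_id, payload)` tuples are all hypothetical; an actual SVC bitstream signals layer dependencies through its own syntax elements.

```python
def layer_representation(layer: int, depends_on: dict) -> set:
    """Return the target layer plus all layers it transitively depends on."""
    needed = set()
    stack = [layer]
    while stack:
        current = stack.pop()
        if current in needed:
            continue
        needed.add(current)
        stack.extend(depends_on.get(current, ()))
    return needed


def extract(units: list, target: int, depends_on: dict) -> list:
    """Keep only the units of the scalable layer representation, in order."""
    keep = layer_representation(target, depends_on)
    return [unit for unit in units if unit[0] in keep]


# Hypothetical hierarchy: layer 1 refines layer 0, layer 2 refines layer 1,
# and layer 3 refines layer 0 along a different dimension.
deps = {0: [], 1: [0], 2: [1], 3: [0]}
stream = [(0, b"a"), (1, b"b"), (3, b"c"), (2, b"d")]
print(extract(stream, 2, deps))
```

The extracted list is a subset of the input list with the original order preserved, so no data is re-encoded.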
An encoded bitstream according to H.264/AVC or its extensions, e.g. SVC and MVC, is either a Network Abstraction Layer (NAL) unit stream or a byte stream formed by prefixing a start code to each NAL unit in a NAL unit stream. A NAL unit stream is simply a concatenation of a number of NAL units. A NAL unit consists of a NAL unit header and a NAL unit payload. The NAL unit header contains, among other items, the NAL unit type indicating whether the NAL unit contains a coded slice, a coded slice data partition, a sequence or picture parameter set, and so on. The video coding layer (VCL) contains the signal processing functionality of the codec, i.e., mechanisms such as transform, quantization, motion-compensated prediction, loop filtering and inter-layer prediction. A coded picture of a base or enhancement layer consists of one or more slices. The NAL encapsulates each slice generated by the VCL into one or more NAL units.
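The relationship between a byte stream and its NAL units can be sketched as follows. The sketch splits a byte stream at the three-byte start code prefix 0x000001 and reads the one-byte H.264/AVC NAL unit header, whose fields are forbidden_zero_bit (1 bit), nal_ref_idc (2 bits) and nal_unit_type (5 bits). The function names and the sample bytes are illustrative only, and the sketch does not remove emulation prevention bytes from the payload.

```python
START_CODE = b"\x00\x00\x01"


def split_byte_stream(data: bytes) -> list:
    """Split a start-code-prefixed byte stream into NAL units."""
    units = []
    pos = data.find(START_CODE)
    while pos != -1:
        start = pos + len(START_CODE)
        nxt = data.find(START_CODE, start)
        end = len(data) if nxt == -1 else nxt
        # Zero bytes before the next start code are padding, not payload.
        unit = data[start:end].rstrip(b"\x00")
        if unit:
            units.append(unit)
        pos = nxt
    return units


def parse_nal_header(unit: bytes) -> dict:
    """Decode the first byte of an H.264/AVC NAL unit."""
    first = unit[0]
    return {
        "forbidden_zero_bit": first >> 7,
        "nal_ref_idc": (first >> 5) & 0x03,
        "nal_unit_type": first & 0x1F,
    }


# Made-up sample: a sequence parameter set (type 7), a picture parameter
# set (type 8) and a coded slice of an IDR picture (type 5).
sample = (b"\x00\x00\x00\x01\x67\x42\x00\x1e"
          b"\x00\x00\x01\x68\xce"
          b"\x00\x00\x01\x65\x88")
for nal in split_byte_stream(sample):
    print(parse_nal_header(nal)["nal_unit_type"], nal.hex())
```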
Coded video bitstreams may include extra information to enhance the use of the video for a wide variety of purposes. For example, supplemental enhancement information (SEI) and video usability information (VUI), as defined in H.264/AVC, provide such functionality. The H.264/AVC standard and its extensions support SEI signaling through SEI messages. SEI messages are not required by the decoding process to generate correct sample values in output pictures. Rather, they are helpful for other purposes, e.g., error resilience and display. H.264/AVC contains the syntax and semantics for the specified SEI messages, but no process for handling the messages in the recipient is defined. Consequently, encoders are required to follow the H.264/AVC standard when they create SEI messages, and decoders conforming to the H.264/AVC standard are not required to process SEI messages for output order conformance. One of the reasons to include the syntax and semantics of SEI messages in H.264/AVC is to allow system specifications, such as 3GPP multimedia specifications and DVB specifications, to interpret the supplemental information identically and hence interoperate. It is intended that system specifications can require the use of particular SEI messages at both the encoding end and the decoding end, and the process for handling SEI messages in the recipient may be specified for the application in a system specification.
Available media file format standards include ISO file format (ISO/IEC 14496-12), MPEG-4 file format (ISO/IEC 14496-14), AVC file format (ISO/IEC 14496-15) and 3GPP file format (3GPP TS 26.244). The SVC file format, the file format standard for storage of SVC video, is currently under development, as an extension to the AVC file format. The latest SVC file format draft is available in MPEG document N8663.
3GPP TS 26.140 specifies the media types, formats and codecs for the multimedia messaging services (MMS) within the 3GPP system. 3GPP TS 26.234 specifies the protocols and codecs for the packet-switched streaming services (PSS) within the 3GPP system. The ongoing 3GPP TS 26.346 specifies the protocols and codecs for multimedia broadcast/multicast services (MBMS) within the 3GPP system.
Available video coding standards specify buffering models and buffering parameters for the bitstreams. Such buffering models are referred to as the Hypothetical Reference Decoder (HRD) or the Video Buffering Verifier (VBV). A standard-compliant bitstream must comply with the buffering model with a set of buffering parameters specified in the corresponding standard. Such buffering parameters for a bitstream may be explicitly or implicitly signaled. “Implicitly signaled” means that the default buffering parameter values according to the profile and level apply. The HRD/VBV parameters are primarily used to impose constraints on the bit rate variations of compliant bitstreams.
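The way a buffering model constrains bit rate variations can be illustrated with a heavily simplified leaky-bucket sketch. It is not the HRD or VBV of any particular standard. It assumes that bits arrive at a constant rate without interruption, that one coded picture is removed instantaneously at every frame interval, and that the first removal happens after an initial delay. The function name and all parameter names are hypothetical.

```python
def conforms(picture_bits, bit_rate, buffer_size, initial_delay, frame_rate):
    """Check a sequence of coded picture sizes against a leaky-bucket model.

    picture_bits:  coded size of each picture, in bits
    bit_rate:      constant arrival rate, in bits per second
    buffer_size:   capacity of the coded picture buffer, in bits
    initial_delay: seconds of buffering before the first removal
    frame_rate:    pictures removed per second
    """
    fullness = bit_rate * initial_delay
    for size in picture_bits:
        if fullness > buffer_size:
            return False  # overflow: more bits arrived than the buffer holds
        if size > fullness:
            return False  # underflow: picture not fully received in time
        fullness -= size
        fullness += bit_rate / frame_rate  # arrivals until the next removal
    return True


sizes = [4000, 1000, 1000, 1000]
print(conforms(sizes, bit_rate=30000, buffer_size=8000,
               initial_delay=0.25, frame_rate=30))
```

Under these assumptions, a large picture is tolerated only if enough bits were buffered beforehand, and a run of small pictures is tolerated only if the buffer is large enough to absorb the surplus, which is how the model bounds bit rate variation.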
United States Application Publication No. 2005/0254575, filed May 12, 2004 and incorporated by reference herein in its entirety, describes a process for the signaling of multiple operation points for scalable media streams. According to the process described in this publication, an operation point, such as any subset of profile, level, and HRD/VBV parameters, can be associated with any valid subset of layers of the scalable media stream. Profile and level information, among others, for scalable layers is included in the scalability information supplemental enhancement information (SEI) message of the SVC specification (JVT-U201, the contents of which are incorporated herein by reference in their entirety).
For a scalable bitstream, each scalable layer, together with the layers on which the scalable layer depends, can be extracted as a subset of the scalable bitstream. Transcoding, as defined below, is not needed to extract a scalable layer and the layers on which it depends. In transcoding, at least part of the coded media stream resulting from the transcoding process is not a subset of the original coded media stream input to the transcoding process. Extraction from a scalable media stream is not classified as transcoding, because the stream resulting from the extraction process is a subset of the original stream.
The following is a discussion of a number of representative transcoding use cases. One function of media-aware network elements (MANEs) is to ensure that the recipient of the media contents is capable of decoding and presenting the media contents. MANEs include devices such as gateways, multipoint conference units (MCUs), Real-Time Transport Protocol (RTP) translators, RTP mixers, multimedia messaging centers (MMSCs), push-to-talk over cellular (PoC) servers, Internet Protocol (IP) encapsulators in digital video broadcasting-handheld (DVB-H) systems, or set-top boxes that forward broadcast transmissions locally to home wireless networks, for example. In order to guarantee successful decoding and presentation, a MANE may have to convert the input media stream to a format that complies with the capabilities of the recipient. Transcoding the media stream is one way of converting the media stream. In another situation, because a device may not be capable of decoding an input media stream in real-time, the input media stream is transcoded more slowly than real-time, e.g. as a background operation. The transcoded stream can then be decoded and played in real-time.
The coding format for a media stream (coding format A, e.g. a scalable bitstream) can be transcoded to another coding format (coding format B, e.g. a non-scalable bitstream). Coding format B may be preferable in some environments due to a larger number of devices supporting coding format B compared to the number of devices supporting coding format A. Hence, transcoding of coding format A to coding format B may be carried out in the originator or sender of the media stream of coding format A. For example, a bitstream coded with SVC can be transcoded to a plain H.264/AVC bitstream. The number of H.264/AVC devices exceeds the number of SVC devices. Therefore, in some applications, the transcoding of SVC streams to H.264/AVC may be preferred to support a larger number of receiver devices.
A straightforward, yet also highly computationally intensive, transcoding method involves fully decoding the bitstream and then re-encoding the decoded sequence. There are also many transcoding technologies that operate in the transform domain rather than in the pixel domain, in which the most straightforward method operates. Video transcoding technologies are discussed in detail in A. Vetro, C. Christopoulos, and H. Sun, “Video Transcoding Architectures and Techniques: An Overview,” IEEE Signal Processing Magazine, vol. 20, no. 2, pp. 18-29, March 2003, the contents of which are incorporated herein by reference in their entirety.
In addition to more traditional transcoding technologies, some lightweight transcoding of SVC or MVC bitstreams to H.264/AVC bitstreams is possible, because SVC and MVC are H.264/AVC extensions and many of the coding tools are similar. One example of the lightweight transcoding of an SVC bitstream, with certain restrictions, to an H.264/AVC bitstream has been shown in JVT-U043, the contents of which are incorporated herein by reference. This method is referred to as the first lightweight transcoding method.
Another example of lightweight transcoding is described as follows. The base layer of SVC streams can be decoded by H.264/AVC decoders even when the enhancement layers are also fed to the H.264/AVC decoders. This is achieved by using, for SVC enhancement layer data, Network Abstraction Layer (NAL) unit types that are reserved in the H.264/AVC standard and are therefore ignored by H.264/AVC decoders. SVC streams can sometimes contain more than one independently coded layer, i.e. a layer that is not inter-layer predicted from any other layer. However, only one of these layers can be coded as an H.264/AVC compatible base layer in order to maintain the backward compatibility of the SVC standard with the H.264/AVC standard and decoders. The latest SVC design permits an independent SVC layer to be converted to an H.264/AVC bitstream with modifications to the NAL unit header only. The modifications comprise removing the SVC NAL unit header extension bytes and changing the value of the syntax element nal_unit_type as follows. If the original nal_unit_type value is equal to 20, then it is changed to 1. If the original nal_unit_type value is equal to 21, then it is changed to 5. This method is referred to as the second lightweight transcoding method.
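The NAL unit header modification of the second lightweight transcoding method can be sketched as follows. The mapping of nal_unit_type 20 to 1 and 21 to 5 is taken from the description above. The number of SVC NAL unit header extension bytes is not stated here, so the parameter `ext_len` and its default value of 3 are assumptions made only for illustration and should be set according to the applicable SVC draft. The function name is hypothetical, and the sketch does not re-check emulation prevention bytes after the removal.

```python
# Mapping of SVC nal_unit_type values to H.264/AVC values, as described above.
SVC_TO_AVC_TYPE = {20: 1, 21: 5}


def rewrite_nal_unit(unit: bytes, ext_len: int = 3) -> bytes:
    """Convert one NAL unit of an independent SVC layer to H.264/AVC form.

    ext_len is the assumed length, in bytes, of the SVC NAL unit header
    extension that follows the first header byte.
    """
    header = unit[0]
    nal_type = header & 0x1F
    if nal_type not in SVC_TO_AVC_TYPE:
        return unit  # not an SVC slice NAL unit; leave it untouched
    # Keep forbidden_zero_bit and nal_ref_idc, replace only nal_unit_type.
    new_header = (header & 0xE0) | SVC_TO_AVC_TYPE[nal_type]
    return bytes([new_header]) + unit[1 + ext_len:]


# Made-up unit: header 0x74 (nal_ref_idc 3, nal_unit_type 20), three
# placeholder extension bytes, then two payload bytes.
svc_unit = bytes([0x74, 0xAA, 0xBB, 0xCC, 0x11, 0x22])
print(rewrite_nal_unit(svc_unit).hex())
```

Because only header bytes are touched and the slice payload is copied unchanged, the cost per NAL unit is a few byte operations, which is what makes this kind of transcoding lightweight.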
For both of the above lightweight transcoding methods, the parameter sets (both sequence parameter sets (SPSs) and picture parameter sets (PPSs)) that are not referred to by the target layer or by the required lower layers that have been transcoded should be discarded, while the SPSs that are referred to by the target layer and the required lower layers must be changed accordingly. For example, the profile and level information (i.e. the first three bytes of the SPS) must be changed to contain the corresponding information of the transcoded bitstream, and the SPS SVC extension (seq_parameter_set_svc_extension( )), if present, must be removed. In addition, if there are SEI messages that are contained in scalable nesting SEI messages in the original SVC bitstream for the target layer, those SEI messages must then appear in the transcoded bitstream in their original forms, i.e., not contained in scalable nesting SEI messages. For the first lightweight transcoding method, modifications to the NAL unit header (the same as for the second lightweight transcoding method) and the slice header are also required.
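The profile and level update described above can be sketched as follows. In the H.264/AVC SPS syntax, the three bytes that follow the one-byte NAL unit header carry profile_idc, the constraint flags and level_idc, in that order. The function name and the sample values are hypothetical, and the sketch covers only this one change: it does not remove an SPS SVC extension, which would require parsing the remainder of the SPS.

```python
from typing import Optional

SPS_NAL_UNIT_TYPE = 7


def patch_sps_profile_level(sps_unit: bytes, profile_idc: int, level_idc: int,
                            constraint_flags: Optional[int] = None) -> bytes:
    """Overwrite the profile and level bytes of an SPS NAL unit.

    sps_unit is a whole NAL unit: byte 0 is the NAL unit header, bytes 1-3
    are profile_idc, the constraint flags byte and level_idc.
    """
    if len(sps_unit) < 4 or sps_unit[0] & 0x1F != SPS_NAL_UNIT_TYPE:
        raise ValueError("not a sequence parameter set NAL unit")
    patched = bytearray(sps_unit)
    patched[1] = profile_idc
    if constraint_flags is not None:
        patched[2] = constraint_flags
    patched[3] = level_idc
    return bytes(patched)


# Made-up SPS whose profile byte (0x53) is a placeholder value; it is
# rewritten to profile_idc 66 (Baseline) and level_idc 21 (level 2.1).
original = bytes([0x67, 0x53, 0x00, 0x1E, 0xAB])
print(patch_sps_profile_level(original, profile_idc=66, level_idc=21).hex())
```

Choosing the values to write is the hard part: the sketch takes them as inputs because the interoperability point of the transcoded bitstream has to be known from somewhere.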
As in the case of SVC, the base view of any MVC stream is compatible with the H.264/AVC standard and can be decoded with H.264/AVC decoders, as MVC NAL units use only those NAL unit types that were reserved in the H.264/AVC standard. However, there might be multiple independent views, i.e. views that are not inter-view predicted from any other view, in a single MVC stream. These independent views could be converted to an H.264/AVC stream with modifications to the NAL unit header only. It is noted that the independent MVC views may also be compliant with SVC other than H.264/AVC.
Currently, it cannot be determined whether a media stream is encoded in such a manner that, when it is transcoded by a certain transcoding process, the resulting bitstream complies with a certain interoperability point. The only available way of determining the interoperability point of a transcoded stream is to run the transcoded stream through a stream verifier, such as HRD/VBV, which returns the interoperability point of the stream. This is computationally costly and requires the presence of a verifier coupled with a transcoder. For lightweight transcoding methods and some other low-complexity transcoding methods, e.g. some transform-domain methods, the complexity of the transcoder itself would be much lower than that of the verifier. Furthermore, when a recipient has requested a stream conforming to a certain interoperability point, transcoding and transmission of the transcoded stream may not be performed simultaneously if the additional stream verifier is running at the same time.
Joint Video Team document JVT-U044 (incorporated herein by reference in its entirety) proposes an addition to the scalability information SEI message signaling the average and maximum bitrates resulting from a transcoding operation. However, these pieces of information are not sufficient for a decoder implementation to determine whether it can decode the transcoded stream in real-time.
There is therefore a need for an improved system and method of signaling the IOP information of the transcoded bitstreams for low-complexity transcoding methods.