This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
Multimedia applications include local playback, streaming or on-demand, conversational and broadcast/multicast services. Technologies involved in multimedia applications include, for example, media coding, storage and transmission. Media types include speech, audio, image, video, graphics and timed text. Different standards have been specified for different technologies.
There are a number of video coding standards including ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual and ITU-T H.264 or ISO/IEC MPEG-4 AVC. H.264/AVC is the work output of a Joint Video Team (JVT) of ITU-T Video Coding Experts Group (VCEG) and ISO/IEC MPEG. There are also proprietary solutions for video coding (e.g. VC-1, also known as SMPTE standard 421M, based on Microsoft's Windows Media Video version 9), as well as national standardization initiatives, for example the AVS codec by the Audio and Video Coding Standard Workgroup in China. Some of these standards already specify a scalable extension, e.g. MPEG-2 Visual and MPEG-4 Visual. For H.264/AVC, the scalable video coding extension SVC, sometimes also referred to as the SVC standard, is currently under development.
The latest draft of the SVC is described in JVT-T201, “Joint Draft 7 of SVC Amendment,” 20th JVT Meeting, Klagenfurt, Austria, July 2006, available from http://ftp3.itu.ch/av-arch/jvt-site/2006_07_Klagenfurt/JVT-T201.zip.
SVC can provide scalable video bitstreams. A portion of a scalable video bitstream can be extracted and decoded with a degraded playback visual quality. A scalable video bitstream contains a non-scalable base layer and one or more enhancement layers. An enhancement layer may enhance the temporal resolution (i.e. the frame rate), the spatial resolution, or simply the quality of the video content represented by the lower layer or part thereof. In some cases, data of an enhancement layer can be truncated after a certain location, even at arbitrary positions, and each truncation position can include some additional data representing increasingly enhanced visual quality. Such scalability is referred to as fine-grained (granularity) scalability (FGS). In contrast to FGS, the scalability provided by a quality enhancement layer that does not provide fine-grained scalability is referred to as coarse-grained scalability (CGS). Base layers can be designed to be FGS scalable as well.
The mechanism for providing temporal scalability in the latest SVC specification is referred to as the “hierarchical B pictures” coding structure. This feature is fully supported by Advanced Video Coding (AVC), and the signaling portion can be performed by using sub-sequence-related supplemental enhancement information (SEI) messages.
For mechanisms to provide spatial and CGS scalabilities, a conventional layered coding technique similar to that used in earlier standards is used with some new inter-layer prediction methods. Data that could be inter-layer predicted includes intra texture, motion and residual data. Single-loop decoding is enabled by a constrained intra texture prediction mode, whereby the inter-layer intra texture prediction can be applied to macroblocks (MBs) for which the corresponding block of the base layer is located inside intra MBs. At the same time, those intra MBs in the base layer use constrained intra prediction. In single-loop decoding, the decoder needs to perform motion compensation and full picture reconstruction only for the scalable layer desired for playback (called the desired layer). For this reason, the decoding complexity is greatly reduced. All of the layers other than the desired layer do not need to be fully decoded because all or part of the data of the MBs not used for inter-layer prediction (be it inter-layer intra texture prediction, inter-layer motion prediction or inter-layer residual prediction) are not needed for reconstruction of the desired layer.
The spatial scalability has been generalized to enable the base layer to be a cropped and zoomed version of the enhancement layer. The quantization and entropy coding modules were adjusted to provide FGS capability. The coding mode is referred to as progressive refinement, wherein successive refinements of the transform coefficients are encoded by repeatedly decreasing the quantization step size and applying a “cyclical” entropy coding akin to sub-bitplane coding.
The scalable layer structure in the current draft SVC standard is characterized by three variables, referred to as temporal_level, dependency_id and quality_level, that are signaled in the bit stream or can be derived according to the specification. temporal_level is used to indicate the temporal layer hierarchy or frame rate. A layer comprising pictures of a smaller temporal_level value has a smaller frame rate than a layer comprising pictures of a larger temporal_level. dependency_id is used to indicate the inter-layer coding dependency hierarchy. At any temporal location, a picture of a smaller dependency_id value may be used for inter-layer prediction for coding of a picture with a larger dependency_id value. quality_level is used to indicate FGS layer hierarchy. At any temporal location and with identical dependency_id value, an FGS picture with quality_level value equal to QL uses the FGS picture or base quality picture (i.e., the non-FGS picture when QL−1=0) with quality_level value equal to QL−1 for inter-layer prediction.
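The role of the three variables can be illustrated with a minimal Python sketch. It assumes each NAL unit has already been parsed into a hypothetical NalUnit record (this record and the function name are illustrative, not part of the draft), and it applies a simplified rule that keeps every unit at or below the requested value in each dimension. An actual extractor would have to follow the dependency rules of the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NalUnit:
    # Hypothetical record; field names mirror the variables in the draft.
    temporal_level: int
    dependency_id: int
    quality_level: int


def extract_operating_point(nal_units, max_temporal, max_dependency, max_quality):
    """Keep NAL units at or below the requested value in every dimension.

    Simplified assumption: a plain upper bound per dimension is sufficient.
    """
    return [
        n for n in nal_units
        if n.temporal_level <= max_temporal
        and n.dependency_id <= max_dependency
        and n.quality_level <= max_quality
    ]


stream = [
    NalUnit(0, 0, 0),  # base layer, lowest frame rate
    NalUnit(1, 0, 0),  # temporal enhancement
    NalUnit(0, 1, 0),  # spatial/CGS enhancement
    NalUnit(0, 0, 1),  # FGS enhancement of the base layer
]
base_only = extract_operating_point(stream, 0, 0, 0)
full_rate_base = extract_operating_point(stream, 1, 0, 0)
```

Here base_only keeps only the first unit, while full_rate_base additionally keeps the temporal enhancement unit.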
In single-loop decoding of scalable video including at least two CGS or spatial scalable layers, only a portion of a coded picture in a lower layer is used for prediction of the corresponding coded picture in a higher layer (i.e. for inter-layer prediction). Therefore, if a sender knows the scalable layer desired for playback in the receivers, the bitrate used for transmission could be reduced by omitting those portions that are not used for inter-layer prediction and not in any of the scalable layers desired for playback. It should be noted that, in the case of a multicast or broadcast, where different clients may desire different layers for playback, these layers are called desired layers.
The bitstream format of SVC includes signaling of simple_priority_id in each network abstraction layer (NAL) unit header of SVC. This enables signaling of one adaptation path for the SVC bitstream. In addition, the adaptation of SVC bitstreams can be done along dependency_id, quality_level, and temporal_level or any combination of these and simple_priority_id. However, simple_priority_id is capable of representing only one partition of SVC bitstreams to adaptation paths. Other adaptation partitions, based upon different optimization criteria, could be equally well computed, but no means to associate these adaptation partitions with the SVC bitstream exists.
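As an illustration of adaptation along the single path signaled by simple_priority_id, the following Python sketch drops every NAL unit whose priority value exceeds a threshold. The representation of a NAL unit as a (simple_priority_id, payload) pair and the function name are assumptions made for this example only.

```python
def adapt_by_priority(nal_units, threshold):
    """Keep NAL units whose simple_priority_id does not exceed the threshold.

    nal_units: iterable of (simple_priority_id, payload) pairs (assumed layout).
    """
    return [(pid, payload) for pid, payload in nal_units if pid <= threshold]


units = [(0, b"base"), (1, b"enh1"), (2, b"enh2"), (1, b"enh1b")]
adapted = adapt_by_priority(units, 1)
```

Because each NAL unit carries exactly one simple_priority_id value, a filter of this kind can follow only one ordering of the units, which is the limitation noted above.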
The file format is an important element in the chain of multimedia content production, manipulation, transmission and consumption. There is a difference between the coding format and the file format. The coding format relates to the action of a specific coding algorithm that codes the content information into a bitstream. The file format refers to organizing the generated bitstream in such a way that it can be accessed for local decoding and playback, transferred as a file, or streamed, all utilizing a variety of storage and transport architectures. Further, the file format can facilitate the interchange and editing of the media. For example, many streaming applications require a pre-encoded bitstream on a server to be accompanied by metadata (stored in “hint-tracks”) that assists the server to stream the media to a client. Examples of hint-track metadata include timing information, indication of synchronization points, and packetization hints. This information is used to reduce the operational load of the server and to maximize the end-user experience.
Available media file format standards include the ISO base media file format (ISO/IEC 14496-12), the MPEG-4 file format (ISO/IEC 14496-14), the AVC file format (ISO/IEC 14496-15) and the 3GPP file format (3GPP TS 26.244). There is also a project in MPEG for development of the SVC file format, which will become an amendment to the AVC file format. The MPEG-4, AVC, 3GPP, and SVC file formats are all derivatives of the ISO base media file format, i.e. they share the same basic syntax structure. Consequently, they are largely compatible with each other.
The ISO base media file format is an object-oriented file format, where the data is encapsulated into structures called ‘boxes’. In all derivative file formats of the ISO base media file format, the media data is stored in a media data box MDAT and the metadata is stored in a movie box MOOV. The media data comprises the actual media samples. It may comprise, for example, interleaved, time-ordered video and audio frames. Each media stream has its own metadata box TRAK in the MOOV box that describes the media content properties. Additional boxes in the MOOV box may comprise information about file properties, file content, etc.
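The box structure can be illustrated with a short Python sketch that walks the boxes in a byte buffer. It relies on the general box header layout of the ISO base media file format: a 32-bit big-endian size followed by a four-character type, where a size of 1 means a 64-bit size follows and a size of 0 means the box extends to the end of its container. In files, the four-character codes are written in lower case (moov, trak, mdat). The helper names iter_boxes and make_box are hypothetical, and the sample buffer is synthetic.

```python
import struct


def iter_boxes(data, offset=0, end=None):
    """Yield (type, payload_start, payload_end) for each box in data[offset:end]."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A 64-bit "largesize" field follows the type field.
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # The box extends to the end of the enclosing container.
            size = end - offset
        if size < header or offset + size > end:
            raise ValueError("malformed box at offset %d" % offset)
        yield box_type.decode("ascii"), offset + header, offset + size
        offset += size


def make_box(box_type, payload=b""):
    """Build a minimal box with a 32-bit size field (test helper)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic file: a movie box holding one empty track box, then media data.
sample = make_box(b"moov", make_box(b"trak")) + make_box(b"mdat", b"\x00" * 4)
top_level = [(t, start, stop) for t, start, stop in iter_boxes(sample)]
moov_start, moov_stop = top_level[0][1], top_level[0][2]
children = [t for t, _, _ in iter_boxes(sample, moov_start, moov_stop)]
```

Calling iter_boxes again on the payload range of a container box, as done for the movie box here, is how the nested track boxes are reached.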
The SVC file format is becoming an extension to the AVC file format. The SVC file format handles the storage, extraction and scalability provisioning of the scalable video stream efficiently. The size of a file containing a scalable bit stream should be as small as possible, while still allowing for lightweight extraction of NAL units belonging to different layers. This requires avoiding redundant storage of multiple representations of the media data and an efficient representation of metadata. There are two primary mechanisms utilized to organize an SVC file. First, a grouping concept, i.e., the sample group structure in the ISO base media file format, can be used to indicate the relation of pictures and scalable layers. Second, several tracks referencing subsets of the bitstream can be defined, each corresponding to a particular combination of scalability layers that form a playback point.
FIG. 1 depicts how the SVC media data is stored in a file. Each access unit comprises one sample. A number of samples form a chunk. Practical content normally comprises many chunks. File readers typically read and process one chunk at a time. If the layering structure desired for playback does not require all of the access units (for temporal scalability) and/or all of the pictures in each required access unit (for other types of scalability), then the unwanted access units and/or pictures can be discarded. It is most efficient to perform a discarding operation at the picture level. However, because each sample comprises one access unit, a sample-level grouping is not optimal. On the other hand, if each picture were defined as one sample, then the definition of each sample being the media data corresponding to a certain presentation time in the ISO base media file format would be broken.
In the latest draft SVC file format, the word ‘tier’ is used to describe a layer. Each NAL unit is associated with a group ID, and a number of group ID values are mapped to a tier, identified by a tier ID. This way, given a tier ID, the associated NAL units can be found. The scalability information, including bitrate, spatial resolution, frame rate, and so on, of each tier is signalled in the data structure ScalableTierEntry( ).
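The two-step association from NAL unit to group ID to tier ID can be sketched in Python as follows. The list and dictionary used here are hypothetical in-memory stand-ins for the file format structures, whose actual syntax is not reproduced.

```python
from collections import defaultdict


def nal_units_by_tier(nal_group_ids, group_to_tier):
    """Return {tier_id: [NAL unit indices]}.

    nal_group_ids: group ID of each NAL unit, in bitstream order (assumed input).
    group_to_tier: mapping from group ID to tier ID (assumed input).
    """
    tiers = defaultdict(list)
    for nal_index, group_id in enumerate(nal_group_ids):
        tiers[group_to_tier[group_id]].append(nal_index)
    return dict(tiers)


nal_group_ids = [0, 1, 0, 2]
group_to_tier = {0: 0, 1: 1, 2: 1}
tier_index = nal_units_by_tier(nal_group_ids, group_to_tier)
```

With these inputs, tier 0 collects NAL units 0 and 2, and tier 1 collects NAL units 1 and 3, so a reader given a tier ID can locate its NAL units directly.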
Timed metadata tracks, introduced in Amendment 1 of the ISO base media file format, contain samples that describe associated media or hint tracks. Different sample formats for the timed metadata track can be specified, and the format used in the timed metadata track can be identified from the reference to a particular sample entry syntax, identified by a four-character code. The samples of the timed metadata track are associated with timestamps and are therefore associated with samples of the corresponding timestamp in the referred track.
Draft Amendment 2 of the ISO base media file format contains three main features to extend the ISO base media file format. First, it specifies structures that help in delivering files stored in the meta box of an ISO base media file over file delivery protocols such as ALC and FLUTE. In particular, the amendment provides functionality to store pre-computed FEC encodings of files and to define hint tracks with server instructions facilitating encapsulation of files into ALC/FLUTE packets. Second, Amendment 2 specifies a method to provide time-dependent information on target ratios between scalable or alternative streams that are supposed to share a common bandwidth resource. This information is referred to as the combined rate scaling information. Third, the amendment also specifies how to include additional meta boxes that carry alternative and/or complementary information to a meta box in a file.
The combined rate scaling information in draft ISO base media file format Amendment 2 is based on two fundamental assumptions:
1. It is assumed that the total bitrate of a channel through which combined media (e.g., audio and video) should be conveyed is limited to a certain constant, or is a piece-wise constant function of time. However, rather than indicating an optimal audio-video bitrate share for a certain total bitrate, certain applications would benefit from an indication of an adaptation path resulting in stable audio-visual quality or experience. For example, if statistical multiplexing is used in broadcast applications, then the bitrate of an individual audiovisual service is allowed to vary in order to maintain a stable quality. At the same time, the total bitrate across all audiovisual services for a multiplex should remain unchanged. Traditionally, rate share information to maintain a stable quality cannot be indicated.
2. Only the target bitrate share between tracks is given. However, no hints or “cookbook” instructions as to how to obtain the indicated target bitrate share by adaptation are given. Consequently, since there are many possibilities for adapting scalable media, e.g., frame rate scaling or quality scaling, the result of the adaptation process in different implementations can greatly differ. Therefore, the value of the combined rate scaling information of the draft ISO base media file format Amendment 2 is diminished.
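The second assumption can be made concrete with a small Python sketch. It assumes, for illustration only, that the target share is expressed as a relative weight per track; the actual syntax of the draft amendment is not reproduced, and the function name is hypothetical.

```python
def target_bitrates(total_bps, shares):
    """Split a total channel bitrate according to relative target shares.

    shares: mapping from track name to relative weight (assumed representation).
    """
    total_share = sum(shares.values())
    return {track: total_bps * s / total_share for track, s in shares.items()}


targets = target_bitrates(500_000, {"audio": 20, "video": 80})
```

The result states only how many bits per second each track should receive. Whether the video track reaches its target by lowering the frame rate or by lowering the quality is left open, which is why different implementations can produce very different results.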
As described above, SVC utilizes single-loop decoding, i.e. reference pictures are decoded only for the highest decoded layer. Consequently, switching between layers at arbitrary locations is not possible, as the reference pictures of the layer to be switched have not been decoded. The presence of a layer switching point can be concluded from SVC NAL unit headers, but no mechanism exists in conventional systems to indicate switching points in the SVC file format structures. Furthermore, a coded video sequence remains valid if SVC NAL units above a certain threshold simple_priority_id are removed. However, no guarantee as to stream validity is given if the simple_priority_id threshold is changed in the middle of a coded video sequence (i.e. between IDR access units).
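One conservative way to respect the last constraint is to apply a requested change of the simple_priority_id threshold only at the next IDR access unit. The Python sketch below illustrates this strategy under stated assumptions: access units are represented as (is_idr, NAL unit list) pairs, NAL units as (simple_priority_id, payload) pairs, and all names are hypothetical. It is not a mechanism defined by the SVC file format.

```python
def filter_with_deferred_threshold(access_units, initial_threshold, requests):
    """Filter NAL units by priority, changing the threshold only at IDR units.

    access_units: list of (is_idr, [(simple_priority_id, payload), ...]).
    requests: mapping from access-unit index to a newly requested threshold.
    """
    active = initial_threshold
    pending = None
    output = []
    for index, (is_idr, nal_units) in enumerate(access_units):
        if index in requests:
            pending = requests[index]
        if is_idr and pending is not None:
            # A new coded video sequence starts here, so switching is safe.
            active, pending = pending, None
        output.append([n for n in nal_units if n[0] <= active])
    return output


nals = [(0, "a"), (1, "b"), (2, "c")]
sequence = [(True, nals), (False, nals), (True, nals)]
filtered = filter_with_deferred_threshold(sequence, 2, {1: 0})
kept_counts = [len(units) for units in filtered]
```

In this example the request arrives at the second access unit, which is not an IDR access unit, so the lower threshold takes effect only from the third access unit onward.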