This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
Multimedia applications include local playback services, streaming or on-demand services, conversational services and broadcast/multicast services. Technologies involved in multimedia applications include, among others, media coding, storage and transmission. Different standards have been specified for different technologies.
Video coding standards include ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC). In addition, efforts are currently underway to develop new video coding standards. One such standard under development is the scalable video coding (SVC) standard, which will become the scalable extension to H.264/AVC.
Scalable video coding is a desirable feature for many multimedia applications and services used in systems employing decoders with a wide range of processing power, display size, connecting bandwidth, etc. Several types of video scalability have been proposed, such as temporal, spatial and quality scalability.
A portion of a scalable video bitstream can be extracted and decoded with a degraded playback visual quality. A scalable video bitstream contains a non-scalable base layer and one or more enhancement layers. An enhancement layer may enhance the temporal resolution (i.e., the frame rate), the spatial resolution, or simply the quality of the video content represented by a lower layer or part thereof.
In some cases, data in an enhancement layer can be truncated after a certain location, or even at arbitrary positions, where each truncation position may include additional data representing increasingly enhanced visual quality. Such scalability is referred to as fine-grained (granularity) scalability (FGS). The concept of FGS was first introduced in the MPEG-4 Visual standard and is also part of the SVC standard. In contrast to FGS, coarse-grained scalability (CGS) refers to the scalability provided by a quality enhancement layer that does not provide fine-grained scalability.
The latest draft specification of the SVC is described in JVT-S202, “Joint Scalable Video Model JSVM-6: Joint Draft 6 with proposed changes,” 19th Joint Video Team Meeting, Geneva, Switzerland, April 2006, incorporated herein by reference in its entirety.
SVC employs the mechanism already available in H.264/AVC for temporal scalability. This mechanism is known as a “hierarchical B pictures” coding structure. Therefore, the mechanism used in SVC is also fully supported by H.264/AVC, while signaling can be accomplished by using sub-sequence related supplemental enhancement information (SEI) messages.
For the mechanism that provides CGS scalability in the form of spatial and quality (SNR) scalability, a conventional layered coding technique is used. This technique is similar to techniques used in earlier standards with the exception of new inter-layer prediction methods. Data that could be inter-layer predicted includes intra texture, motion and residual data. Inter-layer motion prediction includes the prediction of block coding mode, header information, etc. In SVC, data can be predicted from layers other than the currently reconstructed layer or the next layer.
SVC includes a relatively new concept known as single-loop decoding. Single-loop decoding is enabled by using a constrained intra texture prediction mode, whereby the inter-layer intra texture prediction can be applied to macroblocks (MBs) for which the corresponding block of the base layer is located inside intra-MBs. At the same time, those intra-MBs in the base layer use the constrained intra prediction. In single-loop decoding, the decoder needs to perform motion compensation and full picture reconstruction only for the scalable layer desired for playback (referred to as the desired layer), thereby greatly reducing decoding complexity. All of the layers other than the desired layer do not need to be fully decoded because all or part of the data of the MBs not used for inter-layer prediction (whether it is inter-layer intra texture prediction, inter-layer motion prediction or inter-layer residual prediction) is not needed for reconstruction of the desired layer.
When compared to older video compression standards, SVC's spatial scalability has been generalized to enable the base layer to be a cropped and zoomed version of the enhancement layer. The quantization and entropy coding modules have also been adjusted to provide FGS capability. The FGS coding mode is referred to as progressive refinement, where successive refinements of the transform coefficients are encoded by repeatedly decreasing the quantization step size and applying a “cyclical” entropy coding akin to sub-bitplane coding.
The scalable layer structure in the current SVC draft is characterized by three variables. These variables are temporal_level, dependency_id and quality_level. The temporal_level variable is used to indicate the temporal scalability or frame rate. A layer comprising pictures of a smaller temporal_level value has a smaller frame rate than a layer comprising pictures of a larger temporal_level. The dependency_id variable is used to indicate the inter-layer coding dependency hierarchy. At any temporal location, a picture of a smaller dependency_id value may be used for inter-layer prediction for coding of a picture with a larger dependency_id value. The quality_level variable is used to indicate FGS layer hierarchy. At any temporal location, and with an identical dependency_id value, an FGS picture with a quality_level value equal to QL uses the FGS picture or base quality picture (i.e., the non-FGS picture when QL-1=0) with a quality_level value equal to QL-1 for inter-layer prediction.
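The role of the three variables can be illustrated with a minimal Python sketch. It is not taken from the SVC draft: the `Picture` record, its `time_index` field and the functions `fgs_reference` and `extract` are hypothetical names, and the extraction rule is a deliberately conservative simplification that keeps every quality level of the lower dependency layers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Picture:
    """One coded picture tagged with the three SVC layer variables."""
    time_index: int      # hypothetical index of the temporal location
    temporal_level: int
    dependency_id: int
    quality_level: int   # 0 = base quality (non-FGS) picture


def fgs_reference(pic):
    """Return (dependency_id, quality_level) of the picture that an FGS
    picture with quality_level QL predicts from, i.e. QL - 1 at the same
    dependency_id. Returns None for a base quality picture."""
    if pic.quality_level == 0:
        return None
    return (pic.dependency_id, pic.quality_level - 1)


def extract(pictures, target_t, target_d, target_q):
    """Keep the pictures needed for a target operating point.

    Simplifying assumption: pictures of lower dependency layers are kept
    at all quality levels; only the target dependency layer is limited
    to target_q.
    """
    kept = []
    for pic in pictures:
        # Higher temporal levels only raise the frame rate; higher
        # dependency layers are never referenced by lower ones.
        if pic.temporal_level > target_t or pic.dependency_id > target_d:
            continue
        # FGS pictures above the target quality are not needed.
        if pic.dependency_id == target_d and pic.quality_level > target_q:
            continue
        kept.append(pic)
    return kept
```

Under these assumptions, calling `extract` with a smaller `target_t` lowers the frame rate, while a smaller `target_d` or `target_q` drops enhancement pictures at each remaining temporal location.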
The file format is an important element in the chain of multimedia content production, manipulation, transmission and consumption. There is a difference between the coding format and the file format. The coding format relates to the action of a specific coding algorithm that codes the content information into a bitstream. The file format refers to organizing the generated bitstream in such a way that it can be accessed for local decoding and playback, transferred as a file, or streamed, all utilizing a variety of storage and transport architectures. Further, the file format can facilitate interchange and editing of the media. For example, many streaming applications require a pre-encoded bitstream on a server to be accompanied by metadata, stored in the “hint-tracks,” that assists the server in streaming the video to the client. Examples of hint-track metadata include timing information, indication of synchronization points, and packetization hints. This information is used to reduce the operational load of the server and to maximize the end-user experience.
Available media file format standards include the ISO base media file format (ISO/IEC 14496-12), the MPEG-4 file format (ISO/IEC 14496-14), the AVC file format (ISO/IEC 14496-15) and the 3GPP file format (3GPP TS 26.244). There is also a project in MPEG for development of the SVC file format, which will become an amendment to the AVC file format.
The SVC file format is being developed as an extension to the AVC file format. A major problem to be solved by the SVC file format is the efficient handling of the storage, extraction and scalability provisioning of the scalable video stream. A number of constraints are observed in the ongoing design phase. First, the size of the file containing a scalable bitstream should be as small as possible, while still allowing for lightweight extraction of NAL units belonging to different layers. This requires avoiding redundant storage of multiple representations of the media data and an efficient representation of metadata. Second, the server implementation needs to be sufficiently lightweight, which requires a metadata design that is not overly complex. Both of these aspects are closely related to the metadata structuring, which consequently has received close attention during the standardization. There are two primary mechanisms to organize an SVC file. First, the grouping concept, i.e., the sample group structure in the ISO base media file format, can be used to indicate the relation of pictures and scalable layers. Second, several tracks referencing subsets of the bitstream can be defined, each corresponding to a particular combination of scalability layers that form a playback point.
FIG. 1 depicts how the SVC media data is stored in a file. Each access unit comprises one sample. A number of samples form a chunk. Practical content normally comprises many chunks. File readers typically read and process one chunk at a time. If the layering structure desired for playback does not require all of the access units (for temporal scalability) and/or all of the pictures in each required access unit (for other types of scalability), then the unwanted access units and/or pictures can be discarded. It is most efficient to perform a discarding operation at the picture level. However, because each sample comprises one access unit, a sample-level grouping is not optimal. On the other hand, if each picture were defined as one sample, then the definition of a sample in the ISO base media file format, namely the media data corresponding to a certain presentation time, would be broken.
In the latest draft SVC file format, the word ‘tier’ is used to describe a layer. Each NAL unit is associated with a group ID, and a number of group ID values are mapped to a tier, identified by a tier ID. This way, given a tier ID, the associated NAL units can be found. The scalability information, including bitrate, spatial resolution, frame rate, and so on, of each tier is signaled in the data structure ScalableTierEntry( ).
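The lookup from a tier ID to its NAL units can be sketched as follows. This is an illustrative Python sketch only: the `(nal_id, group_id)` pair layout and the function name `build_tier_index` are assumptions made for the example and do not reproduce the syntax of the draft SVC file format or of ScalableTierEntry( ).

```python
from collections import defaultdict


def build_tier_index(nal_units, group_to_tier):
    """Map each tier ID to the NAL units whose group ID belongs to it.

    nal_units:     iterable of (nal_id, group_id) pairs (hypothetical layout)
    group_to_tier: dict mapping a group ID to its tier ID
    """
    index = defaultdict(list)
    for nal_id, group_id in nal_units:
        tier_id = group_to_tier.get(group_id)
        if tier_id is not None:  # skip NAL units whose group has no tier
            index[tier_id].append(nal_id)
    return dict(index)
```

Building the index once lets a reader answer repeated "which NAL units belong to this tier" queries without rescanning the group ID of every NAL unit.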
In SVC, region-of-interest (ROI) scalability is supported, i.e., the scalable stream could be encoded in a way that data of at least one rectangular sub-region, which is a subset of the entire region represented by a certain layer, can be independently decoded and displayed. Therefore, a user may request only the data for a ROI to be transmitted. Such a ROI is also referred to as a ROI scalable layer or scalable ROI layer.
One way to encode a ROI is to include the blocks covering a ROI into a set of one or more slices in the coded picture. When encoding the set of slices, the coded data is made independent of coded data of the blocks outside the corresponding ROI in any other coded picture. The set of slices may be included in a slice group that covers only the set of slices, or the set of slices may be included in a slice group covering more slices.
Interactive ROI (IROI) scalability involves an interaction between the user/receiver and the sender. For example, in streaming of pre-encoded content, a user may freely request different regions for display. To enable this feature, the video content should be encoded into multiple rectangular ROIs. This way, only the coded data of the ROIs covered by the requested region needs to be sent to the user.
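The selection of ROIs for a requested region can be sketched in Python as a rectangle-overlap test. The sketch assumes that "covered by the requested region" means any overlap with that region, and that each ROI is described by a pixel rectangle; the `Rect` type and the function names are hypothetical and are not part of SVC or of any file format.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


def overlaps(a, b):
    """True if rectangles a and b share at least one pixel."""
    return (a.x < b.x + b.width and b.x < a.x + a.width and
            a.y < b.y + b.height and b.y < a.y + a.height)


def rois_for_request(rois, requested):
    """Return the sorted IDs of the ROIs that overlap the requested region.

    rois: dict mapping a ROI ID to its Rect
    """
    return sorted(roi_id for roi_id, rect in rois.items()
                  if overlaps(rect, requested))


# Example: a 320x240 picture encoded as a 2x2 grid of 160x120 ROIs.
grid = {
    0: Rect(0, 0, 160, 120),
    1: Rect(160, 0, 160, 120),
    2: Rect(0, 120, 160, 120),
    3: Rect(160, 120, 160, 120),
}
needed = rois_for_request(grid, Rect(100, 50, 100, 50))  # ROIs 0 and 1
```

Only the coded data of the returned ROIs would then have to be sent; a request that lies inside a single ROI yields a single ID.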
To easily obtain the ROI scalability information and extract the required data for a client request, file format-level signaling of ROI information is needed. Without file format-level signaling of the ROI information, a file reader has to find and parse the ROI-related SEI messages (scalability information SEI messages, sub-picture scalable layer SEI messages, and motion-constrained slice group set SEI messages), and also parse the picture parameter sets and slice headers. Moreover, if the bitstream does not contain the ROI-related SEI messages, the file reader has to assume that there is no ROI support in the bitstream or, alternatively, apply an extensively complex analysis to check whether there is ROI support and, if it is established that the bitstream does support ROIs, apply a further extensively complex analysis to obtain the ROI information.
There is therefore a need for a method of file format-level signaling of ROI scalability information.