This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
There are a number of video coding standards, including ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual and ITU-T H.264 or ISO/IEC MPEG-4 AVC. H.264/AVC is the work output of a Joint Video Team (JVT) of the ITU-T Video Coding Experts Group (VCEG) and ISO/IEC MPEG. There are also proprietary solutions for video coding (e.g. VC-1, also known as SMPTE standard 421M, based on Microsoft's Windows Media Video version 9), as well as national standardization initiatives, for example the AVS codec by the Audio and Video Coding Standard Workgroup in China. Some of these standards already specify a scalable extension, e.g. MPEG-2 Visual and MPEG-4 Visual. For H.264/AVC, the scalable video coding extension SVC, sometimes also referred to as the SVC standard, is currently under development.
The latest draft of the SVC is described in JVT-T201, “Joint Draft 7 of SVC Amendment,” 20th JVT Meeting, Klagenfurt, Austria, July 2006, available from http://ftp3.itu.ch/av-arch/jvt-site/2006_07_Klagenfurt/JVT-T201.zip.
SVC can provide scalable video bitstreams. A portion of a scalable video bitstream can be extracted and decoded with a degraded playback visual quality. A scalable video bitstream contains a non-scalable base layer and one or more enhancement layers. An enhancement layer may enhance the temporal resolution (i.e. the frame rate), the spatial resolution, or simply the quality of the video content represented by the lower layer or part thereof. In some cases, data of an enhancement layer can be truncated after a certain location, even at arbitrary positions, and each truncation position can include some additional data representing increasingly enhanced visual quality. Such scalability is referred to as fine-grained (granularity) scalability (FGS). In contrast to FGS, the scalability provided by a quality enhancement layer that does not provide fine-grained scalability is referred to as coarse-grained scalability (CGS). Base layers can be designed to be FGS scalable as well.
The mechanism for providing temporal scalability in the latest SVC specification is referred to as the “hierarchical B pictures” coding structure. This feature is fully supported by Advanced Video Coding (AVC), and the signaling portion can be performed by using sub-sequence-related supplemental enhancement information (SEI) messages.
For mechanisms to provide spatial and CGS scalabilities, a conventional layered coding technique similar to that used in earlier standards is used with some new inter-layer prediction methods. Data that could be inter-layer predicted includes intra texture, motion and residual data. Single-loop decoding is enabled by a constrained intra texture prediction mode, whereby the inter-layer intra texture prediction can be applied to macroblocks (MBs) for which the corresponding block of the base layer is located inside intra MBs. At the same time, those intra MBs in the base layer use constrained intra prediction. In single-loop decoding, the decoder needs to perform motion compensation and full picture reconstruction only for the scalable layer desired for playback (called the desired layer). For this reason, the decoding complexity is greatly reduced. All of the layers other than the desired layer do not need to be fully decoded because all or part of the data of the MBs not used for inter-layer prediction (be it inter-layer intra texture prediction, inter-layer motion prediction or inter-layer residual prediction) are not needed for reconstruction of the desired layer.
The spatial scalability has been generalized to enable the base layer to be a cropped and zoomed version of the enhancement layer. The quantization and entropy coding modules were adjusted to provide FGS capability. The coding mode is referred to as progressive refinement, wherein successive refinements of the transform coefficients are encoded by repeatedly decreasing the quantization step size and applying a “cyclical” entropy coding akin to sub-bitplane coding.
The scalable layer structure in the current draft SVC standard is characterized by three variables, referred to as temporal_level, dependency_id and quality_level, that are signaled in the bitstream or can be derived according to the specification. temporal_level is used to indicate the temporal layer hierarchy or frame rate. A layer comprising pictures of a smaller temporal_level value has a smaller frame rate than a layer comprising pictures of a larger temporal_level. dependency_id is used to indicate the inter-layer coding dependency hierarchy. At any temporal location, a picture of a smaller dependency_id value may be used for inter-layer prediction for coding of a picture with a larger dependency_id value. quality_level is used to indicate the FGS layer hierarchy. At any temporal location and with identical dependency_id value, an FGS picture with quality_level value equal to QL uses the FGS picture or base quality picture (i.e., the non-FGS picture when QL-1=0) with quality_level value equal to QL-1 for inter-layer prediction.
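By way of illustration only, the three variables described above can be used to select the part of a bitstream belonging to a given operating point. The following Python sketch is not taken from the SVC specification: the NalUnit class and the extract_layers function are hypothetical names, and the rule that lower dependency layers are kept at all of their quality levels is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NalUnit:
    """Hypothetical, simplified stand-in for the layer identifiers of a NAL unit."""
    temporal_level: int
    dependency_id: int
    quality_level: int


def extract_layers(nal_units, max_temporal, max_dependency, max_quality):
    """Keep the NAL units at or below the requested operating point.

    Assumption (not from the specification): a lower dependency layer is kept
    at all of its quality levels, while the target dependency layer is limited
    to quality_level <= max_quality.
    """
    kept = []
    for nal in nal_units:
        if nal.temporal_level > max_temporal:
            continue  # belongs to a higher frame rate than requested
        if nal.dependency_id > max_dependency:
            continue  # belongs to a layer above the desired layer
        if nal.dependency_id == max_dependency and nal.quality_level > max_quality:
            continue  # FGS refinement beyond the requested quality
        kept.append(nal)
    return kept
```

For example, with three temporal levels, two dependency layers and two quality levels, requesting temporal_level 1, dependency_id 1 and quality_level 0 keeps half of the twelve possible identifier combinations.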
In single-loop decoding of scalable video including at least two CGS or spatial scalable layers, only a portion of a coded picture in a lower layer is used for prediction of the corresponding coded picture in a higher layer (i.e. for inter-layer prediction). Therefore, if a sender knows the scalable layer desired for playback in the receivers, the bitrate used for transmission could be reduced by omitting those portions that are not used for inter-layer prediction and not in any of the scalable layers desired for playback. It should be noted that, in the case of a multicast or broadcast, different clients may desire different layers for playback; all of these layers are called desired layers.
The Joint Video Team (JVT) is currently working on the development of the SVC standard. The JVT-R050r1 (“Discardable bits and Multi-layer RD estimation for Single loop decoding,” 18th Meeting: Bangkok, Thailand, 14-20 Jan., 2006, available from http://ftp3.itu.ch/av-arch/jvt-site/2006_01_Bangkok/JVT-R050.zip) and JVT-R064 (“Selective Inter-layer Prediction,” 18th Meeting: Bangkok, Thailand, 14-20 Jan., 2006, available from http://ftp3.itu.ch/av-arch/jvt-site/2006_01_Bangkok/JVT-R064.zip) contributions previously attempted to utilize “unneeded data” to improve the performance of SVC in certain application scenarios. JVT-R050r1 briefly proposed that discardable residuals be coded in a separate Network Abstraction Layer (NAL) unit or slice with the NAL discardable_flag set, where the discardable_flag indicates that a NAL unit is not required for decoding upper layers. However, only residual data is mentioned, and it was not specified how to encode those “discardable” residuals into a separate NAL unit or slice. According to the current SVC design, this is impossible unless those MBs having residual data not required for inter-layer prediction are consecutive in raster scan order, which is not likely. JVT-R064 proposed to force all of the MBs to not be used for inter-layer prediction for a set of pictures (i.e., each coded as one slice) in certain layers of high temporal levels. A frame-based selective inter-layer prediction method has been proposed in JVT-S051 (“Frame Based Selective Inter-layer Prediction,” 19th Meeting: Geneva, CH, 31 Mar.-7 Apr., 2006, available from http://ftp3.itu.ch/av-arch/jvt-site/2006_04_Geneva/JVT-S051.zip), wherein for certain pictures (each coded as one slice), all of the MBs of the pictures are forced not to be used for inter-layer prediction. The selection of the certain pictures is modeled as a knapsack problem and solved using dynamic programming.
U.S. Provisional Patent Application 60/786,496 to Applicant and JVT-S039 (“On discardable lower layer adaptations,” 19th Meeting: Geneva, CH, 31 Mar.-7 Apr., 2006, available from http://ftp3.itu.ch/av-arch/jvt-site/2006_04_Geneva/JVT-S039.zip), hereinafter incorporated in their entirety, proposed using slice groups and/or data partitioning to separate data needed for inter-layer prediction (non-discardable data) and data not needed for inter-layer prediction (discardable data), such that the discardable data can be discarded to avoid unnecessary transmission and/or decoding.
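For readers unfamiliar with the knapsack formulation mentioned above in connection with JVT-S051, the following Python sketch shows a generic 0/1 knapsack solved by dynamic programming. It does not reproduce the cost model of JVT-S051. The function name select_pictures and the notions of a per-picture "saving" (bits that become discardable) and an integer "penalty" (coding efficiency lost in the higher layer) are assumptions made purely for illustration.

```python
def select_pictures(savings, penalties, budget):
    """Generic 0/1 knapsack solved by dynamic programming.

    savings[i]   : hypothetical benefit of disabling inter-layer prediction
                   for picture i
    penalties[i] : hypothetical non-negative integer cost of doing so
    budget       : maximum total penalty that is tolerated

    Returns (best_total_saving, indices_of_selected_pictures).
    """
    n = len(savings)
    # best[i][b]: best saving using only the first i pictures with budget b
    best = [[0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s, p = savings[i - 1], penalties[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # picture i-1 not selected
            if p <= b and best[i - 1][b - p] + s > best[i][b]:
                best[i][b] = best[i - 1][b - p] + s  # picture i-1 selected

    # Walk back through the table to recover which pictures were selected.
    chosen = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= penalties[i - 1]
    chosen.reverse()
    return best[n][budget], chosen
```

With savings of 60, 100 and 120, penalties of 1, 2 and 3, and a budget of 5, the sketch selects the second and third pictures for a total saving of 220.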
In SVC, if spatial scalability is provided, a high layer MB can exploit inter-layer prediction using scaled base layer motion data when either a base_mode_flag or a base_mode_refinement_flag is equal to 1. In this scenario, a high layer MB is reconstructed with default motion data deduced from the base layer. For example, if a base layer is of QCIF size and an enhancement layer is of CIF size, the motion vector of one block in the base layer will be scaled by 2 and upsampled to 2×2 motion vectors for the four co-located blocks in the enhancement layer.
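The QCIF-to-CIF example above can be sketched in Python as follows. This is a minimal illustration under the assumption of a dyadic (2:1) resolution ratio and a motion field stored as rows of (mvx, mvy) tuples, one per block; the function name upsample_motion_field is hypothetical and the sketch ignores reference indices and partitioning.

```python
def upsample_motion_field(base_mvs):
    """Scale each base layer motion vector by 2 and replicate it to 2x2 blocks.

    base_mvs is a list of rows, each row a list of (mvx, mvy) tuples with one
    entry per base layer block. The result has twice as many rows and columns.
    """
    enhancement = []
    for row in base_mvs:
        scaled_row = []
        for mvx, mvy in row:
            scaled = (2 * mvx, 2 * mvy)  # vector length doubles with resolution
            scaled_row.extend([scaled, scaled])  # two co-located blocks per row
        enhancement.append(scaled_row)
        enhancement.append(list(scaled_row))  # and the row directly below
    return enhancement
```

A single base layer vector (1, -2) thus becomes the vector (2, -4) in each of the four co-located enhancement layer blocks.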
For inter-layer texture prediction, if the spatial resolution of the enhancement and base layer pictures is different, an interpolation filter is needed to upsample the base layer. Before applying the interpolation filter, the intra MBs of the base layer are extended by a 4-sample border in each direction using a border extension process. Before performing the border extension, a deblocking filter is applied to all boundaries inside an intra MB or between the intra MBs.
In inter-layer residual prediction of SVC, if a previous layer represents a layer with half the spatial resolution of the current layer, the residual signal is upsampled using a separable bi-linear filter before it is used as a prediction signal. For inter-layer spatial resolution ratios different from 1 and 2, the interpolation process is based on a quarter-pel interpolation process as specified in AVC.
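To illustrate what a separable bi-linear filter does in the half-resolution case described above, the following Python sketch upsamples a block of integer residual values by 2, first along rows and then along columns. It is a simplified sketch only: the sample phase, the rounding and the edge handling (repeating the last sample) are assumptions and do not reproduce the normative SVC filter. The function names are hypothetical.

```python
def upsample_1d(samples):
    """Double the length of a 1-D list by linear interpolation.

    Even output positions copy the input sample; odd positions take the
    rounded average of the sample and its right neighbour. The last sample
    is repeated at the edge (an assumed edge rule).
    """
    n = len(samples)
    out = []
    for i in range(n):
        right = samples[min(i + 1, n - 1)]
        out.append(samples[i])
        out.append((samples[i] + right + 1) >> 1)
    return out


def upsample_2d(block):
    """Apply the 1-D filter along rows and then along columns (separable)."""
    rows = [upsample_1d(row) for row in block]
    cols = [upsample_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]
```

Because the filter is separable, the two-dimensional result is obtained from two passes of the same one-dimensional routine, which is the property the term "separable" refers to.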
Assuming two layers exist, i.e., a lower layer and a higher layer, in current SVC, it is possible to mark a coded slice NAL unit in the lower layer as discardable, meaning that the discardable slice need not be present when decoding the higher layer. Therefore, the higher layer decoding must not depend on any data conveyed in the discardable slice, even if the discardable slice is present. This requirement can be met when each picture is coded as one slice, where the base_id_plus1 of the higher layer slice/picture above a discardable lower layer slice/picture is set to 0. However, when a lower layer picture is coded into more than one slice and some of the slices are discardable while others are not, problems arise in ensuring that the above requirement is met:
A first problem arises when a slice in a higher layer picture covers regions covered by both discardable and non-discardable slices in the lower layer. For each of the MBs covering regions covered by discardable slices in the lower layer, all the instances of the syntax elements base_mode_flag, base_mode_refinement_flag, intra_base_flag, motion_prediction_flag_l0[ ], motion_prediction_flag_l1[ ], and residual_prediction_flag must be set equal to 0. However, these syntax elements are still transmitted in the bitstream, which results in reduced coding efficiency as compared to a case in which these syntax elements are not transmitted for the MBs.
A second problem arises when the higher layer is a spatial scalable layer. The decoding process involves upsampling processes applied to the samples or residual values of lower layer pictures before those values are used for inter-layer prediction. However, the upsampling result may become unpredictable for those MBs neighboring the discardable MBs due to the uninitialized values of the discardable MBs. Consequently, it is difficult to ensure that the decoding result is correct.
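The second problem can be illustrated with a small Python sketch. It assumes a simple one-dimensional linear interpolation (not the normative SVC filter) and uses a NaN value as a hypothetical stand-in for the uninitialized samples of a discardable MB. The function names are likewise hypothetical.

```python
import math

UNDEFINED = float("nan")  # stands in for an uninitialized sample value


def upsample_1d(samples):
    """Double the length of a row by linear interpolation (assumed filter)."""
    n = len(samples)
    out = []
    for i in range(n):
        right = samples[min(i + 1, n - 1)]
        out.append(samples[i])
        out.append((samples[i] + right) / 2)
    return out


def contaminated_positions(samples):
    """List upsampled positions that are undefined although the co-located
    lower layer sample (position k // 2) was itself well defined."""
    upsampled = upsample_1d(samples)
    return [
        k
        for k, value in enumerate(upsampled)
        if math.isnan(value) and not math.isnan(samples[k // 2])
    ]


# Two valid samples followed by two samples from a discardable region.
base_row = [10.0, 20.0, UNDEFINED, UNDEFINED]
print(contaminated_positions(base_row))
```

In this sketch the upsampled position 3, which is co-located with the valid lower layer sample 20.0, becomes undefined because the filter reads the neighboring uninitialized sample. This is the kind of leakage across the boundary of a discardable region that the paragraph above describes.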