This section is intended to provide a background or context to the invention recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
The present invention relates to scalable video encoding and decoding. In particular, the present invention relates to providing an enhanced reference picture management solution for scalable video coding.
Video coding standards include ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC). In addition, there are currently efforts underway with regard to the development of new video coding standards. One such standard under development is the scalable video coding (SVC) standard, which will become the scalable extension to H.264/AVC. Another such effort involves the development of China video coding standards. A draft of the SVC standard is available as: Joint Video Team, “Joint Draft 5: Scalable Video Coding”, Jan. 2006, available from http://ftp3.itu.ch/av-arch/jvt-site/2006_01_Bangkok/JVT-R201.zip.
SVC can provide scalable video bitstreams. A portion of a scalable video bitstream can be extracted and decoded with a degraded playback visual quality. A scalable video bitstream contains a non-scalable base layer and one or more enhancement layers. An enhancement layer may enhance the temporal resolution (i.e. the frame rate), the spatial resolution, or simply the quality of the video content represented by the lower layer or part thereof. In some cases, data of an enhancement layer can be truncated after a certain location, even at arbitrary positions, and each truncation position can include some additional data representing increasingly enhanced visual quality. Such scalability is referred to as fine-grained (granularity) scalability (FGS). In contrast to FGS, the scalability provided by a quality enhancement layer that does not provide fine-grained scalability is referred to as coarse-grained scalability (CGS). Base layers can be designed to be FGS scalable as well; however, no current video compression standard or draft standard implements this concept.
The mechanism that provides temporal scalability in the current SVC specification, herein referred to as the hierarchical B pictures coding structure, offers nothing beyond what is already in AVC. This feature is fully supported by AVC, and the signalling can be done using the sub-sequence related supplemental enhancement information (SEI) messages.
To provide spatial and CGS scalability, the conventional layered coding technique of earlier standards is used together with inter-layer prediction methods. Data that can be inter-layer predicted include intra texture, motion and residual data. Single-loop decoding is enabled by a constrained intra texture prediction mode, whereby the inter-layer intra texture prediction can be applied to macroblocks (MBs) for which the corresponding block of the base layer is located inside intra MBs, and at the same time those intra MBs in the base layer use constrained intra prediction. In single-loop decoding, the decoder needs to perform motion compensation and full picture reconstruction only for the scalable layer desired for playback (called the desired layer), hence the decoding complexity is greatly reduced. All the layers other than the desired layer do not need to be fully decoded because all or part of the data of the MBs not used for inter-layer prediction (be it inter-layer intra texture prediction, inter-layer motion prediction or inter-layer residual prediction) is not needed for reconstruction of the desired layer.
The spatial scalability has been generalized to enable the base layer to be a cropped and zoomed version of the enhancement layer. The quantization and entropy coding modules were adjusted to provide FGS capability. The coding mode is called progressive refinement, wherein successive refinements of the transform coefficients are encoded by repeatedly decreasing the quantization step size and applying a “cyclical” entropy coding akin to sub-bitplane coding.
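The refinement idea can be illustrated with a minimal sketch in Python. It is not the SVC progressive refinement process: it assumes, purely for illustration, that the quantization step size is halved at each refinement layer and that each layer quantizes the remaining error of a single transform coefficient. The “cyclical” entropy coding is not modelled, and the function name and parameters are hypothetical.

```python
def progressive_refinement(coeff, initial_step, num_layers):
    """Quantize one coefficient in successive layers (illustrative only).

    Assumption: the step size is halved per layer; the source only says
    that the step size is repeatedly decreased.
    Returns a list of (step, level, reconstruction) tuples, one per layer.
    """
    reconstruction = 0.0
    step = float(initial_step)
    layers = []
    for _ in range(num_layers):
        residual = coeff - reconstruction   # what earlier layers missed
        level = round(residual / step)      # quantized refinement level
        reconstruction += level * step      # decoder-side reconstruction
        layers.append((step, level, reconstruction))
        step /= 2                           # finer step for the next layer
    return layers


if __name__ == "__main__":
    for step, level, recon in progressive_refinement(37.0, 16, 3):
        print(f"step={step:5.1f} level={level:3d} reconstruction={recon:5.1f}")
```

Truncating the list of layers at any point still yields a valid, coarser reconstruction, which is the property that FGS relies on.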
The scalable layer structure in the current draft SVC standard is characterized by three variables, referred to as temporal_level, dependency_id and quality_level, that are signalled in the bit stream or can be derived according to the specification. The temporal_level variable is used to indicate the temporal hierarchy or frame rate.
A layer comprising pictures of a smaller temporal_level value has a smaller frame rate than a layer comprising pictures of a larger temporal_level. The dependency_id variable is used to indicate the inter-layer coding dependency hierarchy. At any temporal location, a picture of a smaller dependency_id value may be used for inter-layer prediction for coding of a picture with a larger dependency_id value. The quality_level variable is used to indicate FGS layer hierarchy. At any temporal location and with identical dependency_id value, an FGS picture with quality_level value equal to QL uses the FGS picture or base quality picture (i.e., the non-FGS picture when QL-1 = 0) with quality_level value equal to QL-1 for inter-layer prediction. For more information on SVC, see: S. Wenger, Y.-K. Wang, and M. M. Hannuksela, “RTP payload format for H.264/SVC Scalable Video Coding,” submitted for Packet Video Workshop, April 2006.
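As a rough illustration of how the three variables identify a scalable layer, the following Python sketch filters a list of coded units down to a target operating point. The CodedUnit type and the extract_operating_point function are hypothetical, and the rule of keeping every unit whose three values do not exceed the targets is a simplification; the extraction process in the SVC specification is more detailed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedUnit:
    """Hypothetical stand-in for a coded unit tagged with its layer variables."""
    temporal_level: int
    dependency_id: int
    quality_level: int


def extract_operating_point(units, max_temporal, max_dependency, max_quality):
    """Keep the units whose three layer variables do not exceed the targets.

    Simplifying assumption: a unit is needed if and only if each of its
    variables is less than or equal to the corresponding target value.
    """
    return [
        u for u in units
        if u.temporal_level <= max_temporal
        and u.dependency_id <= max_dependency
        and u.quality_level <= max_quality
    ]


if __name__ == "__main__":
    stream = [
        CodedUnit(0, 0, 0), CodedUnit(1, 0, 0),
        CodedUnit(0, 1, 0), CodedUnit(0, 1, 1), CodedUnit(2, 1, 1),
    ]
    for unit in extract_operating_point(stream, 1, 1, 0):
        print(unit)
```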
Decoded pictures used for predicting subsequent coded pictures are stored in the decoded picture buffer (DPB). To efficiently utilize the buffer memory, the DPB management processes, including the storage of decoded pictures into the DPB, the marking of reference pictures, and the output and removal of decoded pictures from the DPB, are specified.
SVC includes the coding of key pictures for which the syntax element nal_ref_idc is equal to 3. Herein an access unit containing key pictures is referred to as a key access unit. Key access units typically form the lowest temporal resolution, i.e. they typically belong to the temporal hierarchy with temporal_level equal to 0.
For a key access unit, if the desired scalable layer for playback has quality_level larger than 0, i.e. the target playback picture is an FGS picture, then two representations of the access unit will be stored in the DPB for predicting subsequent pictures. One representation corresponds to the decoded picture with dependency_id equal to the desired value (i.e. DependencyIdmax according to the SVC specification) and quality_level equal to 0. This representation is referred to as the base representation. The other representation corresponds to the decoded picture of the desired layer (with dependency_id equal to DependencyIdmax and quality_level equal to the desired value, or in other words, the value of dOiDX is equal to dOiDXmax according to the SVC specification). This representation is referred to as the enhanced representation.
For non-key access units, only one representation, the decoded picture with dOiDX equal to dOiDXmax, may be stored in the DPB.
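The storage rule of the two preceding paragraphs can be summarized in a small Python sketch. The function name and the use of (dependency_id, quality_level) pairs to identify a representation are illustrative assumptions. The case of a key access unit whose desired quality_level is 0 is not spelled out above; the sketch assumes that the single decoded picture is then stored once.

```python
def stored_representations(is_key_access_unit, dependency_id_max, desired_quality_level):
    """Return the (dependency_id, quality_level) pairs stored in the DPB.

    Hypothetical helper: a key access unit decoded for an FGS target
    (quality_level > 0) stores a base and an enhanced representation;
    every other access unit stores only the picture of the desired layer.
    """
    enhanced = (dependency_id_max, desired_quality_level)
    if is_key_access_unit and desired_quality_level > 0:
        base = (dependency_id_max, 0)
        return [base, enhanced]
    return [enhanced]


if __name__ == "__main__":
    print(stored_representations(True, 1, 2))   # key access unit, FGS target
    print(stored_representations(False, 1, 2))  # non-key access unit
```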
According to the SVC specification, decoding of any key access unit always uses only the representations of earlier decoded key access units for inter prediction and does not use decoded pictures of non-key access units for inter prediction. Decoding of non-key access units uses, for inter prediction, only the enhanced representations of key access units whenever they are available (otherwise their base representations) and the decoded pictures of other non-key access units.
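The reference selection rule just described may be sketched as follows in Python. StoredPicture and candidate_references are hypothetical names, the DPB is modelled as a plain list, and reference picture marking is ignored. For a key access unit the sketch returns every stored representation of earlier key access units and does not choose between base and enhanced, because the paragraph above does not state which one is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredPicture:
    """Hypothetical DPB entry."""
    access_unit: int
    is_key: bool
    representation: str  # "base" or "enhanced"


def candidate_references(dpb, current_is_key):
    """Select the stored pictures usable for inter prediction (sketch)."""
    key_pictures = [p for p in dpb if p.is_key]
    if current_is_key:
        # Key access units never reference non-key pictures.
        return key_pictures
    # Key access units for which an enhanced representation is available.
    with_enhanced = {p.access_unit for p in key_pictures
                     if p.representation == "enhanced"}
    refs = []
    for p in dpb:
        if not p.is_key:
            refs.append(p)  # other non-key pictures
        elif p.representation == "enhanced" or p.access_unit not in with_enhanced:
            refs.append(p)  # enhanced if available, otherwise base
    return refs


if __name__ == "__main__":
    dpb = [
        StoredPicture(0, True, "base"), StoredPicture(0, True, "enhanced"),
        StoredPicture(1, False, "enhanced"), StoredPicture(2, True, "base"),
    ]
    print(candidate_references(dpb, current_is_key=False))
```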
In SVC, the marking of the base representation and the enhanced representation of a key access unit is done at the same time. When the enhanced representation is stored in the DPB, the base representation is also stored in the DPB. When the enhanced representation is marked as “used for short-term reference”, the base representation is marked as “used for short-term reference” and as “base representation”. When the enhanced representation is marked as “used for long-term reference” and assigned a value of LongTermFrameIdx, the base representation is marked as “used for long-term reference” and as “base representation” and is assigned the same value of LongTermFrameIdx. When the enhanced representation is marked as “unused for reference”, the base representation is also marked as “unused for reference”.
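A minimal Python sketch of this synchronized marking is given below. The Representation and KeyAccessUnit classes and the mark_enhanced function are hypothetical and model only the behaviour described in the preceding paragraph: whatever marking the enhanced representation receives, the base representation receives the same marking and the same LongTermFrameIdx.

```python
from dataclasses import dataclass, field
from typing import Optional

UNUSED = "unused for reference"
SHORT_TERM = "used for short-term reference"
LONG_TERM = "used for long-term reference"


@dataclass
class Representation:
    """Hypothetical record of the marking state of one stored picture."""
    marking: str = UNUSED
    long_term_frame_idx: Optional[int] = None
    is_base_representation: bool = False


@dataclass
class KeyAccessUnit:
    enhanced: Representation = field(default_factory=Representation)
    base: Representation = field(
        default_factory=lambda: Representation(is_base_representation=True))


def mark_enhanced(au, marking, long_term_frame_idx=None):
    """Mark the enhanced representation and mirror the marking onto the base."""
    for rep in (au.enhanced, au.base):
        rep.marking = marking
        rep.long_term_frame_idx = long_term_frame_idx


if __name__ == "__main__":
    au = KeyAccessUnit()
    mark_enhanced(au, LONG_TERM, long_term_frame_idx=3)
    print(au.base)
    mark_enhanced(au, UNUSED)
    print(au.base)
```

Because the base representation cannot be marked independently in this scheme, it remains a reference picture for as long as the enhanced representation does.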
When fine granular scalability (FGS) is used in SVC and the desired layer for decoding and playback is an FGS layer, then for each so-called key picture two decoded representations of the access unit are stored in the decoded picture buffer for predicting subsequent pictures. One representation, the base representation or base key picture, corresponds to the decoded picture with dependency_id equal to the desired value and quality_level equal to 0. The other representation corresponds to the decoded picture of the desired layer. Due to the synchronized reference picture marking process of base representations and enhanced representations of key access units in SVC, some reference pictures stored in the DPB may still be marked as “used for short-term reference” or “used for long-term reference” when they are actually no longer needed for inter prediction reference. Consequently, a considerable amount of memory remains occupied unnecessarily.