This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
Scalable coding produces scalable media bitstreams. A scalable bitstream is coded in multiple layers, and each layer, together with the lower layers it requires, is one representation of the media sequence at a certain spatial resolution, temporal resolution, quality level, or some combination of the three. A portion of a scalable bitstream can be extracted and decoded at a desired spatial resolution, temporal resolution, quality level, or some combination of the three. A scalable bitstream contains a non-scalable base layer and one or more enhancement layers. An enhancement layer may enhance the temporal resolution (i.e., the frame rate), the spatial resolution, or simply the quality of the video content represented by a lower layer or part thereof. The latest scalable video coding (SVC) specification is described in JVT-T201, “Joint Draft 7 of SVC Amendment,” 20th JVT Meeting, Klagenfurt, Austria, July 2006 (hereinafter “H.264/AVC”).
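The extraction of a portion of a scalable bitstream can be illustrated with a minimal sketch. The `CodedUnit` type, its three level fields, and the `extract_sub_bitstream` function are hypothetical names introduced only for illustration; they are not taken from the SVC specification, and the sketch assumes that a unit at a given level depends only on units at equal or lower levels.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CodedUnit:
    """Hypothetical coded unit tagged with its scalability levels."""
    spatial: int    # 0 = base spatial resolution
    temporal: int   # 0 = lowest frame rate
    quality: int    # 0 = base quality
    payload: bytes = b""


def extract_sub_bitstream(units: List[CodedUnit], max_spatial: int,
                          max_temporal: int, max_quality: int) -> List[CodedUnit]:
    """Keep only the units at or below the requested operating point.

    Assumption: lower levels never depend on higher ones, so dropping
    every unit above the target leaves a decodable representation.
    """
    return [
        u for u in units
        if u.spatial <= max_spatial
        and u.temporal <= max_temporal
        and u.quality <= max_quality
    ]


# A toy bitstream: one base layer unit plus four enhancement units.
bitstream = [
    CodedUnit(0, 0, 0),  # base layer
    CodedUnit(0, 1, 0),  # temporal enhancement
    CodedUnit(1, 0, 0),  # spatial enhancement
    CodedUnit(1, 1, 0),  # spatial + temporal enhancement
    CodedUnit(1, 1, 1),  # quality enhancement on top
]

base_only = extract_sub_bitstream(bitstream, 0, 0, 0)
full_rate_low_res = extract_sub_bitstream(bitstream, 0, 1, 0)
```

In this toy model, `base_only` holds just the base layer unit, while `full_rate_low_res` adds the temporal enhancement without any spatial or quality enhancement.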
In some cases of SVC, data in an enhancement layer can be truncated after a certain location, or at arbitrary positions, where each truncation position may include additional data representing increasingly enhanced visual quality. Such scalability is referred to as fine-grained (granularity) scalability (FGS). In contrast to FGS, the scalability provided by those enhancement layers that cannot be truncated is referred to as coarse-grained (granularity) scalability (CGS). CGS collectively includes the traditional quality (SNR) scalability and spatial scalability. Hereafter in this document, it is assumed that there are only CGS layers, though the methods can obviously be extended to cases where FGS layers are also available.
For SVC single-loop decoding, only pictures of the highest decoding layer are fully decoded. Therefore, as shown in FIG. 4, the current SVC specification maintains only one Decoded Picture Buffer (DPB), for the layer targeted for playback. Accordingly, a reference picture list is constructed only for the target layer. For example, even though memory management control operation (MMCO) and reference picture list reordering (RPLR) commands are signaled in the slice headers of lower layers, the decoding process ignores them.
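The behavior described above can be modeled with a small sketch. It is a toy model only: the `SliceHeader` type, the single `move_to_front` field standing in for an RPLR command, and the `build_reference_lists` function are hypothetical and far simpler than the actual syntax and decoding process.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SliceHeader:
    """Hypothetical slice header carrying one simplified reordering command."""
    layer: int
    move_to_front: Optional[str] = None  # toy stand-in for an RPLR command


def build_reference_lists(headers: List[SliceHeader], target_layer: int,
                          initial_list: List[str]) -> Dict[int, List[str]]:
    """Build reference picture lists the way a single-loop decoder would.

    Only the target layer gets a list; reordering commands signaled for
    lower layers are parsed but ignored.
    """
    lists: Dict[int, List[str]] = {}
    for header in headers:
        if header.layer != target_layer:
            continue  # lower-layer commands are ignored
        ref_list = list(initial_list)
        if header.move_to_front is not None:
            ref_list = [p for p in ref_list if p != header.move_to_front]
            ref_list.insert(0, header.move_to_front)
        lists[header.layer] = ref_list
    return lists


headers = [
    SliceHeader(layer=0, move_to_front="C"),  # base layer: ignored
    SliceHeader(layer=1, move_to_front="B"),  # target layer: applied
]
lists = build_reference_lists(headers, target_layer=1, initial_list=["A", "B", "C"])
```

Here `lists` contains an entry for layer 1 only, so no list exists against which a base-layer reference index could be correctly resolved.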
As shown in FIG. 5, when inter-layer motion prediction is used for the current macroblock (“MB”), the base-layer motion vector and reference index are used to predict the motion vector and reference index of the current MB. The reference index signaled in the base-layer MB is relative to the reference picture list of the base layer. However, the current SVC specification does not specify a decoding process for deriving the reference picture list of the base-layer coded pictures. Instead, the reference picture list of the target layer is used for the base layer when needed. Consequently, when the reference picture list of the base layer differs from that of the target layer, information (e.g., motion information) from a wrong reference picture of the base layer may be used.
This problem may specifically occur when temporal direct mode or spatial direct mode prediction is used. For example, assume that the current MB uses inter-layer motion prediction and that the collocated MB in the lower-layer picture uses temporal direct mode. To obtain the motion information of the collocated lower-layer MB, motion information of a lower-layer picture from an earlier decoded access unit is needed. In this case, if the list position of that lower-layer picture in the reference picture list of the lower layer is different from the list position of the target-layer picture having the same index in the reference picture list of the target layer, incorrect motion information would be referenced. Consequently, the current MB, and hence the current picture of the target layer, would be decoded incorrectly.
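The mismatch can be made concrete with a short sketch. The picture labels, the two example lists, and the `resolve_reference` helper are hypothetical values chosen for illustration; the only point shown is that the same reference index selects different pictures when it is resolved against a different list.

```python
from typing import List


def resolve_reference(ref_list: List[str], ref_idx: int) -> str:
    """Return the picture that a reference index selects in a given list."""
    return ref_list[ref_idx]


# Hypothetical lists: the target-layer list has been reordered,
# so the two layers order the same access units differently.
base_layer_list = ["pic8", "pic4", "pic0"]
target_layer_list = ["pic4", "pic8", "pic0"]

# Reference index signaled in the base-layer MB (relative to the base-layer list).
ref_idx = 0

intended = resolve_reference(base_layer_list, ref_idx)    # what the encoder meant
actual = resolve_reference(target_layer_list, ref_idx)    # what gets used instead
mismatch = intended != actual
```

With these example lists, the index intended to select `pic8` selects `pic4` instead, so motion information would be taken from the wrong picture.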
Accordingly, there is a need for a system and method for maintaining reference picture lists for lower layers when decoding an SVC bitstream containing more than one scalable layer, to ensure correct decoding when direct prediction modes are used for coding of the lower layers.