The present invention relates to quality-scalable video data streams and to their generation and decoding, such as the generation and decoding of video data streams obtained by use of block-wise transformation.
The Joint Video Team (JVT) of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) is currently specifying a scalable extension of the H.264/MPEG4-AVC video coding standard. The key feature of scalable video coding (SVC) in comparison to conventional single-layer coding is that various representations of a video source with different resolutions, frame rates and/or bit-rates are provided inside a single bit-stream. A video representation with a specific spatio-temporal resolution and bit-rate can be extracted from a global SVC bit-stream by simple stream manipulations such as packet dropping. As an important feature of the SVC design, most components of H.264/MPEG4-AVC are used as specified in the standard. This includes the motion-compensated and intra prediction, the transform and entropy coding, the deblocking as well as the NAL unit packetization (NAL = Network Abstraction Layer). The base layer of an SVC bit-stream is generally coded in compliance with H.264/MPEG4-AVC, and thus each standard-conforming H.264/MPEG4-AVC decoder is capable of decoding the base layer representation when it is provided with an SVC bit-stream. New tools are only added for supporting spatial and SNR scalability.
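The extraction by packet dropping described above can be sketched as follows. This is a minimal illustration only: the `NalUnit` class and its `spatial_id`, `temporal_id` and `quality_id` fields are hypothetical simplifications introduced here and do not reproduce the NAL unit header syntax of the SVC draft.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NalUnit:
    # Hypothetical, simplified packet descriptor (not the SVC header syntax).
    spatial_id: int   # spatial resolution layer
    temporal_id: int  # frame-rate layer
    quality_id: int   # SNR (quality) layer
    payload: bytes


def extract_substream(stream: List[NalUnit], max_spatial: int,
                      max_temporal: int, max_quality: int) -> List[NalUnit]:
    """Keep only packets at or below the target operating point.

    Packets are kept or dropped as a whole; no payload is parsed or truncated.
    """
    return [
        nal for nal in stream
        if nal.spatial_id <= max_spatial
        and nal.temporal_id <= max_temporal
        and nal.quality_id <= max_quality
    ]


stream = [
    NalUnit(0, 0, 0, b"base"),
    NalUnit(0, 1, 0, b"base-t1"),
    NalUnit(0, 0, 1, b"snr-enh"),
    NalUnit(1, 0, 0, b"spatial-enh"),
]

# Lowest operating point: only the base layer packet survives.
base_only = extract_substream(stream, 0, 0, 0)
```

Because the filter inspects only the packet descriptor, such an extraction could in principle be performed by a network node without decoding any video data.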
For SNR scalability, coarse-grain/medium-grain scalability (CGS/MGS) and fine-grain scalability (FGS) are distinguished in the current Working Draft. Coarse-grain or medium-grain SNR scalable coding is achieved by using concepts similar to those used for spatial scalability. The pictures of different SNR layers are independently coded with layer-specific motion parameters. However, in order to improve the coding efficiency of the enhancement layers in comparison to simulcast, additional inter-layer prediction mechanisms have been introduced. These prediction mechanisms have been made switchable so that an encoder may freely choose which base layer information should be exploited for an efficient enhancement layer coding. Since the incorporated inter-layer prediction concepts include techniques for motion parameter and residual prediction, the temporal prediction structures of the SNR layers should be temporally aligned for an efficient use of the inter-layer prediction. It should be noted that all NAL units for a time instant form an access unit and thus have to follow each other inside an SVC bit-stream. The following three inter-layer prediction techniques are included in the SVC design.
The first one is called inter-layer motion prediction. In order to employ base-layer motion data for the enhancement layer coding, an additional macroblock mode has been introduced into SNR enhancement layers. The macroblock partitioning is obtained by copying the partitioning of the co-located macroblock in the base layer. The reference picture indices as well as the associated motion vectors are copied from the co-located base layer blocks. Additionally, a motion vector of the base layer can be used as a motion vector predictor for the conventional macroblock modes.
The second technique of redundancy reduction among the various quality layers is called inter-layer residual prediction. The usage of inter-layer residual prediction is signaled by a flag (residual_prediction_flag) that is transmitted for all inter-coded macroblocks. When this flag is true, the base layer residual signal of the co-located block is used as prediction for the residual signal of the current macroblock, so that only the corresponding difference signal is coded.
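The effect of this flag can be illustrated with a minimal sketch. The function names, the flat lists of sample values and the example numbers are assumptions made for illustration; in an actual codec the signals are two-dimensional blocks and the difference is subsequently transformed, quantized and entropy coded.

```python
from typing import List


def encode_residual(enh_residual: List[int], base_signal: List[int],
                    residual_prediction_flag: bool) -> List[int]:
    """Return the signal that would actually be coded for the macroblock."""
    if residual_prediction_flag:
        # Only the difference to the co-located base layer signal is coded.
        return [e - b for e, b in zip(enh_residual, base_signal)]
    # Without inter-layer prediction the residual is coded as it is.
    return list(enh_residual)


def decode_residual(coded: List[int], base_signal: List[int],
                    residual_prediction_flag: bool) -> List[int]:
    """Invert encode_residual at the decoder side."""
    if residual_prediction_flag:
        return [c + b for c, b in zip(coded, base_signal)]
    return list(coded)


base = [10, -4, 3, 0]   # illustrative co-located base layer signal
enh = [12, -5, 3, 1]    # illustrative enhancement layer residual

coded = encode_residual(enh, base, True)
restored = decode_residual(coded, base, True)
```

When the two layers are similar, the difference signal has smaller magnitudes than the enhancement layer residual itself, which is where the coding gain is expected to come from.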
Finally, inter-layer intra prediction is used in order to exploit redundancy among the layers. In this intra-macroblock mode, the prediction signal is built from the co-located reconstruction signal of the base layer. Inter-layer intra prediction generally requires that the base layers are completely decoded, including the computationally complex operations of motion-compensated prediction and deblocking. However, it has been shown that this problem can be circumvented when the inter-layer intra prediction is restricted to those parts of the lower layer picture that are intra-coded. With this restriction, each supported target layer can be decoded with a single motion compensation loop. This single-loop decoding mode is mandatory in the scalable H.264/MPEG4-AVC extension.
Since inter-layer intra prediction can only be applied when the co-located macroblock is intra-coded, and inter-layer motion prediction with inference of the macroblock type can only be applied when the base layer macroblock is inter-coded, both modes are signaled via a single syntax element base_mode_flag on a macroblock level. When this flag is equal to 1, inter-layer intra prediction is chosen when the base layer macroblock is intra-coded. Otherwise, that is, when the base layer macroblock is inter-coded, the macroblock mode as well as the reference indices and motion vectors are copied from the base layer macroblock.
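The decision logic tied to base_mode_flag can be sketched as follows. The `BaseMacroblock` class, the returned dictionary and the mode labels are hypothetical names chosen for this illustration, and the handling of a flag value of 0 (conventional, explicitly signaled macroblock modes) is an assumption not spelled out above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class BaseMacroblock:
    # Hypothetical summary of the co-located base layer macroblock.
    is_intra: bool
    partitioning: str = "16x16"
    ref_indices: List[int] = field(default_factory=list)
    motion_vectors: List[Tuple[int, int]] = field(default_factory=list)


def infer_enhancement_mode(base_mode_flag: int,
                           base_mb: BaseMacroblock) -> Dict[str, object]:
    """Select the enhancement layer macroblock mode from one flag."""
    if base_mode_flag != 1:
        # Assumption: the mode is then signaled by conventional syntax.
        return {"mode": "explicit"}
    if base_mb.is_intra:
        # Prediction is built from the base layer reconstruction.
        return {"mode": "inter_layer_intra"}
    # Inter-coded base macroblock: copy partitioning and motion data.
    return {
        "mode": "inter_layer_motion",
        "partitioning": base_mb.partitioning,
        "ref_indices": list(base_mb.ref_indices),
        "motion_vectors": list(base_mb.motion_vectors),
    }


intra_mb = BaseMacroblock(is_intra=True)
inter_mb = BaseMacroblock(is_intra=False, partitioning="16x8",
                          ref_indices=[0, 1],
                          motion_vectors=[(3, -1), (2, 0)])

intra_result = infer_enhancement_mode(1, intra_mb)
inter_result = infer_enhancement_mode(1, inter_mb)
```

A single flag suffices because the two inferred modes are mutually exclusive: which one applies is fully determined by the coding type of the base layer macroblock.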
In order to support a finer granularity than CGS/MGS coding, so-called progressive refinement slices have been introduced which enable fine-granular SNR scalable coding (FGS). Each progressive refinement slice represents a refinement of the residual signal that corresponds to a bisection of the quantization step size (QP decrease of 6). These signals are represented in a way that only a single inverse transform has to be performed for each transform block at the decoder side. The ordering of transform coefficient levels in progressive refinement slices allows the corresponding NAL units to be truncated at any arbitrary byte-aligned point, so that the quality of the SNR base layer can be refined in a fine-granular way. In addition to a refinement of the residual signal, it is also possible to transmit a refinement of motion parameters as part of the progressive refinement slices.
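The relation between the quantization parameter and the quantization step size can be checked numerically. The sketch below assumes the H.264/MPEG4-AVC design rule that the step size doubles for every increase of 6 in QP; the function name is illustrative only, and the exact step size tables of the standard are not reproduced.

```python
def relative_step_size(qp_delta: int) -> float:
    """Ratio Qstep(QP + qp_delta) / Qstep(QP).

    Assumes the step size doubles for every increase of 6 in QP.
    """
    return 2.0 ** (qp_delta / 6.0)


# One refinement halves the step size, i.e. moves QP by 6 toward finer quantization.
one_refinement = relative_step_size(-6)

# Two stacked refinements quarter the step size.
two_refinements = relative_step_size(-12)
```

Under this assumption, each additional progressive refinement slice halves the quantization step size relative to the layer below it.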
One drawback of the FGS coding in the current SVC draft is that it significantly increases the decoder complexity in comparison to CGS/MGS coding. On the one hand, the transform coefficients in a progressive refinement slice are coded using several scans over the transform blocks, and in each scan only a few transform coefficient levels are transmitted. For the decoder this increases the complexity, since a higher memory bandwidth is needed: all transform coefficient levels from the different scans need to be collected before the inverse transform can be carried out. On the other hand, the parsing process for progressive refinement slices is dependent on the syntax elements of the corresponding base layer slices. The order of syntax elements as well as the codeword tables for VLC coding or the probability model selection for arithmetic coding depend on the syntax elements in the base layer. This further increases the memory bandwidth for decoding, since the syntax elements of the base layer need to be accessed during the parsing of the enhancement layer.
Furthermore, the special property of progressive refinement slices, namely that they can be truncated, is difficult to use in today's packet-switched networks. Usually, a media-aware network device will either deliver or drop a packet of a scalable bit-stream, and the only error that will be visible at the application layer is a packet loss.
Therefore, not only in view of the above H.264/MPEG4-AVC but also with other video compression techniques, it would be desirable to have a coding scheme that is better adapted to today's needs, which are characterized by packet loss rather than byte-wise truncation problems.