A current project of the Joint Video Team (JVT) of the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG) is the development of a scalable extension of the state-of-the-art video coding standard H.264/MPEG4-AVC. The current working draft, as described in T. Wiegand, G. J. Sullivan, J. Reichel, H. Schwarz and M. Wien, eds., “Scalable Video Coding - Joint Draft 7,” Joint Video Team, Doc. JVT-T201, Klagenfurt, Austria, July 2006, and J. Reichel, H. Schwarz and M. Wien, eds., “Joint Scalable Video Model JSVM-7,” Joint Video Team, Doc. JVT-T202, Klagenfurt, Austria, July 2006, supports temporal, spatial and SNR scalable coding of video sequences or any combination thereof.
H.264/MPEG4-AVC, as described in ITU-T Rec. H.264 & ISO/IEC 14496-10 AVC, “Advanced Video Coding for Generic Audiovisual Services,” version 3, 2005, specifies a hybrid video codec in which macroblock prediction signals are generated either in the temporal domain by motion-compensated prediction or in the spatial domain by intra prediction, and both predictions are followed by residual coding. H.264/MPEG4-AVC coding without the scalability extension is referred to as single-layer H.264/MPEG4-AVC coding. Rate-distortion (R-D) performance comparable to single-layer H.264/MPEG4-AVC means that the same visual reproduction quality is typically achieved at a bit-rate within about 10% of the single-layer bit-rate. Given the above, scalability is understood as the ability to remove parts of the bit-stream while achieving, at any supported spatial, temporal or SNR resolution, an R-D performance comparable to that of single-layer H.264/MPEG4-AVC coding at that particular resolution.
The basic design of the scalable video coding (SVC) extension can be classified as that of a layered video codec. In each layer, the basic concepts of motion-compensated prediction and intra prediction are employed as in H.264/MPEG4-AVC. However, additional inter-layer prediction mechanisms have been integrated in order to exploit the redundancy between several spatial or SNR layers. SNR scalability is essentially achieved by residual quantization, while for spatial scalability a combination of motion-compensated prediction and oversampled pyramid decomposition is employed. The temporal scalability approach of H.264/MPEG4-AVC is maintained.
In general, the coder structure depends on the scalability space that may be used in an application. For illustration, FIG. 3 shows a typical coder structure 900 with two spatial layers 902a,b. In each layer, an independent hierarchical motion-compensated prediction structure 904a,b with layer-specific motion parameters 906a,b is employed. The redundancy between consecutive layers 902a,b is exploited by inter-layer prediction concepts 908 that include prediction mechanisms for motion parameters 906a,b as well as texture data 910a,b. A base representation 912a,b of the input pictures 914a,b of each layer 902a,b is obtained by transform coding 916a,b similar to that of H.264/MPEG4-AVC. The corresponding NAL units (NAL: Network Abstraction Layer) contain motion information and texture data; the NAL units of the base representation of the lowest layer, i.e. 912a, are compatible with single-layer H.264/MPEG4-AVC. The reconstruction quality of the base representations can be improved by an additional coding 918a,b of so-called progressive refinement slices; the corresponding NAL units can be arbitrarily truncated in order to support fine granular quality scalability (FGS) or flexible bit-rate adaptation.
The bit-streams output by the base layer coding 916a,b and the progressive SNR refinement texture coding 918a,b of the respective layers 902a,b are multiplexed by a multiplexer 920 to yield the scalable bit-stream 922. This bit-stream 922 is scalable in time, space and SNR quality.
In summary, in accordance with the above scalable extension of the video coding standard H.264/MPEG4-AVC, temporal scalability is provided by using a hierarchical prediction structure. For this hierarchical prediction structure, that of the single-layer H.264/MPEG4-AVC standard may be used without any changes. For spatial and SNR scalability, additional tools have to be added to single-layer H.264/MPEG4-AVC. All three scalability types can be combined in order to generate a bit-stream that supports a large degree of combined scalability.
For SNR scalability, coarse-grain scalability (CGS) and fine-granular scalability (FGS) are distinguished. With CGS, only selected SNR scalability layers are supported, and the coding efficiency is optimized for coarse rate graduations, such as a factor of 1.5 to 2 from one layer to the next. FGS enables the truncation of NAL units at any arbitrary, possibly byte-aligned, point. NAL units represent bit packets, which are serially aligned in order to represent the scalable bit-stream 922 output by multiplexer 920.
In order to support fine-granular SNR scalability, so-called progressive refinement (PR) slices have been introduced. Progressive refinement slices contain refinement information for refining the reconstruction quality available for that slice from the respective base layer bit-stream 912a,b. More precisely, each NAL unit for a PR slice represents a refinement signal that corresponds to a bisection of the quantization step size (a QP decrease of 6). These signals are represented in such a way that only a single inverse transform has to be performed for each transform block at the decoder side. In other words, the refinement signal represented by a PR NAL unit refines the transform coefficients of the transform blocks into which a current picture of the video has been separated. At the decoder side, this refinement signal may be used to refine the transform coefficients within the base layer bit-stream before performing the inverse transform in order to reconstruct the texture of the prediction residual used for reconstructing the actual picture by use of a spatial and/or temporal prediction, such as by means of motion compensation.
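The relation between a QP decrease of 6 and a bisection of the quantization step size may be illustrated by the following minimal sketch. It is a simplified scalar model and not the refinement coding of the working draft: the function names `qstep`, `quantize` and `refine` are hypothetical, the step size uses the common approximation that it doubles for every QP increase of 6, and a single floating-point coefficient stands in for a transform block.

```python
import math


def qstep(qp: int) -> float:
    """Approximate quantization step size (assumption: doubles every 6 QP)."""
    return 2.0 ** ((qp - 4) / 6.0)


def quantize(value: float, step: float) -> int:
    """Uniform scalar quantizer returning an integer level (illustrative only)."""
    return round(value / step)


def refine(coeff: float, base_qp: int, num_refinements: int) -> list[float]:
    """Reconstruct one coefficient from a base layer plus refinement signals.

    Each refinement re-quantizes the remaining error with half the previous
    step size (QP reduced by 6). All contributions are summed in the
    coefficient domain, so one inverse transform would suffice afterwards.
    """
    qp = base_qp
    recon = quantize(coeff, qstep(qp)) * qstep(qp)
    reconstructions = [recon]
    for _ in range(num_refinements):
        qp -= 6  # bisection of the quantization step size
        step = qstep(qp)
        recon += quantize(coeff - recon, step) * step
        reconstructions.append(recon)
    return reconstructions


if __name__ == "__main__":
    for i, value in enumerate(refine(37.3, base_qp=28, num_refinements=2)):
        print(f"after {i} refinement(s): {value}")
```

In this model the reconstruction error is bounded by half the step size of the last applied refinement, so each additional PR NAL unit halves the error bound.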
The progressive refinement NAL units can be truncated at any arbitrary point, so that the quality of the SNR base layer can be improved in a fine-granular way. For this purpose, the coding order of the transform coefficient levels has been modified. Instead of scanning the transform coefficients macroblock-by-macroblock, as is done in (normal) slices, the transform coefficient blocks are scanned in separate paths, and in each path only a few coding symbols for a transform coefficient block are coded. With the exception of the modified coding order, the CABAC entropy coding as specified in H.264/MPEG4-AVC is re-used.
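The effect of the modified coding order on truncation may be illustrated by the following minimal sketch. It is not the scan defined in the working draft: the function names and the parameter `symbols_per_path` are hypothetical, and each coding symbol is modeled simply as a (block index, coefficient index) pair.

```python
def block_by_block_order(num_blocks: int, coeffs_per_block: int) -> list[tuple[int, int]]:
    """Order used as a stand-in for normal slices: finish one block, then the next."""
    return [(b, k) for b in range(num_blocks) for k in range(coeffs_per_block)]


def path_based_order(num_blocks: int, coeffs_per_block: int,
                     symbols_per_path: int = 1) -> list[tuple[int, int]]:
    """Illustrative refinement order: each path visits every block and codes
    only a few symbols per block before moving on to the next path."""
    order = []
    for start in range(0, coeffs_per_block, symbols_per_path):
        stop = min(start + symbols_per_path, coeffs_per_block)
        for b in range(num_blocks):
            for k in range(start, stop):
                order.append((b, k))
    return order


def blocks_touched(order: list[tuple[int, int]], truncate_at: int) -> set[int]:
    """Blocks that received at least one symbol before the truncation point."""
    return {b for b, _ in order[:truncate_at]}


if __name__ == "__main__":
    # Truncate after 8 symbols of a slice with 4 blocks of 16 coefficients each.
    print(blocks_touched(block_by_block_order(4, 16), 8))
    print(blocks_touched(path_based_order(4, 16), 8))
```

With the path-based order, a truncated NAL unit spreads the retained symbols over all blocks instead of refining only the first blocks of the slice, which is what makes truncation at an arbitrary point useful.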
The single-layer H.264/MPEG4-AVC coding standard has been developed for the use of a fixed sampling structure among possible chroma sampling structures, such as, for example, 4:2:0 and 4:2:2. The different chroma sampling capabilities are included in different profiles of the standard. In this regard, reference is made to Marpe, Wiegand, Sullivan: “The H.264/MPEG4 Advanced Video Coding Standard and its applications”, IEEE Communications Magazine, August 2006, p. 134-143. In 4:2:0, for example, the chroma content, which indicates the extent to which the color deviates from gray and is defined by two chroma components, has, per chroma component, merely one fourth of the number of samples of the luma content, which represents brightness and is defined by one luma component. In other words, the number of chroma component samples in both the horizontal and vertical dimensions is half the number of luma samples. The coding precision used per sample is fixed to be 8 bits or 10 bits, depending on the profile of the standard used. Again, reference is made to the just mentioned article. For the sake of completeness, it is noted that the term luma, according to the standard, actually means a weighted sum of non-linear or gamma-corrected RGB contributions. However, according to another view, luma may be viewed as luminance, which refers to the linear relationship of the RGB contributions. According to the present application, both views shall equally apply.
In general, the term chroma sampling format refers to the number and position of the chroma samples relative to the number and position of the corresponding luma samples. Three examples of possible sampling formats are described now. As has already been described, according to the 4:2:0 sampling, the chroma signal has half the horizontal and half the vertical resolution as compared to the luma signal. The format is illustrated in FIG. 4, where the crosses indicate the locations of the luma samples, whereas the circles represent the locations of the chroma samples, where each chroma sample may consist of two chroma components, such as Cb and Cr. Another sampling format is 4:2:2, where the chroma signal has half the horizontal and the same vertical resolution as the luma signal. This is shown in FIG. 5. According to a 4:4:4 chroma sampling format, the chroma signal has the same horizontal and vertical resolution as the luma signal or content, respectively. This is illustrated in FIG. 6.
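The three sampling formats just described may be summarized by the following minimal sketch, which derives the size of each chroma plane from the size of the luma plane. The names `CHROMA_SUBSAMPLING`, `chroma_plane_size` and `samples_per_picture` are hypothetical, and the luma dimensions are assumed to be even so that the integer division is exact.

```python
# Horizontal and vertical subsampling factors of chroma relative to luma.
CHROMA_SUBSAMPLING = {
    "4:2:0": (2, 2),  # half horizontal, half vertical resolution
    "4:2:2": (2, 1),  # half horizontal, same vertical resolution
    "4:4:4": (1, 1),  # same horizontal and vertical resolution
}


def chroma_plane_size(luma_width: int, luma_height: int,
                      chroma_format: str) -> tuple[int, int]:
    """Width and height of one chroma component plane (Cb or Cr)."""
    sub_w, sub_h = CHROMA_SUBSAMPLING[chroma_format]
    return luma_width // sub_w, luma_height // sub_h


def samples_per_picture(luma_width: int, luma_height: int,
                        chroma_format: str) -> int:
    """Total number of samples: one luma plane plus two chroma planes."""
    cw, ch = chroma_plane_size(luma_width, luma_height, chroma_format)
    return luma_width * luma_height + 2 * cw * ch


if __name__ == "__main__":
    for fmt in CHROMA_SUBSAMPLING:
        print(fmt, chroma_plane_size(352, 288, fmt), samples_per_picture(352, 288, fmt))
```

For a 352x288 luma plane, for example, each chroma plane in 4:2:0 is 176x144, i.e. one fourth of the number of luma samples per chroma component.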
Problems arise when a color video source signal has a different dynamic range and/or a different chroma sampling format than the one usable by the decoder or player, respectively. In the above current SVC working draft, the scalability tools are only specified for the case that both the base layer and the enhancement layer represent a given video source with the same bit depth of the corresponding arrays of luma and chroma samples, and, in addition, under the assumption that the chroma sampling relative to the luma sampling, i.e., the chroma sampling format, is fixed for the base and enhancement layer(s). Hence, considering different decoders and players, respectively, requiring different bit depths and chroma sampling formats, a separate coding stream would have to be provided for each combination of bit depth and chroma sampling format requirements. In a rate/distortion sense, however, this means increased overhead and reduced efficiency.
Thus, it would be desirable to provide a coding scheme that overcomes this deficiency.