The present invention is concerned with picture and/or video coding, and in particular, quality-scalable coding enabling bit-depth scalability using quality-scalable data streams.
The Joint Video Team (JVT) of the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG) has recently finalized a scalable extension of the state-of-the-art video coding standard H.264/AVC called Scalable Video Coding (SVC). SVC supports temporal, spatial and SNR scalable coding of video sequences or any combination thereof.
H.264/AVC as described in ITU-T Rec. H.264 & ISO/IEC 14496-10 AVC, “Advanced Video Coding for Generic Audiovisual Services,” version 3, 2005, specifies a hybrid video codec in which macroblock prediction signals are generated either in the temporal domain by motion-compensated prediction or in the spatial domain by intra prediction, and both predictions are followed by residual coding. H.264/AVC coding without the scalability extension is referred to as single-layer H.264/AVC coding. Rate-distortion (R-D) performance comparable to single-layer H.264/AVC means that the same visual reproduction quality is typically achieved within about 10% of the bit-rate. Given the above, scalability is considered as a functionality for removal of parts of the bit-stream while achieving an R-D performance at any supported spatial, temporal or SNR resolution that is comparable to single-layer H.264/AVC coding at that particular resolution.
The basic design of Scalable Video Coding (SVC) can be classified as a layered video codec. In each layer, the basic concepts of motion-compensated prediction and intra prediction are employed as in H.264/AVC. However, additional inter-layer prediction mechanisms have been integrated in order to exploit the redundancy between several spatial or SNR layers. SNR scalability is basically achieved by residual quantization, while for spatial scalability, a combination of motion-compensated prediction and oversampled pyramid decomposition is employed. The temporal scalability approach of H.264/AVC is maintained.
In general, the coder structure depends on the scalability space that is required by an application. For illustration, FIG. 8 shows a typical coder structure 900 with two spatial layers 902a, 902b. In each layer, an independent hierarchical motion-compensated prediction structure 904a,b with layer-specific motion parameters 906a,b is employed. The redundancy between consecutive layers 902a,b is exploited by inter-layer prediction concepts 908 that include prediction mechanisms for motion parameters 906a,b as well as texture data 910a,b. A base representation 912a,b of the input pictures 914a,b of each layer 902a,b is obtained by transform coding 916a,b similar to that of H.264/AVC. The corresponding NAL units (NAL: Network Abstraction Layer) contain motion information and texture data; the NAL units of the base representation of the lowest layer, i.e. 912a, are compatible with single-layer H.264/AVC.
The bit-streams output by the base layer coding 916a,b and the progressive SNR refinement texture coding 918a,b of the respective layers 902a,b are multiplexed by a multiplexer 920 to produce the scalable bit-stream 922. This bit-stream 922 is scalable in time, space and SNR quality.
Summarizing, in accordance with the above scalable extension of the video coding standard H.264/AVC, temporal scalability is provided by using a hierarchical prediction structure. For this hierarchical prediction structure, that of the single-layer H.264/AVC standard may be used without any changes. For spatial and SNR scalability, additional tools have to be added to single-layer H.264/MPEG-4 AVC, as described in the SVC extension of H.264/AVC. All three scalability types can be combined in order to generate a bit-stream that supports a large degree of combined scalability.
Problems arise when a video source signal has a different dynamic range than that required by the decoder or player, respectively. In the current SVC standard described above, the scalability tools are only specified for the case that both the base layer and the enhancement layer represent a given video source with the same bit depth of the corresponding arrays of luma and/or chroma samples. Hence, considering different decoders and players, respectively, requiring different bit depths, several coding streams, each dedicated to one of the bit depths, would have to be provided separately. In a rate/distortion sense, however, this means increased overhead and reduced efficiency.
There have already been proposals to add scalability in terms of bit depth to the SVC standard. For example, Shan Liu et al. describe in an input document to the JVT, namely JVT-X075, the possibility of deriving an inter-layer prediction from a lower bit-depth representation of a base layer by use of an inverse tone mapping, according to which an inter-layer predicted or inversely tone-mapped pixel value p′ is calculated from a base layer pixel value p_b by p′ = p_b · scale + offset, stating that the inter-layer prediction would be performed on macroblocks or smaller block sizes. In JVT-Y067, Shan Liu presents results for this inter-layer prediction scheme. Similarly, Andrew Segall et al. propose in JVT-X071 an inter-layer prediction for bit-depth scalability according to which a gain plus offset operation is used for the inverse tone mapping. The gain parameters are indexed and transmitted in the enhancement layer bit-stream on a block-by-block basis. The signaling of the scale factors and offset factors is accomplished by a combination of prediction and refinement. Further, it is described that high-level syntax supports coarser granularities than transmission on a block-by-block basis. Reference is also made to Andrew Segall, “Scalable Coding of High Dynamic Range Video,” in ICIP 2007, I-1 to I-4, and to the JVT documents JVT-X067 and JVT-W113, also stemming from Andrew Segall.
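For illustration only, the linear inverse tone mapping mentioned above (base-layer sample times scale, plus offset) can be sketched as follows. This is a minimal sketch and not the method of any of the cited JVT documents: the function names, the integer-valued scale and offset, the 10-bit target depth, and the clipping to the valid sample range are all assumptions made for this example.

```python
def inverse_tone_map(base_block, scale, offset, high_bit_depth=10):
    """Predict higher bit-depth samples from base-layer samples.

    Applies p' = pb * scale + offset to every sample of a block.
    Assumptions of this sketch: integer scale/offset, and clipping of
    the result to [0, 2**high_bit_depth - 1].
    """
    max_value = (1 << high_bit_depth) - 1
    return [
        [min(max(pb * scale + offset, 0), max_value) for pb in row]
        for row in base_block
    ]


def prediction_residual(enh_block, predicted_block):
    """Difference between the enhancement-layer samples and the prediction.

    In a layered coder this residual is what would remain to be coded
    for the enhancement layer (hypothetical helper for illustration).
    """
    return [
        [e - p for e, p in zip(enh_row, pred_row)]
        for enh_row, pred_row in zip(enh_block, predicted_block)
    ]


if __name__ == "__main__":
    base = [[0, 128, 255], [16, 64, 200]]            # 8-bit base-layer block
    enh = [[3, 515, 1020], [66, 260, 801]]           # 10-bit enhancement block
    pred = inverse_tone_map(base, scale=4, offset=2)  # per-block parameters
    print(pred)
    print(prediction_residual(enh, pred))
```

The better the scale and offset fit a given block, the smaller the residual becomes, which is why the cited proposals signal these parameters at block granularity.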
Although the above-mentioned proposals for using an inverse tone mapping to obtain a prediction from a lower bit-depth base layer remove some of the redundancy between the lower bit-depth information and the higher bit-depth information, it would be favorable to achieve an even better efficiency in providing such a bit-depth scalable bit-stream, especially in terms of rate/distortion performance.