Scalable video coding (SVC) enables the coding and transmission of several representations of the same video sequence within a single bit stream, and the removal of one or more representations from the bit stream when necessary or desired, after coding but before transmission. Each representation corresponds to particular temporal, spatial or fidelity resolutions of the video sequence.
Scalability in the context of digital video coding is useful for providing a graceful degradation of resolution in response to a worsening of a transmission condition, such as a decrease in available bandwidth, or a change in network conditions, such as the presence of congestion, or for adapting to the capabilities (such as display resolution, processing power or battery power), needs or preferences of the receiver or of the decoder on the receiver side. Scalability, also called bitstream scalability, enables certain parts of the bit stream to be discarded, when necessary or desired, so that the required bit rate can be adapted after encoding, i.e. without requiring a modification of the encoding process itself. Since encoding may be a computationally intensive task, it is useful to encode the bit stream once for the highest required resolution and then to be able to remove some parts of the bit stream without having to carry out the encoding again. In other words, SVC allows partial transmission and decoding of the bit stream by sending only some of the representations. Each representation coded in the bit stream (also called SVC bit stream) is referred to as a layer.
The lowest layer is called the base layer and the successive higher layers are called enhancement or enhanced layers. Scalability involves at least the coding of a base layer and an enhancement layer. A plurality of enhancement layers may be provided. For instance, the base layer may represent the video sequence at a low spatial resolution (e.g. QVGA, standing for Quarter VGA, i.e. Quarter Video Graphics Array, and corresponding to a 320×240 resolution) while an enhancement layer may represent the video sequence at a higher spatial resolution (e.g. VGA, standing for Video Graphics Array and usually corresponding to a 640×480 resolution). In general terms, for each image (sometimes called “access unit” when coded) in the original video sequence, an enhancement layer provides a refined representation of the image compared to the representation provided by the base layer.
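By way of illustration only, the removal of layers after coding may be sketched as a simple filtering of the bit stream. The `Packet` structure, the `layer_id` numbering (0 for the base layer, 1 and above for enhancement layers) and the `extract_layers` function are hypothetical names introduced for this sketch; they are not the syntax elements of any particular standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Packet:
    layer_id: int   # assumption: 0 = base layer, 1+ = enhancement layers
    payload: bytes  # coded data of this layer for one image


def extract_layers(bitstream: List[Packet], max_layer: int) -> List[Packet]:
    """Keep only the packets of layers 0..max_layer, preserving their order."""
    return [p for p in bitstream if p.layer_id <= max_layer]


# A toy bit stream carrying a base layer and one enhancement layer.
stream = [
    Packet(0, b"base-image0"),
    Packet(1, b"enh-image0"),
    Packet(0, b"base-image1"),
    Packet(1, b"enh-image1"),
]

# Under reduced bandwidth, only the base layer is forwarded; no re-encoding occurs.
base_only = extract_layers(stream, max_layer=0)
```

In this sketch the adaptation is a pure selection of already coded data, which reflects why the bit rate can be adapted without carrying out the encoding again.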
The scalability in video coding is different from simulcast coding, i.e. independently coding each representation. Generally, SVC should be more efficient than simulcast coding. In SVC, the coding of a layer (except for instance for the base layer, which may be coded independently) should reuse some of the bandwidth, or some of the bits in the bit stream, assigned to another layer.
Video coding often involves predictive coding techniques. These techniques are notably based on the coding of the differences between images or pixels considered in a particular order. The order according to which the images or pixels of the video sequence are processed, i.e. predicted, on the encoding side is generally the same as the order according to which they are reconstructed on the decoding side. For instance, the decoding of some images, which may be called anchor images, does not require making use of previously decoded images. The decoding of other images or pictures, in contrast, requires making use of at least one previously decoded picture, which may be called reference picture. Video coding standards usually do not specify a particular method to be used for coding, but they do specify the decoding methods to be used on the receiver side. Predictive coding techniques may imply the following steps on the coding side.
First, coding parameters, also called predictive coding parameters, such as coding modes and motion vectors, are selected in order to most efficiently reconstruct an image to be coded from one or more previously reconstructed images, pixels or blocks of pixels. These coding parameters are coded in the bit stream for transmission.
Secondly, the selected predictive coding parameters are applied to the images of the video sequence on the coding side. The result of this step constitutes the so-called prediction, i.e. how a given image, pixel or block of pixels would be predicted on the decoding side, from the images, pixels or blocks of pixels previously reconstructed on the decoding side, if only these predictive coding parameters were used to decode the image. In other words, the prediction is a prediction on the coding side of how a given image or part thereof will be predicted on the decoding side. For instance, if the parameters are motion vectors, the prediction is then the so-called motion-compensated prediction.
Thirdly, a residual (or prediction error) is computed as the difference between the original image (the actual picture) and the result of the prediction based on the predictive coding parameters (the predicted picture), i.e. by subtracting the latter from the former. The residual is also coded in the bit stream for transmission (along with the predictive coding parameters, as mentioned above).
On the decoding side, the images, pixels or blocks of pixels of the video sequence are reconstructed in the specified order. The predictive coding parameters are used to predict images, pixels or blocks of pixels from the already reconstructed images, pixels or blocks of pixels of the video sequence, and the residual is then used to correct these predictions.
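The three coding-side steps and the decoding-side reconstruction described above may be sketched as follows, on a one-dimensional row of samples. This is a minimal illustration under stated assumptions: the function names, the integer motion search over the range -2 to 2 and the edge clamping are hypothetical, and the residual is kept lossless, whereas practical codecs typically transform and quantize it.

```python
from typing import List, Tuple


def predict(reference: List[int], motion: int) -> List[int]:
    """Motion-compensated prediction: shift the reference by `motion` samples,
    clamping at the edges (an assumption of this sketch)."""
    n = len(reference)
    return [reference[min(max(i - motion, 0), n - 1)] for i in range(n)]


def sad(a: List[int], b: List[int]) -> int:
    """Sum of absolute differences between two rows."""
    return sum(abs(x - y) for x, y in zip(a, b))


def encode(original: List[int], reference: List[int]) -> Tuple[int, List[int]]:
    # Step 1: select the predictive coding parameter (here, one motion value).
    motion = min(range(-2, 3), key=lambda m: sad(original, predict(reference, m)))
    # Step 2: apply it on the coding side to obtain the prediction.
    prediction = predict(reference, motion)
    # Step 3: compute the residual (original minus prediction).
    residual = [o - p for o, p in zip(original, prediction)]
    return motion, residual


def decode(motion: int, residual: List[int], reference: List[int]) -> List[int]:
    # The decoder forms the same prediction and corrects it with the residual.
    prediction = predict(reference, motion)
    return [p + r for p, r in zip(prediction, residual)]


reference = [10, 20, 30, 40, 50, 60]   # previously reconstructed row
original = [10, 10, 20, 30, 40, 52]    # row to be coded
motion, residual = encode(original, reference)
reconstructed = decode(motion, residual, reference)
```

Because the decoder repeats the prediction from the same reference, only the motion value and the mostly zero residual need to be transmitted in this sketch.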
Types of predictive coding techniques include intra coding and inter coding. Intra coding, or intra-picture coding, uses spatial prediction from spatially neighbouring regions in the same image (i.e. from neighbouring pixels or regions to be reconstructed first on the decoding side). Intra-picture coding takes advantage of the spatial correlation between pixel regions of one image. In contrast, inter coding, or inter-picture coding, uses the temporal prediction from temporally neighbouring images (i.e. from neighbouring images to be reconstructed first on the decoding side). Inter-picture coding takes advantage of the temporal correlation between images. Intra and inter coding may be combined.
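The difference between the two types of prediction may be illustrated by the following sketch, in which a row of samples is predicted either from its left neighbour in the same image (intra) or from the co-located sample of the previous image (inter). The function names, the use of 0 as predictor for the first sample and the choice of the mode with the smaller sum of absolute residuals are assumptions made for illustration.

```python
from typing import List, Tuple


def intra_residual(row: List[int]) -> List[int]:
    """Spatial prediction: each sample is predicted from its left neighbour
    (the first sample is predicted from 0, an assumption of this sketch)."""
    predictors = [0] + row[:-1]
    return [x - p for x, p in zip(row, predictors)]


def inter_residual(row: List[int], previous_row: List[int]) -> List[int]:
    """Temporal prediction: each sample is predicted from the co-located
    sample of the previously reconstructed image."""
    return [x - p for x, p in zip(row, previous_row)]


def choose_mode(row: List[int], previous_row: List[int]) -> Tuple[str, List[int]]:
    """Pick the mode whose residual has the smaller sum of absolute values."""
    candidates = {
        "intra": intra_residual(row),
        "inter": inter_residual(row, previous_row),
    }
    mode = min(candidates, key=lambda m: sum(abs(r) for r in candidates[m]))
    return mode, candidates[mode]


previous = [100, 102, 104, 106]  # row of the previous image
current = [101, 103, 105, 107]   # same row, slightly brighter
mode, residual = choose_mode(current, previous)
```

Here the two images are strongly correlated in time, so the inter residual is smaller than the intra residual and the inter mode is selected.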
In addition to these predictive coding techniques, inter-layer prediction is specific to SVC. In inter-layer prediction, as much information as possible from a lower representation of the video sequence is used for coding a higher representation of the video sequence. In other words, in order to increase the overall coding efficiency, the redundancy between the layers is taken into account by using information from a coded lower layer to predict a higher layer.
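A minimal sketch of inter-layer prediction for spatial scalability is given below. It assumes a base layer at half the resolution of the enhancement layer, upsampling by sample repetition, a lossless residual and an even number of samples; the function names are hypothetical, and practical schemes typically use more elaborate upsampling filters and may also reuse coding modes and motion data from the lower layer.

```python
from typing import List


def downsample(row: List[int]) -> List[int]:
    """Base-layer representation: keep every second sample."""
    return row[::2]


def upsample(row: List[int]) -> List[int]:
    """Inter-layer prediction signal: repeat each base-layer sample twice."""
    return [s for s in row for _ in range(2)]


def encode_enhancement(full_row: List[int], base_row: List[int]) -> List[int]:
    """Code the enhancement layer as a residual against the upsampled base layer."""
    prediction = upsample(base_row)
    return [f - p for f, p in zip(full_row, prediction)]


def decode_enhancement(residual: List[int], base_row: List[int]) -> List[int]:
    """Reconstruct the higher resolution from the base layer plus the residual."""
    prediction = upsample(base_row)
    return [p + r for p, r in zip(prediction, residual)]


full = [10, 12, 20, 22, 30, 32]          # higher-resolution row
base = downsample(full)                  # lower-resolution row (base layer)
enh_residual = encode_enhancement(full, base)
restored = decode_enhancement(enh_residual, base)
```

The enhancement layer then carries only small correction values rather than the full-resolution samples, which illustrates how redundancy between layers is exploited instead of coding each representation independently.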
An example of a video coding standard providing scalability is the H.264/AVC standard (ITU-T, H.264 (11/2007), Series H: Audiovisual and Multimedia Systems, Infrastructure of audiovisual services – Coding of moving video, Advanced video coding for generic audiovisual services, ITU-T Recommendation H.264, here referred to as “reference [1]”). Its annex G, entitled “Scalable video coding”, discloses examples of SVC techniques. An overview of the technology disclosed in this annex is provided in Schwarz H., Marpe D. and Wiegand T., Overview of the Scalable Video Coding Extension of the H.264/AVC Standard, IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 9, pp. 1103-1120, September 2007 (here referred to as “reference [2]”). Sections “I.” and “II.” of reference [2] notably provide explanations on scalability in the context of video coding.
It is desirable to provide methods, encoders and computer programs that improve the efficiency of scalable video coding while, as far as possible, not increasing encoder and decoder complexity.