Video compression using scalable techniques in the sense used herein allows a digital video signal to be represented in the form of multiple layers and/or in multiple views for multiview environments such as stereoscopic view environments. Scalable video coding techniques have been proposed and/or standardized since at least 1993.
In the following, an enhancement layer and a reference layer are distinguished. Information from the reference layer, be it reconstructed samples or meta information such as block coding modes or motion vectors, can be used for prediction of the enhancement layer through inter-layer prediction. The base layer is a special case of a reference layer in that it does not itself have another reference layer from which it is inter-layer predicted. Herein, the term “layer” can interchangeably be used with “view” (in multiview coding) or “depth map”. Therefore, there are reference views, enhancement views, and so on. Henceforth, only layers are described; however the disclosed subject matter can equally apply to views, depth maps, and similar structures.
ITU-T Rec. H.262, entitled “Information technology—Generic coding of moving pictures and associated audio information: Video”, version February/2000, (available from International Telecommunication Union (ITU), Place des Nations, 1211 Geneva 20, Switzerland, and incorporated herein by reference in its entirety), also known as MPEG-2, for example, includes in some of its profiles a scalable coding technique that allows the coding of one base and one or more enhancement layers. The enhancement layers can enhance the base layer in terms of temporal resolution such as increased frame rate (temporal scalability), spatial resolution (spatial scalability), or quality at a given frame rate and resolution (quality scalability, also known as SNR scalability).
ITU-T Rec. H.263 version 2 (1998) and later (available from International Telecommunication Union (ITU), Place des Nations, 1211 Geneva 20, Switzerland, and incorporated herein by reference in their entirety) also include scalability mechanisms allowing temporal, spatial, and SNR scalability. Specifically, an SNR enhancement layer according to H.263 Annex O is a representation of what H.263 calls the “coding error”, which is calculated between the reconstructed image of the base layer and the source image. An H.263 spatial enhancement layer can be decoded from similar information, except that the base layer reconstructed image has been upsampled, using an interpolation filter, before calculating the coding error. In case of SNR scalability, H.263 Annex O requires that the base layer picture and the enhancement layer picture have exactly the same dimensions measured in samples. For spatial scalability, H.263 Annex O requires that the resolution of the enhancement layer is exactly twice that of the base layer in each dimension. No provision has been specified for disparate picture sizes between the base layer picture (upsampled in case of spatial scalability) and the enhancement layer picture.
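The picture-size constraints described above can be summarized as a short check. The following is only a minimal sketch of the constraints as stated in this description; the function name and the mode labels "snr" and "spatial" are hypothetical and are not taken from H.263.

```python
def h263_annex_o_sizes_ok(mode, base_size, enh_size):
    """Check the picture-size constraint described above.

    mode is "snr" or "spatial" (illustrative labels, not H.263 syntax).
    base_size and enh_size are (width, height) tuples in samples.
    """
    base_w, base_h = base_size
    enh_w, enh_h = enh_size
    if mode == "snr":
        # SNR scalability: both pictures must have identical dimensions.
        return (enh_w, enh_h) == (base_w, base_h)
    if mode == "spatial":
        # Spatial scalability as described above: exactly twice the
        # base layer resolution in each dimension.
        return (enh_w, enh_h) == (2 * base_w, 2 * base_h)
    raise ValueError(f"unknown mode: {mode}")
```

Under this sketch, a 640×480 base layer combined with a 1280×720 spatial enhancement layer fails the check, which is the kind of disparate picture size the description refers to.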
ITU-T Rec. H.264 version 2 (2005) and later (available from International Telecommunication Union (ITU), Place des Nations, 1211 Geneva 20, Switzerland, and incorporated herein by reference in their entirety), and their ISO/IEC counterpart, ISO/IEC 14496 Part 10, include scalability mechanisms known as Scalable Video Coding or SVC, in their respective Annex G. Again, the scalability mechanisms of H.264 Annex G include temporal, spatial, and SNR scalability (among others such as medium granularity scalability). From version 4 (2009) onwards, ITU-T Rec. H.264 (and its ISO/IEC counterpart) also includes Annex H, entitled “Multiview Video Coding” (MVC). According to MVC, a video bitstream can include multiple “views”. One view of a coded bitstream can be a coded representation of a video signal representing the same scene as other views in the same coded bitstream. Views can be predicted from each other. In MVC, one or more reference views can be used to code another view. MVC uses multi-loop decoding. During decoding, the reference view(s) are decoded first, and are then included in the reference picture buffer and assigned values in the reference picture list when decoding the current view.
In SVC, in contrast to H.263 Annex O, it is possible that a given to-be-reconstructed enhancement layer sample does not have a corresponding base layer sample from which it can be predicted. Referring to FIG. 1, one (of many) examples where this situation can occur is when, using SNR scalability, a reference layer in 4:3 format (101) is augmented by an enhancement layer in 16:9 format (102). The side bars (103) (104) of the 16:9 picture lack base layer information from which they can be predicted. Note that the use of SNR scalability and 16:9 vs. 4:3 picture sizes are but one example of where the disclosed subject matter may be applicable.
For each region in the to-be-reconstructed enhancement layer (EL) picture, if a corresponding region exists in the reference layer (RL) picture, then the coded information of the RL is used to differentially code the EL information. A region in the sense used above can be a single sample of the given layer, or multiple samples of the layer.
In H.264 SVC intra coding of the EL, the RL's decoded samples can be used to derive residual samples in the EL. In another example, the RL's motion information can be used to differentially code the motion information of the EL. In another example, the RL's intra prediction direction can be used when coding the intra prediction direction of the EL.
If the corresponding RL information, i.e., decoded samples, coding mode, intra prediction direction, and motion vectors, is not available, inter-layer prediction is usually disabled as described below in more detail. The term “not available” refers to a corresponding sample not being part of the reference layer picture. Referring to FIG. 1, the samples that constitute the sidebars (103) (104) are not available in this sense.
As a result, standard-compliant SVC encoders and decoders check the availability of the corresponding RL information for each region in the EL, and utilize specialized handling of non-available RL information as described below.
In SVC, the enhancement layer (EL) region that corresponds to the coded reference layer (RL) picture is defined as a scaled RL picture region. FIG. 2 shows another example. A reference layer picture (201) has a horizontal size of wRL (202) samples, and a vertical size of hRL (203) samples. This reference layer is used for inter-layer prediction of an enhancement layer picture (204) using spatial scalability. The EL picture has horizontal and vertical dimensions of w (205) and h (206) samples, respectively. Note that there may or may not be any relationship between w and wRL, and between h and hRL, respectively.
Inter-layer prediction might be performed only on a part of the samples of the EL picture, specifically those of the inner rectangle (207) with dimensions wSRL and hSRL. The regions (208) of the EL picture outside of this rectangle (207), but inside the enhancement layer picture (204), contain samples that are not available. In SVC, the dimensions of the scaled RL picture within the EL picture can be derived as wSRL×hSRL with wSRL=w−oL−oR and hSRL=h−oT−oB, where the values of the offsets (oL, oR, oT, and oB) can be specified in the enhancement layer bitstream. The offsets are shown as having positive values, and in this case, not all regions of the EL picture have corresponding RL information: the clear region (207) corresponds to the scaled RL picture, and (only) this region of the EL picture has corresponding RL information and can be coded using inter-layer prediction. The shaded region (208) in the EL picture that lies outside of the clear region does not have corresponding RL information and, hence, cannot be coded using inter-layer prediction. When the offsets are all zero, the entire EL picture corresponds to the entire coded RL picture. When the offsets are negative, the entire EL picture corresponds to a sub-region within the RL picture.
Once the scaled RL picture's dimensions are derived, the scale factor for width can be defined as sW=wSRL/wRL, where wRL is the width of the RL picture. Similarly, the scale factor for height is defined as sH=hSRL/hRL, where hRL is the height of the RL picture. Then, given the EL picture's sample position (x, y), the corresponding sample position in the RL picture is defined as (xRL, yRL), where xRL=(x−oL)/sW and yRL=(y−oT)/sH.
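The derivation above can be illustrated with a short worked example. The following is a minimal sketch only: the function names are hypothetical, and exact rational arithmetic is used for clarity, whereas a practical codec would typically use integer arithmetic with its own rounding rules, which are not reproduced here.

```python
from fractions import Fraction


def scaled_rl_region(w, h, oL, oR, oT, oB):
    """Return (wSRL, hSRL), the dimensions of the scaled RL picture region."""
    return w - oL - oR, h - oT - oB


def el_to_rl(x, y, w, h, wRL, hRL, oL, oR, oT, oB):
    """Map EL sample position (x, y) to the RL position (xRL, yRL).

    Uses sW = wSRL / wRL and sH = hSRL / hRL as defined in the text.
    The result is an exact rational position (no rounding is applied).
    """
    wSRL, hSRL = scaled_rl_region(w, h, oL, oR, oT, oB)
    sW = Fraction(wSRL, wRL)
    sH = Fraction(hSRL, hRL)
    xRL = (x - oL) / sW
    yRL = (y - oT) / sH
    return xRL, yRL


# Illustrative numbers (assumed, not from the figures): a 1920x1080 EL
# picture, a 720x540 RL picture, and side offsets of 240 samples each.
print(scaled_rl_region(1920, 1080, 240, 240, 0, 0))
print(el_to_rl(1000, 500, 1920, 1080, 720, 540, 240, 240, 0, 0))
```

With these assumed numbers, wSRL×hSRL is 1440×1080, both scale factors equal 2, and the EL sample at (1000, 500) maps to the RL position (380, 250).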
In SVC, each picture is coded in 16×16 sample blocks called Macroblocks (MBs). Each MB of the EL picture can optionally be coded utilizing inter-layer prediction if and only if corresponding RL information exists for all samples of the EL MB. FIG. 3 shows the top left part of an EL picture (301) subdivided into a raster of MBs (302). Clear region (303) of the EL picture shows the scaled RL picture region that corresponds to the coded RL picture. Dark shaded region (304) shows those EL picture macroblocks not covered at all by the upsampled base layer picture and, therefore, not using inter-layer prediction. The medium shaded macroblocks (305) cover the outer edge of the (upsampled) RL picture, depicted here by a dashed line.
Only the MBs (303) depicted in clear, which lie entirely within the region bounded by the dashed line, can be coded with inter-layer prediction. Note that the MBs (305) that lie partly inside the region of the (upsampled) RL picture and partly outside of it (as indicated by the bordering line of the upsampled RL picture (306)) cannot use inter-layer prediction. In other words, the decision of whether, for example, sample data of the RL picture can be used for inter-layer prediction is made at (EL) macroblock granularity.
The restrictions described above are specified in SVC by an algorithm known as the “InCropWindow( ) process”. It checks for each EL MB whether the MB is entirely within the scaled RL picture region that has corresponding RL information. Given the MB's position (xMB, yMB), expressed in units of macroblocks as implied by the division by 16 below, the process checks whether the following are all true: xMB>=Floor[(oL+15)/16], xMB<Floor[(oL+wSRL)/16], yMB>=Floor[(oT+15)/16], and yMB<Floor[(oT+hSRL)/16]. The options for applying any inter-layer prediction are included in the enhancement layer bitstream only for the MBs that lie entirely within the scaled RL picture region, because other MBs are prohibited from using inter-layer prediction. For those other MBs, some syntax elements are not present in the coded enhancement layer bitstream, including base_mode_flag, residual_prediction_flag, and motion_prediction_flag. Consequently, the “InCropWindow( ) process” is executed for each MB in order to decide whether inter-layer prediction related syntax elements need to be parsed.
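The four conditions above can be sketched as a single predicate. This is an illustrative sketch rather than the normative text of the standard: the function name is hypothetical, and xMB and yMB are assumed to be macroblock column and row indices (sample position divided by 16), since the thresholds they are compared against are divided by 16.

```python
def in_crop_window(xMB, yMB, oL, oT, wSRL, hSRL):
    """Return True if the MB lies entirely within the scaled RL region.

    xMB, yMB : macroblock column and row index (assumed MB units).
    oL, oT   : left and top offsets of the scaled RL region, in samples.
    wSRL, hSRL : width and height of the scaled RL region, in samples.
    Python's // operator rounds toward negative infinity, matching Floor.
    """
    return (
        xMB >= (oL + 15) // 16
        and xMB < (oL + wSRL) // 16
        and yMB >= (oT + 15) // 16
        and yMB < (oT + hSRL) // 16
    )


# Assumed example: region of 1440x1080 samples starting at sample column 240.
print(in_crop_window(15, 0, 240, 0, 1440, 1080))   # first fully covered column
print(in_crop_window(14, 0, 240, 0, 1440, 1080))   # column left of the region
print(in_crop_window(15, 67, 240, 0, 1440, 1080))  # row only partly covered
```

In this assumed example, macroblock row 67 spans sample rows 1072 to 1087, of which only the first eight fall inside the 1080-row region, so the check fails for that row even though half of its samples have corresponding RL information.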
SVC's mechanisms that address disparate RL and EL picture sizes, as described above, not only require the frequent invocation of a complex process like the InCropWindow process, i.e., on a per-macroblock basis, but also make the parsing of the macroblock syntax conditional on the outcome of this process. Also, certain samples may not benefit from inter-layer prediction even if relevant reference layer samples are available, as the decision for availability is made at macroblock granularity.
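The last point can be quantified with a small calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes non-negative offsets and a scaled RL region that lies inside the EL picture. It counts the EL samples that have corresponding RL information but belong to macroblocks that fail the macroblock-granularity check.

```python
def samples_excluded_by_mb_granularity(w, h, oL, oR, oT, oB):
    """Count EL samples with RL information that cannot use it.

    Assumes non-negative offsets and a scaled RL region inside the
    w x h EL picture (illustrative assumptions, not SVC requirements).
    """
    wSRL, hSRL = w - oL - oR, h - oT - oB
    # Samples with corresponding RL information at sample granularity.
    available = wSRL * hSRL
    # Range of MB columns and rows lying entirely inside the region,
    # using the same bounds as the per-macroblock check in the text.
    x0, x1 = (oL + 15) // 16, (oL + wSRL) // 16
    y0, y1 = (oT + 15) // 16, (oT + hSRL) // 16
    usable = max(0, x1 - x0) * max(0, y1 - y0) * 16 * 16
    return available - usable


# Assumed example: 1920x1088 EL picture with 248-sample side offsets,
# which are not multiples of 16.
print(samples_excluded_by_mb_granularity(1920, 1088, 248, 248, 0, 0))
```

For these assumed dimensions the result is 17408 samples, corresponding to an eight-sample-wide column on each side of the region over all 1088 rows. When the offsets are multiples of 16 (for example 240 on each side), the count is zero.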
It would be advantageous if the aforementioned shortcomings could be avoided.