High Efficiency Video Coding (HEVC, ISO/IEC 23008-2 MPEG-H Part 2/ITU-T H.265) is the current joint video coding standardization project of the ITU-T Video Coding Experts Group (ITU-T Q.6/SG 16) and ISO/IEC Moving Picture Experts Group (ISO/IEC JTC 1/SC 29/WG 11). The core part of HEVC, as well as the Range, Scalable (SHVC) and multiview (MV-HEVC) extensions, are finalized and efforts are directed towards the standardization of the screen content coding (SCC) extension. Each part or extension also defines various profiles, i.e. implicit parameters or limits on them, such as Main, Main10, Scalable Main, Scalable Main 10, 4:4:4 8 bits, and the like.
Considerable research has been conducted on the definition of scalable extensions for video compression standards. This research was mainly motivated by the wish to offer video streams having adaptation capabilities. Indeed, the same video can be used for different purposes, by different clients having different display, decoding, or network capabilities. In order to address these adaptation needs, several types of scalability were defined, the most popular being temporal scalability, spatial scalability, and quality scalability, also known as SNR (Signal-to-Noise Ratio) scalability. SHVC is an example of such an extension, defined on top of the HEVC standard.
A simple approach to encoding several versions of the same video data consists in encoding each version independently. However, it is well known that better compression performance is obtained by exploiting as much as possible the correlations existing between the different versions. To do so, scalable or multi-view video encoders start by encoding one version of the video that becomes a base or a reference version. This version is self-contained, meaning that it does not refer to any other version. The resulting stream representing the base version is in general fully compliant with the core standard, for instance with HEVC in the case of SHVC and MV-HEVC. The base version may however be compliant with another extension, such as Range Extensions, when it is 4:4:4. Other versions are then encoded predictively with respect to this base version and exploit the correlations. The prediction can be either direct, with a direct dependence on the base version, or indirect, by referring to an intermediate version encoded between the base version and the current version. Such an intermediate version is then also a reference version. One can note that the terminology “reference version” can also apply to a base version.
In scalable encoding, the base version is generally called the “base layer” or “reference layer” and provides the lowest quality, and the lowest spatial and temporal resolution. Other versions are called “enhancement layers”. Enhancement layers could enhance the quality, the spatial resolution or the temporal resolution of a base layer.
In multi-view video coding, the reference version is generally called the main view and the other versions are called the dependent views.
Further improvements of the compression efficiency can be obtained by taking advantage of the encoding choices made in a base or a reference version. Indeed, since images are correlated, similar encoding choices are likely to be made. As a consequence, some syntax elements can be either inferred or predicted from the same syntax elements in a reference version. In particular, both SHVC and MV-HEVC use motion information of the base or reference versions to predict motion information of the other versions.
FIG. 1 is a block diagram illustrating an encoder implementing the scalable extension of HEVC as defined in the 3rd working draft (JCTVC-N1008: High efficiency video coding (HEVC) scalable extension draft 3, output document of JCT-VC, 14th meeting, Vienna, AT, 25 Jul.-2 Aug. 2013). As can be seen in FIG. 1, the encoder comprises two stages: a first stage noted 100A for encoding a base layer and a second stage denoted 100B for encoding an enhancement layer. Further stages similar to the second stage could be added to the encoder depending on the number of scalable layers to be encoded.
The first stage 100A aims at encoding an HEVC compliant base layer. The input to this non-scalable stage comprises an original sequence of images, obtained by applying a down-sampling (step 110) to images (105) if the different layers have different spatial resolutions. In a first step, during the encoding, an image is divided into blocks of pixels (step 115A), called coding units (CU) in the HEVC standard. Each block is then processed during a motion estimation operation (step 120A), which comprises a step of searching, among the reference pictures stored in a dedicated image buffer (125A), also called frame or picture buffer, for reference blocks that would provide a good prediction of the block to encode.
This motion estimation step provides one or more reference image indexes representing one or more indexes in the image buffer of images containing the found reference blocks, as well as corresponding motion vectors indicating the position of the reference blocks in the reference images.
Next, during a motion compensation step (130A), the estimated motion vectors are applied to the found reference blocks for computing a temporal residual block which corresponds to the difference between a predictor block, obtained through motion compensation, and the original block to predict.
In parallel or sequentially after the temporal prediction steps, an Intra prediction step (step 135A) is carried out to determine a spatial prediction mode that would provide the best performance to predict the current block. Again, a spatial residual block is computed. In this case, it is computed as being the difference between a spatial predictor computed using pixels in the neighbourhood of the block to encode and the original block to predict.
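The residual computation described for the temporal and spatial predictions can be illustrated with a minimal sketch. It assumes integer-pel motion, toy sample values, and the sign convention "original minus predictor"; the helper names `fetch_block` and `residual` are hypothetical and are not part of any standard.

```python
def fetch_block(image, x, y, size):
    """Return the size x size block whose top-left sample is at (x, y)."""
    return [row[x:x + size] for row in image[y:y + size]]

def residual(original, predictor):
    """Sample-wise difference between the original block and its predictor."""
    return [[o - p for o, p in zip(orow, prow)]
            for orow, prow in zip(original, predictor)]

# Toy 4x4 reference image and a 2x2 block to encode located at (1, 1).
reference = [[10, 11, 12, 13],
             [20, 21, 22, 23],
             [30, 31, 32, 33],
             [40, 41, 42, 43]]
original = [[22, 24],
            [32, 33]]
mv = (1, 0)  # integer-pel displacement (dx, dy), an assumption of this sketch

# Motion compensation: fetch the predictor at the displaced position.
predictor = fetch_block(reference, 1 + mv[0], 1 + mv[1], 2)
temporal_residual = residual(original, predictor)
```

The same `residual` helper would apply to a spatial predictor built from neighbouring samples; only the way the predictor is obtained differs.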
Afterwards, a coding mode selection mechanism (step 140A) chooses the coding mode to be used, among the spatial and temporal prediction modes, which provides the best rate distortion trade-off in the coding of the current block. Depending on the selected prediction mode, steps of applying a transform of the DCT type (Discrete Cosine Transform) and a quantization (step 145A) to the residual prediction block are carried out. Next, the quantized coefficients (and associated motion data) of the prediction information as well as the mode information are encoded using entropy coding (step 150A). The compressed data 155 associated with the coded current block are then sent to an output buffer.
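One common way to express a rate distortion trade-off is a Lagrangian cost J = D + λ·R, where D is the distortion and R the rate. The sketch below assumes this formulation, which the text above does not mandate, and uses made-up distortion and rate figures; `select_mode` is a hypothetical helper.

```python
def select_mode(candidates, lam):
    """Pick the candidate minimizing the Lagrangian cost J = D + lam * R."""
    return min(candidates, key=lambda c: c["distortion"] + lam * c["rate"])

# Illustrative figures only: intra is more accurate but costs more bits.
candidates = [
    {"mode": "MODE_INTRA", "distortion": 120.0, "rate": 40},
    {"mode": "MODE_INTER", "distortion": 150.0, "rate": 10},
]

best_high_lambda = select_mode(candidates, lam=2.0)  # rate weighs heavily
best_low_lambda = select_mode(candidates, lam=0.5)   # distortion dominates
```

With a large λ the cheaper inter candidate wins, while a small λ favours the lower-distortion intra candidate.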
It is to be noted that HEVC has adopted an improved process for encoding motion information. Indeed, while in the previous video compression standards, motion information was predicted using a predictor corresponding to a median value computed on the spatially neighbouring blocks of the block to encode, in HEVC a competition is performed on predictors corresponding to neighbouring blocks to determine the predictor offering the best rate distortion performance. In addition, motion predictor candidates comprise the motion information related to spatially neighbouring blocks and to temporally collocated blocks belonging to another encoded image. As a consequence, motion information of previously encoded images needs to be stored to allow a prediction of motion information. In the current version of the standard, this information is optionally stored in a compressed form by the encoder and the decoder to limit the memory usage of the encoding and decoding process.
After the current block has been encoded (step 145A), it is reconstructed. To that end, an inverse quantization (also called scaling) and inverse transform step is carried out (step 160A). This step is followed (if needed) by a sum between the inverse transformed residual and the prediction block of the current block in order to form the reconstructed block. The reconstructed image composed of the reconstructed blocks is post filtered (step 165A), e.g. using deblocking and sample adaptive offsets filters of HEVC. The post-filtered reconstructed image is finally stored in the image buffer 125A, also referred to as the DPB (Decoded Picture Buffer), so that it is available for use as a reference picture to predict any subsequent images to be encoded.
The motion information in the DPB associated with this image is stored in a summarized form in order to limit the memory required to store this information. The first step of the summarization process consists in dividing the image into blocks of size 16×16. Then each 16×16 block is associated with motion information representative of the original motion of the blocks of the encoded image included in this 16×16 block.
Finally, an entropy coding step is applied to the coding mode and, in case of an inter CU, to the motion data, as well as the quantized DCT coefficients previously calculated. This entropy coder encodes each of these data into their binary form and encapsulates the so-encoded block into a container called NAL unit (Network Abstraction Layer). A NAL unit contains all encoded coding units from a given slice. A coded HEVC bit-stream consists of a series of NAL units.
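The summarization can be sketched as a sub-sampling of the motion field. The sketch assumes that motion is initially stored on a 4×4 grid and that the representative of each 16×16 block is the motion of its top-left 4×4 unit; both choices are assumptions made for illustration, and `summarize_motion` is a hypothetical helper.

```python
def summarize_motion(field, unit=4, block=16):
    """Keep one motion entry per block x block area of the image.

    field[y][x] holds the motion of the unit x unit area at (x, y) in unit
    coordinates. The representative is taken at the top-left unit of each
    block, which is an assumption of this sketch.
    """
    step = block // unit
    return [row[::step] for row in field[::step]]

# 32x32 image described on a 4x4 grid: 8x8 entries, each a (mvx, mvy) pair.
field = [[(x, y) for x in range(8)] for y in range(8)]
summary = summarize_motion(field)
```

In this toy case the 64 stored entries are reduced to 4, one per 16×16 block.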
As can be seen in FIG. 1, the second stage 100B of the scalable encoder is similar to the first stage. Nevertheless, as will be described in greater detail below, high-level changes have been adopted, in particular in the image buffer management 125B. As can be seen, this buffer receives reconstructed images from the base layer, in addition to mode and motion information. An optional intermediate up-sampling step can be added when the two scalable layers have different spatial resolutions (step 170). This information, obtained from the reference layer, is then used by other modules of stage 100B in a way similar to the ones of stage 100A. Steps 115B, 120B, 130B, 135B, 140B, 145B, 150B, 160B, and 165B correspond to steps 115A, 120A, 130A, 135A, 140A, 145A, 150A, 160A, and 165A, described by reference to stage 100A, respectively.
FIG. 2 is a block diagram illustrating an SHVC decoder compliant with a bit-stream such as the one generated by the SHVC encoder illustrated in FIG. 1. The scalable stream to be decoded, denoted 200, is made of a base layer and an enhancement layer that are multiplexed (of course, the scalable stream may comprise several enhancement layers). The two layers are de-multiplexed (step 205) and provided to their respective decoding stages denoted 210A and 210B.
Stage 210A is in charge of decoding the base layer. In this stage, the base layer bit-stream is first decoded to extract coding units (or blocks) of the base layer. More precisely, an entropy decoding step (step 215A) provides the coding mode, the motion data (reference picture indexes, motion vectors of INTER coded macroblocks, and direction of prediction for intra prediction), and residual data associated with the blocks. Next, the quantized DCT coefficients constituting the residual data are processed during an inverse quantization operation and an inverse transform operation (step 220A).
Depending on the mode associated with the block being processed (step 225A), a motion compensation step (step 230A) or an Intra prediction step (step 235A) is performed, and the resulting predictor is added to the reconstructed residual obtained in step 220A. Next, a post-filtering step is applied to remove encoding artefacts (step 240A). It corresponds to the filtering step 165A in FIG. 1, performed at the encoder's end.
The so-reconstructed blocks are then gathered in the reconstructed image which is stored in the decoded picture buffer denoted 245A in addition to the motion information associated with the INTER coded blocks.
Stage 210B takes charge of the decoding of the enhancement layer. Similarly to the decoding of the reference layer, a first step of decoding the enhancement layer is directed to entropy decoding of the enhancement layer (step 215B), which provides the coding modes, the motion or intra prediction information, as well as the transformed and quantized residual information of blocks of the enhancement layer.
Next, quantized transformed coefficients are processed in an inverse quantization operation and in an inverse transform operation (step 220B). An INTER or INTRA predictor is then obtained (step 230B or step 235B) depending on the mode as obtained after entropy decoding (step 225B).
In the case where the INTER mode is used to obtain INTER predicted blocks, the motion compensation step to be performed (step 230B) requires the decoding of motion information. To that end, the index of the predictor selected by the encoder is obtained from the bit-stream along with a motion information residual. The motion vector predictor and the motion residual are then combined to obtain the decoded motion information, allowing determination of the INTER predictor to be used. Next, the reconstructed temporal residual is added to the identified INTER predictor to obtain the reconstructed block.
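The reconstruction of a motion vector from a signalled predictor index and a motion residual can be sketched as follows. The candidate list is assumed to have been derived already, and `decode_motion_vector` is a hypothetical helper used only for illustration.

```python
def decode_motion_vector(candidates, predictor_index, mv_residual):
    """Combine the selected motion vector predictor with the decoded residual."""
    mvp = candidates[predictor_index]
    return (mvp[0] + mv_residual[0], mvp[1] + mv_residual[1])

# Two illustrative predictor candidates; the bit-stream signals index 1.
candidates = [(4, -2), (0, 8)]
mv = decode_motion_vector(candidates, predictor_index=1, mv_residual=(-3, 1))
```

Only the index and the residual need to be transmitted, since the decoder derives the same candidate list as the encoder.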
Reconstructed blocks are then gathered in a reconstructed image on which a post-filtering step is applied (step 240B) before storage in the image buffer denoted 245B of the enhancement layer. To be compliant with the encoder, the policy applied by the encoder for the management of the image buffer of the enhancement layer is applied by the decoder. Accordingly, the enhancement layer image buffer receives motion and mode information from the base layer along with reconstructed image data, that are interpolated if necessary (step 250).
As mentioned above, it has been decided during the development of the scalable extension of HEVC to avoid as much as possible the definition of new coding tools specific to the scalable format. As a consequence, the decoding process and the syntax at the coding unit (block) level in an enhancement layer have been preserved and only high-level changes to the HEVC standard introduced.
Inter layer prediction of an image of an enhancement layer is obtained, in particular, through the insertion of information representing the corresponding image of the reference layer in the image buffer (references 125B in FIGS. 1 and 245B in FIG. 2) of the enhancement layer. The inserted information comprises decoded pixel information and motion information. This information can be interpolated when the scalable layers have different spatial resolutions. The references to these images are then inserted at the end of specific reference image lists, depending on the type of the current slice of the enhancement layer.
It is to be recalled that according to HEVC, images are coded as independently decodable slices (i.e. independently decodable strings of CTUs (Coding Tree Units)). There exist three types of slices:
    intra slices (I), for which only intra prediction is allowed;
    predictive slices (P), for which intra prediction is allowed as well as inter prediction from one reference image per block using one motion vector and one reference index; and
    bi-predictive slices (B), for which intra prediction is allowed as well as inter prediction from one or two reference images per block using one or two motion vectors and one or two reference indexes.
A list of reference images is used for decoding predictive and bi-predictive slices. According to HEVC, two reference image lists denoted L0 and L1 are used. L0 list is used for decoding P and B slices while L1 list is used only for decoding B slices. These lists are set up for each slice to be decoded.
In a P slice, the image obtained from a base layer, also called ILR (Inter Layer Reference), is inserted at the end of the L0 list. In a B slice, ILR images are inserted at the end of both the L0 and L1 lists.
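The insertion rule can be sketched as a small list construction. Picture names are placeholders, and `build_reference_lists` is a hypothetical helper; the sketch only shows where ILR images land in each list.

```python
def build_reference_lists(slice_type, temporal_l0, temporal_l1, ilr_images):
    """Return (L0, L1) with ILR images appended at the end of the lists in use."""
    if slice_type == "I":
        return [], []                      # intra slices use no reference list
    l0 = list(temporal_l0) + list(ilr_images)
    if slice_type == "B":
        l1 = list(temporal_l1) + list(ilr_images)
    else:
        l1 = []                            # P slices only use L0
    return l0, l1

p_lists = build_reference_lists("P", ["EL_t1", "EL_t0"], ["EL_t3"], ["ILR_t2"])
b_lists = build_reference_lists("B", ["EL_t1", "EL_t0"], ["EL_t3"], ["ILR_t2"])
```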
By inserting ILR images in the lists, the image of the reference layer corresponding temporally to the image to encode, that may be interpolated (or up-sampled) if needed, becomes a potential reference image that can be used for temporal prediction. Accordingly, blocks of an inter layer reference (ILR) image can be used as predictor blocks in INTER mode.
In HEVC (and all its extensions, including SHVC and SCC extensions), the inter mode (“MODE_INTER”) and intra mode (“MODE_INTRA”) are prediction modes that are signalled in the bit-stream by a syntax element denoted “pred_mode_flag”. This syntax element takes the value 0 for the inter mode and the value 1 for the intra mode. This syntax element may be absent (e.g. for slices of the intra type where there is no block coded using the inter mode), in which case it is assumed to be 1. In addition, two sets of motion information (also called motion fields) are defined. They correspond to the reference image lists L0 and L1. Indeed, as mentioned above, a block predicted using “MODE_INTER” may use one or two motion vector predictors depending on the type of inter prediction.
Each motion vector predictor is obtained from an image belonging to a reference list. When two motion vector predictors are used to predict the same block (B slices, i.e. bi-predictive coding), the two motion vector predictors belong to two different lists. The syntax element “inter_pred_idc” allows identifying the lists involved in the prediction of a block. The values 0, 1 and 2 respectively mean that the block uses L0 alone, L1 alone, and both. When absent, it can be inferred to be L0 alone, which is the case for slices of P type.
Generally, L0 list of reference images contains images preceding the current image while L1 list contains images following the current image. However, in HEVC preceding and following images can appear in any list.
The motion information (motion field) contained in an INTER block for one list consists of the following information:
    an availability flag denoted “predFlagLX” which indicates that no motion information is available when it is equal to 0;
    an index denoted “refIdxLX” for identifying an image in a list of reference images. The value −1 of this index indicates the absence of motion information; and,
    a motion vector that has two components: a horizontal motion vector component denoted “mvLX[0]” and a vertical motion vector component denoted “mvLX[1]”. It corresponds to a spatial displacement in terms of pixels between the current block and the temporal predictor block in the reference image.
wherein the suffix “LX” of each syntax element takes the value “L0” or “L1”.
A block of the inter type is therefore associated with two motion fields.
As a consequence, the standard specification implies the following situations:
    for a block of the intra type:
        “pred_mode_flag” is set to 1 (MODE_INTRA);
        for each of the L0 and L1 lists:
            “predFlagLX” is set to 0;
            “refIdxLX” is set to −1; and
            “mvLX[0]” and “mvLX[1]” should not be used because of the values of “predFlagLX” and “refIdxLX”.
    for a block of the inter type using only the L0 list:
        “pred_mode_flag” is set to 0 (MODE_INTER);
        L0 list motion information:
            “predFlagL0” is set to 1;
            “refIdxL0” indicates a reference image in the L0 list in the DPB;
            “mvL0[0]” and “mvL0[1]” are set to the corresponding motion vector values.
        L1 list motion information:
            “predFlagL1” is set to 0;
            “refIdxL1” is set to −1; and
            “mvL1[0]” and “mvL1[1]” should not be used because of the values of “predFlagL1” and “refIdxL1”.
    for a block of the inter type using only the L1 list: motion information is similar to motion information for a block of the inter type using only the L0 list except that L0 and L1 are swapped.
    for a block of the inter type using both the L0 and L1 lists (i.e. slices of the B type):
        “pred_mode_flag” is set to 0 (MODE_INTER);
        for each of the L0 and L1 lists:
            “predFlagLX” is set to 1;
            “refIdxLX” indicates a reference image in the corresponding L0 or L1 list in the DPB;
            “mvLX[0]” and “mvLX[1]” are set to the corresponding motion vector values.
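The four situations listed above can be captured in a small data structure. This is a sketch only: the class and function names (`MotionField`, `Block`, `intra_block`, `inter_block`, `inter_pred_idc`) are hypothetical, and the fields merely mirror the elements “pred_mode_flag”, “predFlagLX”, “refIdxLX” and “mvLX” described in the text.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

MODE_INTER, MODE_INTRA = 0, 1  # values of pred_mode_flag

@dataclass
class MotionField:
    pred_flag: int = 0                    # predFlagLX
    ref_idx: int = -1                     # refIdxLX
    mv: Optional[Tuple[int, int]] = None  # (mvLX[0], mvLX[1]), unused when pred_flag is 0

@dataclass
class Block:
    pred_mode_flag: int
    l0: MotionField = field(default_factory=MotionField)
    l1: MotionField = field(default_factory=MotionField)

def intra_block() -> Block:
    """Intra block: both motion fields are marked as unavailable."""
    return Block(MODE_INTRA)

def inter_block(l0=None, l1=None) -> Block:
    """Inter block; l0 and l1 are optional (ref_idx, mv) pairs."""
    if l0 is None and l1 is None:
        raise ValueError("an inter block needs motion in L0, L1, or both")

    def make(motion):
        if motion is None:
            return MotionField()
        return MotionField(1, motion[0], motion[1])

    return Block(MODE_INTER, make(l0), make(l1))

def inter_pred_idc(block: Block) -> int:
    """For an inter block: 0 for L0 alone, 1 for L1 alone, 2 for both lists."""
    return {(1, 0): 0, (0, 1): 1, (1, 1): 2}[(block.l0.pred_flag, block.l1.pred_flag)]
```

For instance, `inter_block(l0=(0, (3, -1)))` yields a block whose L1 field keeps the default "no motion" values, which matches the L0-only situation.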
As already stated, motion information is coded using a predictive coding in HEVC. One particularity of the prediction of motion information in HEVC is that a plurality of motion information predictors is derived from blocks neighbouring the block to encode and one best predictor is selected in this set, the selection being based on a rate-distortion criterion. Another particularity of the approach adopted by HEVC is that these derived predictors can comprise motion information not only from spatially neighbouring blocks but also from temporally neighbouring blocks.
FIG. 3 represents schematically a spatially scalable video sequence compliant with SHVC. For the sake of illustration, it comprises only two layers, for example a reference layer and an enhancement layer, denoted RL and EL. The first layer RL is compliant with HEVC. EL layer uses the same prediction scheme as described in the SHVC draft specifications. As can be seen in FIG. 3, the image of the first layer at time t2, denoted (RL, t2), has been inserted in the image buffer of EL layer after being up-sampled so as to be of the same size as the image of the EL layer. Therefore, this ILR image can be used to provide a temporal predictor to the block denoted BEL belonging to the image of the second layer at time t2, denoted (EL, t2). This predictor is identified by motion information comprising a motion vector. For the sake of illustration, the motion vector is equal to (0, 0) since the block to predict and the predictor are collocated.
It is to be noted that a similar concept is used in the MV-HEVC (for multi-views) and 3D-HEVC extensions: instead of an ILR image, with potential resampling, the corresponding images in other views may be added to the reference picture lists in a way similar to the one described by reference to FIG. 3.
SHVC provides a method for deriving motion information of an ILR image to be inserted in the motion part of the decoded picture buffer of an enhancement layer.
FIG. 4 illustrates steps of a method for deriving motion information from two images: one image of the enhancement layer and one image of the reference layer corresponding to an image to be encoded of the enhancement layer.
The process starts when an image of the enhancement layer is to be encoded.
During an initialization step (step 400), the image of the reference layer, denoted refRL, corresponding to the image to be encoded is identified to be stored in the image buffer as the ILR. If necessary, the image refRL is up-sampled (if the reference and enhancement layers have different spatial resolutions) before being stored as the ILR. In addition, during this initialization step, a first block of 16×16 pixels of the ILR image is identified.
Next, the position of the centre of the identified 16×16 block is determined (step 405). The determined centre is used to determine the collocated position in the identified image refRL of the reference layer (step 415). The determined collocated position is used in the following to identify respectively a block bEL of the ILR image and a block bRL of the reference layer image refRL that can provide motion information to the ILR image.
Information representative of the first motion information (motion field corresponding to the first list (L0 or L1)) associated with the identified block bRL is then obtained (step 420).
Then, a first test is performed (step 430) to verify the availability of the bRL block at the collocated position found in step 415. If no block is available at that position, the current 16×16 block of the ILR image is marked as having no motion information in list LX (step 435), for instance by setting the flag “predFlagLX” to 0 and the index “refIdxLX” to −1. Next, the process proceeds to step 440 which is detailed hereafter.
On the contrary, if it is determined that the bRL block in the reference layer is available at the position collocated with the centre (step 430), the mode of the bRL block is identified. If it is determined (step 445) that this mode is “MODE_INTRA”, the ILR motion field is set to have no motion information (step 435) and the process proceeds to step 440.
If the bRL block of the reference layer is not encoded according to the intra mode but using the inter mode (step 445), the current motion field of the current 16×16 block of the ILR image takes the values of the first motion field of the bRL block of the reference image identified in step 415 (steps 450 and 455):
    “predFlagLXILR” = “predFlagLXRL”;
    “refIdxLXILR” = “refIdxLXRL”;
    “mvLXILR[0]” = “mvLXRL[0]”;
    “mvLXILR[1]” = “mvLXRL[1]”;
wherein X is equal to 0 or 1 for list L0 or list L1, respectively, and where “mvLXILR[0]”, “mvLXRL[0]”, “mvLXILR[1]”, and “mvLXRL[1]” represent vector components. It is to be noted that a scaling factor may be applied to the motion vector of the reference layer during step 455 if the reference and enhancement layers have different spatial resolutions.
Next, a test is carried out to determine whether or not the current field is the last field of the block identified in the image of the reference layer. If the current field is the last field of the block identified in the image of the reference layer, the process proceeds to step 460 that is described hereafter. On the contrary, if the current field is not the last field of the block identified in the image of the reference layer, the second motion field of the block identified in the image of the reference layer is obtained (step 465) and the process is branched to step 430 to process the second motion field. It is to be noted that for the second motion field, tests 430 and 445 may be carried out differently (e.g. by using previously stored results of these tests) since this information has already been obtained when processing the first motion field.
Next, if not all the blocks of the current image to be encoded have been processed (step 460), the following 16×16 block is selected (step 490) and the process is repeated.
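The derivation described by reference to FIG. 4 can be summarized by the following sketch. It is a simplification under stated assumptions: `rl_block_at` is a hypothetical accessor to the blocks of refRL, motion fields are plain (predFlagLX, refIdxLX, mvLX) tuples, and the motion vector scaling is a plain ratio of the picture sizes with floor rounding, which is not the exact scaling of the SHVC draft.

```python
NO_MOTION = (0, -1, None)  # (predFlagLX, refIdxLX, mvLX)

def derive_ilr_motion(ilr_size, rl_size, rl_block_at, block=16):
    """Return {(bx, by): [field_L0, field_L1]} for each 16x16 block of the ILR image.

    rl_block_at(x, y) returns None when no block of refRL is available at
    (x, y), else a dict with keys "intra" (bool) and "fields" (two tuples).
    """
    (ilr_w, ilr_h), (rl_w, rl_h) = ilr_size, rl_size
    motion = {}
    for by in range(0, ilr_h, block):
        for bx in range(0, ilr_w, block):
            cx, cy = bx + block // 2, by + block // 2        # step 405: centre
            rx, ry = cx * rl_w // ilr_w, cy * rl_h // ilr_h  # step 415: collocated position
            b_rl = rl_block_at(rx, ry)
            fields = []
            for x in (0, 1):                                 # list L0, then list L1
                if b_rl is None or b_rl["intra"]:            # tests 430 and 445
                    fields.append(NO_MOTION)                 # step 435
                    continue
                pred_flag, ref_idx, mv = b_rl["fields"][x]   # steps 450 and 455
                if mv is not None:                           # simplified scaling
                    mv = (mv[0] * ilr_w // rl_w, mv[1] * ilr_h // rl_h)
                fields.append((pred_flag, ref_idx, mv))
            motion[(bx, by)] = fields
    return motion

def toy_rl_block_at(x, y):
    """Toy 16x16 reference layer: inter on the left half, intra top-right, nothing bottom-right."""
    if x < 8:
        return {"intra": False, "fields": [(1, 0, (3, -2)), NO_MOTION]}
    if y < 8:
        return {"intra": True, "fields": [NO_MOTION, NO_MOTION]}
    return None

# 2x spatial scalability: a 16x16 reference layer up-sampled to a 32x32 ILR image.
ilr_motion = derive_ilr_motion((32, 32), (16, 16), toy_rl_block_at)
```

In this toy setup the top-left ILR block inherits the L0 motion of the reference layer with a doubled motion vector, while the blocks collocated with an intra block or with no block carry no motion information.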
FIG. 5 illustrates an example of splitting a Coding Tree Block into Coding Units and an exemplary scan order to sequentially process the Coding Units.
It is to be recalled that in the HEVC standard, the block structure is organized by Coding Tree Blocks (CTBs). A picture contains several non-overlapping square Coding Tree Blocks, arranged in rows. The size of a Coding Tree Block can range from 16×16 pixels to 64×64 pixels. The size is determined at the sequence level. The most efficient size, in terms of coding efficiency, is the largest one, that is to say 64×64. It is to be noted that all Coding Tree Blocks have the same size except the ones located on the image border. The size of the boundary CTBs is adapted according to the number of remaining pixels.
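The effect of the picture border on the CTB grid can be illustrated with a short computation. The 1920×1080 picture size is chosen for illustration only, and `ctb_grid` is a hypothetical helper.

```python
def ctb_grid(width, height, ctb_size=64):
    """Return (columns, rows, width of last column, height of last row)."""
    cols = -(-width // ctb_size)   # ceiling division
    rows = -(-height // ctb_size)
    last_w = width - (cols - 1) * ctb_size
    last_h = height - (rows - 1) * ctb_size
    return cols, rows, last_w, last_h

grid = ctb_grid(1920, 1080)
```

Here the width is a multiple of 64, so the last column is full, whereas the last row only covers the 56 remaining lines of pixels.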
Each Coding Tree Block contains one or more square Coding Units (CU). Each Coding Tree Block is split into several Coding Units based on a quad-tree structure. The processing order of each Coding Unit in the Coding Tree Block, for coding or decoding the corresponding CTB, follows the quad-tree structure based on a raster scan order. FIG. 5 shows an example of the processing order of Coding Units generically referenced 500 belonging to one Coding Tree Block 505. The number indicated in each Coding Unit gives the processing order of each corresponding Coding Unit 500 of Coding Tree Block 505.
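The processing order can be sketched as a recursive traversal in which the four children of each split node are visited in raster order. The split decision is supplied by a hypothetical callback `is_split`, and the particular split pattern below is illustrative; it is not the one shown in FIG. 5.

```python
def cu_processing_order(x, y, size, is_split):
    """Yield (x, y, size) for each Coding Unit in processing order."""
    if is_split(x, y, size):
        half = size // 2
        for dy in (0, half):      # children visited in raster order:
            for dx in (0, half):  # top-left, top-right, bottom-left, bottom-right
                yield from cu_processing_order(x + dx, y + dy, half, is_split)
    else:
        yield (x, y, size)

def example_split(x, y, size):
    """Split the 64x64 CTB once, then split only its top-right 32x32 quadrant."""
    return size == 64 or (size == 32 and (x, y) == (32, 0))

cu_order = list(cu_processing_order(0, 0, 64, example_split))
```

The four 16×16 units of the top-right quadrant are all processed before the bottom-left 32×32 unit, since a quadrant is completed before moving to the next one.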
In HEVC as in the previous standard H.264/AVC, the temporal prediction signal can be weighted by a weight in order, for instance, to better deal with fading or cross-fading images. Another use may be to partially correct mismatch between the colour spaces of an enhancement layer and of the reference layer providing pixel data. Weighted prediction modes are therefore specified to make it possible to weight the predictions based on the reference images. Weighted prediction may be used in uni-prediction (slices of the P type) and bi-prediction (slices of the B type). These modes may apply to any layer in case of scalability.
In HEVC, as in previous standards, in the uni-prediction case, a weighting factor denoted w0 and an offset denoted o0 may be computed from information encoded in the slice header. Conceptually, the prediction signal denoted PRED is defined by the following equation:
PRED = MC[REF0, MV0] * w0 + o0
where REF0 is a reference picture, MV0 the motion vector and MC the motion compensation operation. Here, rounding aspects are not taken into account.
In HEVC, as in previous standards, in the bi-prediction case, two weighting factors denoted w0 and w1 and two offsets denoted o0 and o1 are computed from information in the slice header. Conceptually, the prediction signal is defined by the following simplified equation where rounding aspects are not taken into account:
PRED = (MC[REF0, MV0] * w0 + o0 + MC[REF1, MV1] * w1 + o1) / 2
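The two conceptual equations for uni-prediction and bi-prediction can be transcribed directly. As in the text, rounding and clipping are ignored; the motion-compensated samples are given as flat lists of illustrative values, and the function names are hypothetical.

```python
def weighted_uni(mc0, w0, o0):
    """PRED = MC[REF0, MV0] * w0 + o0, applied sample by sample."""
    return [s * w0 + o0 for s in mc0]

def weighted_bi(mc0, w0, o0, mc1, w1, o1):
    """PRED = (MC[REF0, MV0] * w0 + o0 + MC[REF1, MV1] * w1 + o1) / 2."""
    return [(s0 * w0 + o0 + s1 * w1 + o1) / 2 for s0, s1 in zip(mc0, mc1)]

mc0 = [100, 50]  # motion-compensated samples from REF0
mc1 = [60, 70]   # motion-compensated samples from REF1

plain_average = weighted_bi(mc0, 1, 0, mc1, 1, 0)  # default weights and offsets
faded = weighted_uni(mc0, 0.5, 10)                 # e.g. compensating a fade
```

With unit weights and zero offsets, bi-prediction reduces to the average of the two motion-compensated signals.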
Turning back to table 1 in the Appendix, signalling of the weighted prediction information is explained. Firstly, it is to be noted that there is a different set of parameters for luma and chroma. It is also to be noted that the weights have fractional precision determined by the denominators denoted luma_log2_weight_denom and chroma_log2_weight_denom. For each reference image in the lists L0 and L1, flags luma_weight_lX_flag and chroma_weight_lX_flag (with X being equal to 0 or 1) may be present to signal whether explicit parameters (weights w0 and w1, offsets o0 and o1) are present for luma and chroma, respectively. If the flags are not present, they are assumed to be 0, meaning that default values for other syntax elements are assumed: a weight of 1 (in fractional representation) and an offset of 0, resulting in the weighted prediction being equal to the prediction of motion compensation. These flags are absent for the current picture when it is used as a reference picture (as can be seen from the check: “if(PicOrderCnt(RefPicList0[i]) !=PicOrderCnt(CurrPic))”).
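The fractional representation of the weights can be illustrated as follows: with a denominator of 2^d, a weight of 1 is stored as the integer 2^d. The sketch assumes this integer representation and ignores rounding; `effective_parameters` and `apply_weight` are hypothetical helpers, and the numeric values are illustrative.

```python
def effective_parameters(flag, log2_denom, weight=None, offset=None):
    """Return (weight, offset): explicit values if flagged, defaults otherwise."""
    if flag:
        return weight, offset
    return 1 << log2_denom, 0  # weight of 1 in fractional representation, offset 0

def apply_weight(sample, weight, offset, log2_denom):
    """Weighted sample, ignoring rounding aspects."""
    return sample * weight / (1 << log2_denom) + offset

default_w, default_o = effective_parameters(0, log2_denom=6)
explicit_w, explicit_o = effective_parameters(1, log2_denom=6, weight=32, offset=5)
```

With the default parameters the sample is returned unchanged, which corresponds to the weighted prediction being equal to the plain motion-compensated prediction.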
Although such solutions have proven to be efficient, there is a continuous need for optimizing image encoding and decoding, in order to improve quality and/or efficiency, in particular by making it possible to combine the efficient tools provided.