With video coding technologies, it is often desired to compress a video sequence into a coded video sequence. The video sequence may for example have been captured by a video camera. A purpose of compressing the video sequence is to reduce the size, e.g. in bits, of the video sequence. In this manner, the coded video sequence will require less memory when stored and/or less bandwidth when transmitted from e.g. the video camera. A so-called encoder is often used to perform compression, or encoding, of the video sequence. Hence, the video camera may comprise the encoder. The coded video sequence may be transmitted from the video camera to a display device, such as a television set (TV) or the like. In order for the TV to be able to decompress, or decode, the coded video sequence, it may comprise a so-called decoder. This means that the decoder is used to decode the received coded video sequence. In other scenarios, the encoder may be comprised in a radio base station of a cellular communication system and the decoder may be comprised in a wireless device, such as a cellular phone or the like, and vice versa.
A known video coding technology is called High Efficiency Video Coding (HEVC), which is a new video coding standard, currently being developed by the Joint Collaborative Team on Video Coding (JCT-VC). JCT-VC is a collaborative project between the Moving Picture Experts Group (MPEG) and the International Telecommunication Union's Telecommunication Standardization Sector (ITU-T).
HEVC is a hybrid codec that uses multiple reference pictures for inter prediction. HEVC includes a picture marking process in which reference pictures can be marked as “used for short-term reference”, “used for long-term reference” or “unused for reference”. Once a picture is marked “unused for reference”, it can no longer be used for inter prediction. The marking process in HEVC is controlled by Reference Picture Sets (RPSs). An RPS is a set of picture identifiers that identifies reference pictures. The set is sent in each slice, and reference pictures are kept in the Decoded Picture Buffer (DPB) if they are present in the RPS. A slice is a spatially distinct region of a frame that is encoded independently from any other region in the same frame. The RPS part of the slice segment header syntax is shown in Table 1.
Pictures in HEVC are identified by their Picture Order Count (POC) values, also known as full POC values. Each slice contains a code word, pic_order_cnt_lsb, that shall be the same for all slices in a picture. pic_order_cnt_lsb is also known as the least significant bits (lsb) of the full POC, since it is a fixed-length code word and only the least significant bits of the full POC are signaled. Both encoder and decoder keep track of POC wrap-around so that full POC values can be assigned to each picture that is encoded/decoded.
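As an illustration of how wrap-around tracking lets a decoder recover full POC values from the signaled lsb, the following minimal Python sketch may be helpful. It is modeled on the commonly used msb derivation rule, simplified by assuming that every picture serves as the reference point for the next one; the function names full_poc and track are hypothetical and not part of HEVC.

```python
def full_poc(poc_lsb, prev_lsb, prev_msb, max_lsb):
    """Return (full POC, msb part) for a picture, given its signaled lsb.

    prev_lsb and prev_msb belong to the previous picture in decoding order
    (a simplifying assumption of this sketch).
    """
    if poc_lsb < prev_lsb and prev_lsb - poc_lsb >= max_lsb // 2:
        msb = prev_msb + max_lsb   # lsb wrapped forward past max_lsb - 1
    elif poc_lsb > prev_lsb and poc_lsb - prev_lsb > max_lsb // 2:
        msb = prev_msb - max_lsb   # lsb wrapped backward past 0
    else:
        msb = prev_msb             # no wrap-around detected
    return msb + poc_lsb, msb


def track(lsbs, max_lsb=256):
    """Assign full POC values to a sequence of lsb values in decoding order."""
    prev_lsb, prev_msb, out = 0, 0, []
    for lsb in lsbs:
        poc, prev_msb = full_poc(lsb, prev_lsb, prev_msb, max_lsb)
        prev_lsb = lsb
        out.append(poc)
    return out


print(track([0, 100, 200, 10, 120]))  # [0, 100, 200, 266, 376]
```

With 8 bits for the lsb, the jump from lsb 200 to lsb 10 is interpreted as a wrap-around, so the fourth picture is assigned full POC 266 rather than 10.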
Short-term pictures are indicated in the RPS through a pair of numbers, the POC of the reference picture and a flag: used_by_curr_pic_lx_flag. The decoder knows the POC of the reference pictures in the DPB and can match those against the POC values received in the RPS. The flag used_by_curr_pic_lx_flag indicates whether the reference picture is used for reference for the current picture or not.
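The matching of DPB pictures against short-term RPS entries can be sketched as follows. This is only an illustrative Python model under the simplifying assumption that each RPS entry is a (POC, flag) pair as described above; the function name apply_short_term_rps and the data layout are hypothetical.

```python
def apply_short_term_rps(dpb_pocs, rps):
    """Model the effect of a short-term RPS on the DPB.

    dpb_pocs: set of POC values of the pictures currently in the DPB.
    rps: list of (poc, used_by_curr_pic_flag) pairs.
    Returns (kept, usable): the POCs that stay in the DPB, and the subset
    that may be used for reference by the current picture.
    """
    rps_pocs = {poc for poc, _ in rps}
    # Pictures not listed in the RPS are no longer kept as references.
    kept = sorted(p for p in dpb_pocs if p in rps_pocs)
    # Listed pictures with the flag set may be referenced by the current picture.
    usable = sorted(poc for poc, used in rps if used and poc in dpb_pocs)
    return kept, usable


kept, usable = apply_short_term_rps({8, 12, 16}, [(12, True), (16, False)])
print(kept, usable)  # [12, 16] [12]
```

In this example, the picture with POC 8 is absent from the RPS and is therefore dropped, while the picture with POC 16 is kept for later pictures but is not used by the current one.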
Long-term pictures are generally indicated in the RPS by the least significant bits (lsb) part of the POC value of the reference picture. However, the HEVC standard has an optional codeword, delta_poc_msb_cycle_lt_minus1, which provides an alternative way of referencing long-term pictures. The long-term picture part of the HEVC slice header syntax is shown at the end of Table 1.
TABLE 1
RPS slice header syntax

slice_segment_header( ) {                                           Descriptor
 ...
 if( !IdrPicFlag ) {
  pic_order_cnt_lsb                                                 u(v)
  short_term_ref_pic_set_sps_flag                                   u(1)
  if( !short_term_ref_pic_set_sps_flag )
   short_term_ref_pic_set( num_short_term_ref_pic_sets )
  else
   short_term_ref_pic_set_idx                                       u(v)
  if( long_term_ref_pics_present_flag ) {
   if( num_long_term_ref_pics_sps > 0 )
    num_long_term_sps                                               ue(v)
   num_long_term_pics                                               ue(v)
   for( i = 0; i < num_long_term_sps + num_long_term_pics; i++ ) {
    if( i < num_long_term_sps )
     lt_idx_sps[ i ]                                                u(v)
    else {
     poc_lsb_lt[ i ]                                                u(v)
     used_by_curr_pic_lt_flag[ i ]                                  u(1)
    }
    delta_poc_msb_present_flag[ i ]                                 u(1)
    if( delta_poc_msb_present_flag[ i ] )
     delta_poc_msb_cycle_lt[ i ]                                    ue(v)
   }
  }
If delta_poc_msb_present_flag is equal to 0, the long-term picture is indicated by the lsb part of its POC only. If delta_poc_msb_present_flag is equal to 1, the long-term picture is indicated by the full POC, i.e. the lsb part of the POC and a POC msb cycle used to calculate the msb part of the POC. Setting delta_poc_msb_present_flag equal to 1 thus allows two long-term pictures to share the same POC lsb. The HEVC standard currently mandates that delta_poc_msb_present_flag shall be equal to 1 whenever there are at least two reference pictures in the DPB with the same POC lsb. This is expressed by the following sentence in the draft HEVC specification, where PocLsbLt is a list that holds the POC lsb of each of the long-term pictures in the RPS:
delta_poc_msb_present_flag[i] shall be equal to 1 when there is more than one reference picture in the decoded picture buffer with picture order count modulo MaxPicOrderCntLsb equal to PocLsbLt[i].
This restriction says that when a long-term picture is indicated by an RPS and there is more than one reference picture in the decoded picture buffer with the same POC lsb as that long-term picture, the long-term picture indication shall include the signaling of the POC msb cycle, i.e. delta_poc_msb_present_flag shall be equal to 1 for that long-term picture indication in the RPS.
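The restriction can be expressed as a simple check over the POC values in the DPB. The following Python sketch is an illustration only; the function name msb_flag_required and the representation of the DPB as a list of full POC values are assumptions made for this example.

```python
def msb_flag_required(dpb_pocs, poc_lsb_lt, max_pic_order_cnt_lsb):
    """Return True when delta_poc_msb_present_flag must be equal to 1.

    dpb_pocs: full POC values of the reference pictures in the DPB.
    poc_lsb_lt: POC lsb of the long-term picture indicated in the RPS.
    """
    # Count reference pictures whose POC modulo MaxPicOrderCntLsb equals
    # the signaled lsb; more than one match makes the lsb alone ambiguous.
    matches = [p for p in dpb_pocs if p % max_pic_order_cnt_lsb == poc_lsb_lt]
    return len(matches) > 1


print(msb_flag_required([0, 256, 257], 0, 256))  # True
print(msb_flag_required([0, 257], 0, 256))       # False
```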
An HEVC bitstream consists of one or more Coded Video Sequences (CVS). A coded video sequence starts, in decoding order, with a first picture that has a picture type that does not use any other picture for prediction and for which all pictures that are present in the DPB are marked “unused for reference” so that no picture in a CVS uses pictures in another CVS for reference. A CVS consists of a series of access units that are sequential in a NAL unit stream, see below, and use only one Sequence Parameter Set (SPS). SPS is defined as a special type of NAL unit, e.g. SPS_NUT. The SPS contains information that is valid for an entire coded video sequence such as picture size or cropping window parameters that are applied to pictures when they are output from the decoder.
HEVC defines temporal sub-layers. For each picture, the variable TemporalId, calculated from the syntax element nuh_temporal_id_plus1 in the NAL unit header, indicates which temporal sub-layer the picture belongs to. A lower temporal sub-layer cannot depend on a higher temporal sub-layer, and a sub-bitstream extraction process requires that when one or more of the highest temporal sub-layers are removed from a bitstream, the remaining bitstream shall be a conforming bitstream. As an example, lower temporal sub-layers may be associated with a display rate, or bit rate, that is lower than a display rate, or a bit rate, corresponding to a higher temporal sub-layer. It shall be understood that temporal sub-layers enable sub-bitstream extraction by only looking at NAL unit headers; it is not necessary to decode other parts of the bitstream.
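Sub-bitstream extraction based on temporal sub-layers can be sketched as a filter over NAL unit headers. In the following Python illustration, the NalUnit tuple and the function names are hypothetical, the payload is treated as opaque bytes, and TemporalId is taken to be nuh_temporal_id_plus1 minus 1.

```python
from collections import namedtuple

# Hypothetical, simplified NAL unit: one header field plus an opaque payload.
NalUnit = namedtuple("NalUnit", "nuh_temporal_id_plus1 payload")


def temporal_id(nal):
    """Derive TemporalId from the header field alone."""
    return nal.nuh_temporal_id_plus1 - 1


def extract_sub_bitstream(nal_units, highest_tid):
    """Drop all NAL units above the target temporal sub-layer.

    Only the header field is inspected; the payload is never parsed.
    """
    return [n for n in nal_units if temporal_id(n) <= highest_tid]


stream = [NalUnit(1, b"a"), NalUnit(2, b"b"), NalUnit(3, b"c"), NalUnit(1, b"d")]
kept = extract_sub_bitstream(stream, highest_tid=1)
print([n.payload for n in kept])  # [b'a', b'b', b'd']
```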
In HEVC, the encoded pictures are encapsulated in one or more Network Abstraction Layer (NAL) units, forming part of an access unit. NAL units are classified as Video Coding Layer (VCL) NAL units or non-VCL NAL units, such as the above-mentioned SPS, according to whether they contain coded picture samples or other associated data, respectively. In the HEVC standard, all VCL NAL units of the same picture are required to have the same NAL unit type, which indicates properties of the encoded picture and may affect the decoding process. The NAL unit types TRAIL_N, TSA_N, STSA_N, RASL_N and RADL_N are used to indicate that the picture is not used for reference by any picture of the same temporal sub-layer. In this text, those pictures are referred to as Non-Reference Temporal Sub-Layer (NRTSL) pictures. The NAL unit types RSV_VCL_N10, RSV_VCL_N12 and RSV_VCL_N14 are reserved for use in future versions of the HEVC specification, but they are already required to have the properties of NRTSL pictures and may thus be considered to be NRTSL pictures, even though these NAL unit types are not yet allowed in conforming bitstreams. All other picture types are in this text referred to as Reference Temporal Sub-Layer (RTSL) pictures.
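The NRTSL/RTSL classification used in this text amounts to a lookup on the NAL unit type. A minimal Python sketch is given below; NAL unit types are represented by their names rather than numeric values, and the function name is_nrtsl is hypothetical.

```python
# NAL unit types that this text treats as NRTSL pictures, including the
# reserved types that are required to have NRTSL properties.
NRTSL_TYPES = frozenset({
    "TRAIL_N", "TSA_N", "STSA_N", "RASL_N", "RADL_N",
    "RSV_VCL_N10", "RSV_VCL_N12", "RSV_VCL_N14",
})


def is_nrtsl(vcl_nal_unit_type):
    """Return True for an NRTSL picture; any other picture type is RTSL."""
    return vcl_nal_unit_type in NRTSL_TYPES


print(is_nrtsl("TRAIL_N"))  # True
print(is_nrtsl("TRAIL_R"))  # False
```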
For an NRTSL picture X, if the temporal sub-layer that X belongs to is the highest temporal sub-layer that is decoded, it should be possible to remove X, i.e. all the NAL units carrying the picture X, from the bitstream without affecting decodability of the remaining stream. However, when the picture Y that follows X in decoding order is to be decoded, the DPB might contain different pictures depending on whether X was decoded or discarded. It might be the case that, when picture X is removed, the DPB contains two long-term reference pictures with the same POC lsb when Y is decoded, whereas this would not have been the case if X had been received. Therefore, the encoder may have used delta_poc_msb_present_flag equal to 0 in a case where that is permitted when X is present in the bitstream but violates the previously mentioned constraint when X has been removed. The decoding process for this case is undefined. Thus, for this case it is not possible to remove X from the bitstream without affecting decodability of the remaining stream.
The same situation can occur when removing individual pictures from higher temporal sub-layers.
Consider the following example:
8 bits are used for pic_order_cnt_lsb. This means that POC lsb values are in the range of 0 to 255, inclusive. The POC of the picture X is 257, and pictures with POC 0 and 256 are both long-term pictures that are present in the DPB marked as “used for long-term reference”. Both of these long-term reference pictures have POC lsb equal to 0. Assume that only the picture with POC 0 is present in the RPS of picture X, i.e. the picture with POC 256 shall be removed from the DPB. HEVC contains the restriction:
delta_poc_msb_present_flag[i] shall be equal to 1 when there is more than one reference picture in the decoded picture buffer with picture order count modulo MaxPicOrderCntLsb equal to PocLsbLt[i].
Thus, picture X must signal delta_poc_msb_present_flag equal to 1 for the picture with POC 0.
When the RPS of picture X has been decoded there will only be one picture in the DPB with POC lsb equal to 0.
Assume that picture Y follows X in decoding order, has POC 258, and indicates in its RPS that the long-term picture with POC equal to 0 shall be kept as a reference picture in the DPB. When Y is decoded there will only be one picture in the DPB with POC lsb equal to 0. Thus, it is not required that delta_poc_msb_present_flag is equal to 1 for that picture in the RPS of picture Y.
If X is an NRTSL picture in the same temporal sub-layer as Y, or if X is encoded in a higher temporal sub-layer than Y, then it should be possible to remove X without affecting the decodability of Y. However, if picture X is removed, there will be two long-term reference pictures in the DPB with POC lsb equal to 0 when Y is decoded. Since delta_poc_msb_present_flag is equal to 0 for the long-term reference picture with POC lsb equal to 0 in the RPS of Y, the decoding process does not define which of these pictures to keep as a reference picture in the DPB. Thus, the remaining bitstream is not decodable and the intent of the NRTSL picture type and temporal layering is broken.
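The example can be replayed with a small Python model of the long-term part of the RPS. This is a sketch under simplifying assumptions: the DPB is a set of full POC values, each long-term RPS entry is a hypothetical (POC lsb, full POC or None) pair where None models delta_poc_msb_present_flag equal to 0, and the undefined case is modeled by raising an exception.

```python
MAX_LSB = 256  # 8 bits are used for pic_order_cnt_lsb


def apply_lt_rps(dpb, lt_entries):
    """Keep only the long-term pictures indicated by the RPS entries."""
    kept = set()
    for lsb, full in lt_entries:
        if full is not None:
            # Flag equal to 1: the full POC identifies the picture uniquely.
            matches = {p for p in dpb if p == full}
        else:
            # Flag equal to 0: only the lsb is available for matching.
            matches = {p for p in dpb if p % MAX_LSB == lsb}
            if len(matches) > 1:
                raise ValueError(f"ambiguous POC lsb {lsb}: {sorted(matches)}")
        kept |= matches
    return kept


def decode(pictures, dpb):
    """Apply the RPS of each picture, in decoding order, to the DPB."""
    for _name, lt_entries in pictures:
        dpb = apply_lt_rps(dpb, lt_entries)
    return dpb


X = ("X", [(0, 0)])     # POC 257: keeps POC 0, signaled with the full POC
Y = ("Y", [(0, None)])  # POC 258: keeps POC lsb 0, signaled with the lsb only

print(decode([X, Y], {0, 256}))  # {0}
try:
    decode([Y], {0, 256})        # X removed from the bitstream
except ValueError as err:
    print(err)                   # ambiguous POC lsb 0: [0, 256]
```

With X present, the picture with POC 256 has already left the DPB when Y is decoded, so the lsb-only indication is unambiguous. With X removed, both long-term pictures match the same lsb and the model cannot decide which one to keep.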