High Efficiency Video Coding (HEVC) is a new video coding standard currently being developed by the Joint Collaborative Team on Video Coding (JCT-VC), a collaborative project between MPEG and ITU-T. Currently, a Committee Draft (CD) is defined that includes a number of new tools and is considerably more efficient than H.264/AVC.
A picture coded/decoded according to HEVC is partitioned into one or more slices, where each slice is an independently decodable segment of the picture. This means that if a slice is missing, for instance because it was lost during transmission, the other slices of that picture can still be decoded correctly. In order to make slices independent, they are self-contained and do not depend on each other, which implies that no bitstream element of another slice is required for decoding any element of a given slice.
Each slice contains a slice header which provides the data needed for the slice to be independently decodable. One example of a data element present in the slice header is the slice address, which tells the decoder the spatial location of the slice. There are many more data elements in the slice header.
HEVC uses previously decoded pictures for encoding and decoding a current picture. These previously decoded pictures are referred to as reference pictures. In a reference picture set (RPS), the encoder indicates to the decoder which reference pictures are allowed to be used for decoding. The previously decoded pictures are stored in a decoded picture buffer (DPB), and the RPS indicates which pictures in the DPB should be kept, i.e. which are allowed to be used as reference pictures, and which pictures in the DPB should be discarded, i.e. never be used for reference again. It should be noted that the encoder contains a copy of the decoder's DPB.
FIG. 1 illustrates a simplified scenario, where the pictures are distinguished by a picture order count (POC). In this case the RPS indicates POC 1 and POC 2, which implies that the reference pictures identified by POC 1 and POC 2 should be kept in the DPBs and the picture identified by POC 3 should be discarded unless it is still to be output for display.
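The FIG. 1 scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the normative process: the function name `prune_dpb`, the dictionary layout and the `pending_output` flag are hypothetical names introduced here.

```python
def prune_dpb(dpb, rps_pocs):
    """Return the DPB contents after an RPS has been applied.

    dpb:      dict mapping POC -> pending_output flag (hypothetical layout).
    rps_pocs: set of POC values listed in the RPS.

    A picture stays in the DPB if the RPS lists it, or if it is still
    waiting to be output for display. A picture kept only for output
    is not available as a reference picture.
    """
    return {poc: pending for poc, pending in dpb.items()
            if poc in rps_pocs or pending}


# FIG. 1 scenario: the DPB holds POC 1, 2 and 3; the RPS lists POC 1 and 2.
dpb = {1: False, 2: False, 3: False}
kept = prune_dpb(dpb, {1, 2})
print(sorted(kept))  # [1, 2]
```

If POC 3 had not yet been output, its flag would be `True` and it would remain in the buffer until displayed.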
The reference picture set (RPS) for each picture consists of five different lists (not shown in FIG. 1) of reference pictures, also referred to as the five RPS subsets:
RefPicSetStCurrBefore consists of all short-term reference pictures that are prior to the current picture in both decoding order and output order, and that are available for inter prediction of the current picture.
RefPicSetStCurrAfter consists of all short-term reference pictures that are prior to the current picture in decoding order, that succeed the current picture in output order, and that are made available for inter prediction of the current picture.
RefPicSetStFoll consists of all short-term reference pictures that are available for inter prediction of one or more of the pictures following the current picture in decoding order, and that are unavailable for inter prediction of the current picture.
RefPicSetLtCurr consists of all long-term reference pictures that are available for inter prediction of the current picture.
RefPicSetLtFoll consists of all long-term reference pictures that are available for inter prediction of one or more of the pictures following the current picture in decoding order, and that are unavailable for inter prediction of the current picture.
The RPS controls which pictures can be put in the reference picture lists. A picture that is put in a reference picture list may or may not be used for inter prediction, but for a picture to be used for inter prediction (used for reference) for the current picture, it must be included in a reference picture list, and therefore in one of the RPS lists denoted Curr. In short, the RPS first controls which reference pictures to keep in the DPB and which to discard: pictures listed in any of the five RPS lists are kept, and pictures that are not listed are discarded. The RPS also controls which pictures can be put in the reference picture lists L0 and L1. Pictures in the Foll lists cannot be put in L0 or L1, and L0 and L1 may contain only pictures from the Curr lists, possibly just some of them. Finally, the encoder chooses which reference pictures from L0 and L1 to use for inter prediction of each block. It may, for example, choose to use one reference picture for all blocks, even if L0 and L1 contain many pictures.
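The two roles of the RPS described above, DPB retention and eligibility for L0 and L1, can be illustrated with a small Python sketch. The class `RPS`, its field names and the helper `ref_list_is_allowed` are hypothetical, the POC values are made up, and the sketch does not reproduce the list initialization order of the draft specification.

```python
from dataclasses import dataclass, field


@dataclass
class RPS:
    """The five RPS subsets, each holding POC values (hypothetical layout)."""
    st_curr_before: list = field(default_factory=list)
    st_curr_after: list = field(default_factory=list)
    st_foll: list = field(default_factory=list)
    lt_curr: list = field(default_factory=list)
    lt_foll: list = field(default_factory=list)

    def kept_in_dpb(self):
        # Pictures listed in any of the five subsets are kept.
        return set(self.st_curr_before + self.st_curr_after + self.st_foll
                   + self.lt_curr + self.lt_foll)

    def list_candidates(self):
        # Only pictures in the Curr subsets may appear in L0 or L1.
        return set(self.st_curr_before + self.st_curr_after + self.lt_curr)


def ref_list_is_allowed(ref_list, rps):
    """True if every entry of a reference picture list comes from a Curr subset."""
    return all(poc in rps.list_candidates() for poc in ref_list)


# Made-up example for a current picture.
rps = RPS(st_curr_before=[4, 2], st_curr_after=[8], st_foll=[6], lt_curr=[0])
print(sorted(rps.kept_in_dpb()))        # [0, 2, 4, 6, 8]
print(sorted(rps.list_candidates()))    # [0, 2, 4, 8]
print(ref_list_is_allowed([4, 2], rps)) # True
print(ref_list_is_allowed([6], rps))    # False, POC 6 is only in a Foll subset
```

POC 6 illustrates the Foll case: it stays in the DPB for later pictures but cannot be referenced by the current one.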
The HEVC draft specification specifies that each picture shall belong to a temporal layer and that a syntax element called temporal_id shall be present for each picture in the bitstream, corresponding to the temporal layer the picture belongs to.
The temporal layers are ordered and have the property that a picture of a lower temporal layer never references a picture of a higher temporal layer. Thus, higher temporal layers can be removed without affecting the lower temporal layers. The removal of temporal layers can be referred to as temporal scaling. Removal of layers can be done in an entity that is neither an encoder nor a decoder, such as a network node. Such an entity can, for example, forward video bitstream packets from an encoder to a decoder and remove temporal layers without performing full video decoding on the incoming data.
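The layering property and the removal of higher layers can be sketched as follows. The `Picture` record and both function names are hypothetical and the example stream is made up; a real network node would operate on the temporal_id carried with each coded picture rather than on Python objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Picture:
    decode_idx: int      # position in decoding order
    temporal_id: int     # temporal layer of the picture
    refs: tuple = ()     # decode_idx of pictures used for inter prediction


def layering_is_valid(pictures):
    """True if no picture references a picture of a higher temporal layer."""
    tid = {p.decode_idx: p.temporal_id for p in pictures}
    return all(tid[r] <= p.temporal_id for p in pictures for r in p.refs)


def remove_layers(pictures, max_tid):
    """Drop every picture whose temporal_id exceeds max_tid."""
    return [p for p in pictures if p.temporal_id <= max_tid]


# Made-up stream with three temporal layers.
stream = [
    Picture(0, 0),
    Picture(1, 0, (0,)),
    Picture(2, 1, (0, 1)),
    Picture(3, 2, (2,)),
]
print(layering_is_valid(stream))                        # True
print([p.decode_idx for p in remove_layers(stream, 1)]) # [0, 1, 2]
```

Because the layering property holds, every reference of a remaining picture is itself still present after the removal.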
The resulting bitstream after one or more temporal layers have been removed is called a subsequence. In HEVC it is possible to signal that a picture is a temporal layer switching point, which indicates that at this picture it is possible for a decoder to start decoding more temporal layers than what was decoded before the switching point. The switching point indication guarantees that no picture following the switching point references a picture from before the switching point that might not have been decoded because it belongs to a higher temporal layer than what was decoded before the switching point. The switching points are therefore very useful for a layer removal entity in order to know when to stop removing a certain temporal layer and start forwarding it.
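How a layer removal entity might use switching points can be sketched as below. This is a conservative illustration rather than behavior mandated by the draft: the function name, the `(temporal_id, is_switching_point)` tuple layout and the rule of raising the forwarded range by one layer per switching point are all assumptions made here.

```python
def forward_with_switching(pictures, start_max_tid, target_max_tid):
    """Return the indices of the pictures that are forwarded.

    pictures: list of (temporal_id, is_switching_point) in decoding order.
    The node starts by forwarding layers 0..start_max_tid and wants to reach
    target_max_tid. It raises the highest forwarded layer by one only at a
    switching point that belongs to the next higher layer (assumed rule).
    """
    current_max = start_max_tid
    forwarded = []
    for idx, (tid, is_switch) in enumerate(pictures):
        if is_switch and tid == current_max + 1 and tid <= target_max_tid:
            current_max = tid          # stop removing this layer from here on
        if tid <= current_max:
            forwarded.append(idx)
    return forwarded


# Made-up stream: layer 1 is removed until the switching point at index 3.
stream = [(0, False), (1, False), (0, False), (1, True), (0, False), (1, False)]
print(forward_with_switching(stream, 0, 1))  # [0, 2, 3, 4, 5]
```

The layer 1 picture at index 1 is dropped because no switching point has been seen yet. Switching down needs no such indication and is not modeled in this sketch.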
An example is shown in FIG. 2, where the vertical axis represents the temporal layer and the horizontal axis represents output order. The numbers in the pictures represent decoding order. The arrows represent inter prediction.
Temporal switching can be performed at any point except at picture P6 (to picture P7) since P7 uses P5 for inter prediction.
HEVC defines four different picture types: instantaneous decoding refresh (IDR), clean random access (CRA), temporal layer access (TLA) and regular pictures (non-IDR, non-CRA and non-TLA).
IDR and CRA pictures must have temporal_id equal to 0. TLA pictures must have temporal_id greater than 0.
The TLA picture type is used to define a temporal layer switching point and is currently defined as:
temporal layer access (TLA) picture: A coded picture for which each slice has nal_unit_type equal to 3; the TLA picture and all coded pictures with temporal_id greater than or equal to the temporal_id of the TLA picture that follow the TLA picture in decoding order shall not use inter prediction from any picture with temporal_id greater than or equal to the temporal_id of the TLA picture that precedes the TLA picture in decoding order.
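The constraint in this definition can be expressed as a small check. The sketch below is illustrative only: the `Picture` record, the function name and the example stream are hypothetical, and the check looks only at pictures actually used for inter prediction, as the definition is worded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Picture:
    decode_idx: int      # position in decoding order
    temporal_id: int     # temporal layer of the picture
    refs: tuple = ()     # decode_idx of pictures used for inter prediction


def tla_constraint_holds(pictures, tla_idx):
    """Check the TLA definition for the picture at decode_idx == tla_idx.

    The TLA picture and every later picture with temporal_id >= the TLA's
    temporal_id must not predict from an earlier picture whose temporal_id
    is >= the TLA's temporal_id.
    """
    by_idx = {p.decode_idx: p for p in pictures}
    tla_tid = by_idx[tla_idx].temporal_id
    for p in pictures:
        if p.decode_idx < tla_idx or p.temporal_id < tla_tid:
            continue                   # the constraint does not cover this picture
        for r in p.refs:
            ref = by_idx[r]
            if ref.decode_idx < tla_idx and ref.temporal_id >= tla_tid:
                return False
    return True


# Made-up stream; the picture with decode_idx 3 is treated as the TLA picture.
good = [Picture(0, 0), Picture(1, 1, (0,)), Picture(2, 0, (0,)),
        Picture(3, 1, (2,)), Picture(4, 1, (3,))]
bad = good[:4] + [Picture(4, 1, (1,))]   # picture 4 predicts across the TLA

print(tla_constraint_holds(good, 3))  # True
print(tla_constraint_holds(bad, 3))   # False
```

In the failing case, picture 4 follows the TLA picture but predicts from picture 1, which precedes the TLA picture and lies in the same temporal layer.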
According to the current HEVC specification, it is allowed to include a reference picture from the same or a higher temporal layer in the reference picture set of a TLA picture. It is also allowed to include a reference picture from the same temporal layer in the reference picture lists of the TLA picture, as long as it is not used for inter prediction. However, if a media-aware network element (MANE) performs temporal layer switching at that point, the reference picture in the RPS from the same temporal layer as the TLA picture would not be in the DPB.
However, it is specified for the RPS that:
When the first coded picture in the bitstream is an IDR picture or the current coded picture is not a leading picture of the first coded picture in the bitstream, there shall be no entry in RefPicSetStCurrBefore, RefPicSetStCurrAfter or RefPicSetLtCurr that is equal to “no reference picture”.
An entry being equal to “no reference picture” means that the picture is not present in the DPB.
Thus, this requirement on the RPS would be violated, which means that with the current HEVC specification it is possible to use the TLA picture type even though it is not possible to perform valid temporal layer switching.
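The conflict can be made concrete with a short sketch. All names and POC values below are hypothetical: a TLA picture in temporal layer 1 lists POC 4 (layer 0) and POC 2 (layer 1) in its Curr subsets, with POC 2 listed but not used for inter prediction.

```python
NO_REFERENCE_PICTURE = None   # stand-in for an entry missing from the DPB


def resolve_curr_entries(curr_pocs, dpb_pocs):
    """Map each Curr entry to its POC, or to "no reference picture" if absent."""
    return [poc if poc in dpb_pocs else NO_REFERENCE_PICTURE
            for poc in curr_pocs]


def rps_requirement_met(curr_pocs, dpb_pocs):
    """True if no Curr entry resolves to "no reference picture"."""
    return NO_REFERENCE_PICTURE not in resolve_curr_entries(curr_pocs, dpb_pocs)


tla_curr_entries = [4, 2]            # POC 2 belongs to the TLA picture's own layer

dpb_all_layers = {0, 2, 4}           # decoder received every layer
dpb_after_switch = {0, 4}            # layer 1 was removed before the TLA picture

print(rps_requirement_met(tla_curr_entries, dpb_all_layers))     # True
print(resolve_curr_entries(tla_curr_entries, dpb_after_switch))  # [4, None]
print(rps_requirement_met(tla_curr_entries, dpb_after_switch))   # False
```

The same bitstream satisfies the RPS requirement when all layers are decoded, but not when a switch is made at the TLA picture, which is the situation described above.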