High Efficiency Video Coding (HEVC, ITU-T H.265|ISO/IEC 23008-2) is a recent video coding standard developed by the Joint Collaborative Team on Video Coding (JCT-VC), a collaborative project between the Moving Picture Experts Group (MPEG) and the International Telecommunication Union (ITU) Telecommunication Standardization Sector (ITU-T). HEVC uses a block-based hybrid scheme that exploits spatial (Intra) prediction and temporal (Inter) prediction. The first picture of a video sequence is encoded using Intra prediction only, i.e. as an Intra picture, since no temporal reference is available.
A basic concept in video encoding and decoding is to compress and decompress video data of a video sequence or stream by exploiting spatial and temporal redundancy in the video data. Generally, blocks of pixels, also denoted samples in the art, are encoded and decoded relative to reference blocks of pixels within the same picture (intra prediction) or (an)other picture(s) (inter prediction) of the video sequence. For instance, HEVC specifies 33 directional modes, a planar mode and a DC mode for intra prediction. The intra prediction modes use data from neighboring prediction blocks, i.e. blocks of pixels, which have been previously decoded. Inter prediction may use data from one or more prediction blocks in (an)other picture(s). These reference blocks are typically identified by a respective motion vector (MV). HEVC allows for two MV modes, Advanced Motion Vector Prediction (AMVP) and merge mode. AMVP uses data from the reference picture and can also use data from adjacent prediction blocks. The merge mode allows the MVs to be inherited from neighboring prediction blocks. The difference between the current block of pixels and the reference block of pixels is then encoded and used, together with a representation of the intra prediction mode or the MV data, as the encoded representation of the block of pixels. The resulting encoded bitstream output from the encoder is then decoded at the decoder to obtain decoded representations of the pictures in the video sequence or stream.
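The DC intra prediction mode mentioned above can be illustrated with a minimal sketch: the block is filled with the rounded average of the previously decoded samples above and to the left of it. This is an illustrative simplification only, not the normative HEVC process, which additionally filters the reference samples and special-cases unavailable neighbors and block edges.

```python
def dc_intra_predict(above, left, size):
    """Fill a size x size block with the rounded average of the
    reference samples.

    `above` and `left` hold previously decoded neighboring samples
    (hypothetical inputs; reference-sample filtering and edge handling
    from the actual standard are omitted here).
    """
    total = sum(above[:size]) + sum(left[:size])
    dc = (total + size) // (2 * size)  # average with rounding
    return [[dc] * size for _ in range(size)]
```

For example, with a 4x4 block whose above neighbors are all 4 and left neighbors are all 8, every predicted sample becomes (16 + 32 + 4) // 8 = 6.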
An extension of HEVC is a scalable extension (SHVC) that allows a single encoded bitstream to contain different versions of the same video with different resolutions and/or quality. Prediction between the layers is allowed in order to improve coding efficiency compared to sending the different versions of the video as independent streams. A special use case of scalable video coding is Adaptive Resolution Change (ARC), which uses the layers to create an adaptive video bitstream. When the resolution needs to be changed, the adaptive video coder switches to a layer with a resolution that is suitable for the current network conditions and continues the coding. Resolution change can also be done within one layer, and in the non-scalable version of HEVC, but in that case an Intra picture is required each time the resolution is changed, which reduces coding efficiency.
Each encoded picture in an SHVC stream is associated with a Picture Order Count (POC) value representing the output order of pictures. A picture with a higher POC value is output later than a picture with a lower POC value. In SHVC there can be pictures from different layers with the same POC value, which are said to belong to the same Access Unit (AU). This typically means that they represent different versions of the same original image, e.g. one at full resolution and one down-sampled, and that, if they are output, they are output at the same time. When more than one picture is signaled in the same AU, the pictures must belong to different layers, i.e. have different layer identifiers nuh_layer_id.
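The AU grouping described above can be sketched as follows. The `(poc, nuh_layer_id)` pair representation is an illustrative stand-in for the actual bitstream syntax; the point is simply that pictures sharing a POC value form one AU and must carry distinct layer identifiers.

```python
from collections import defaultdict

def group_into_access_units(pictures):
    """Group pictures into access units keyed by POC value.

    `pictures` is an iterable of (poc, nuh_layer_id) pairs -- a
    simplified model, not the SHVC syntax itself.  Two pictures in the
    same AU with the same layer identifier violate the constraint that
    pictures of an AU belong to different layers.
    """
    aus = defaultdict(list)
    for poc, layer_id in pictures:
        if layer_id in aus[poc]:
            raise ValueError(
                f"duplicate nuh_layer_id {layer_id} in AU with POC {poc}")
        aus[poc].append(layer_id)
    return dict(aus)
```

Iterating the resulting AUs in increasing POC order then corresponds to the output order of the pictures.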
The lowest layer in a scalable bitstream is called the base layer and has layer identifier zero. Higher layers are called enhancement layers and have layer identifiers larger than zero. In SHVC, Intra Random Access Point (IRAP) pictures in enhancement layers are pictures that do not reference any pictures of the enhancement layer; reference to the base layer is, however, allowed. An IRAP picture also prohibits pictures that follow it in decoding order from referencing pictures that precede it in decoding order.
In typical SHVC streams there are pictures in all layers in every AU, i.e. at every time instance, but in the case of ARC the encoder would typically choose to set the single_layer_for_non_irap_flag equal to 1. single_layer_for_non_irap_flag equal to 1 indicates that there are at most two pictures in each AU, i.e. at each time instance, and that when there are two pictures in the same AU, the one in the highest layer must be an IRAP picture.
When this flag is equal to one there is in general only one picture in each AU, either in the base layer or in an enhancement layer. The only exception is when the enhancement layer picture is signaled as an IRAP picture; in that AU, pictures are allowed in both the base layer and the enhancement layer(s). This means that the base layer picture can be used for prediction by the enhancement layer. It also means that the enhancement layer picture cannot reference any pictures in the enhancement layer, since it is coded as an IRAP picture. Traversing from a lower layer to a higher layer is called up-switching; correspondingly, traversing to a lower layer is called down-switching.
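The constraint implied by single_layer_for_non_irap_flag equal to 1 can be expressed as a simple check. The dictionary-of-tuples representation below is a hypothetical model of the decoded stream, not the actual SHVC syntax: each AU maps to a list of (nuh_layer_id, is_irap) pairs.

```python
def check_single_layer_for_non_irap(access_units):
    """Verify the constraint signaled by single_layer_for_non_irap_flag == 1.

    `access_units` maps a POC value to a list of (nuh_layer_id, is_irap)
    pairs -- an illustrative stand-in for the bitstream.  At most two
    pictures may share an AU, and when two do (a layer-switch point),
    the picture in the higher layer must be an IRAP picture.
    """
    for pics in access_units.values():
        if len(pics) > 2:
            return False  # more than two pictures in one AU
        if len(pics) == 2:
            highest = max(pics, key=lambda p: p[0])
            if not highest[1]:  # higher-layer picture must be IRAP
                return False
    return True
```

For instance, an AU holding a base layer picture together with an enhancement layer IRAP picture passes the check, while one holding two non-IRAP pictures does not.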
In a typical case using SHVC for adaptive resolution change, a prediction structure as shown in FIG. 1 would be favorable for a layer switch, since it is then sufficient to provide only a single picture at each time instance in the video. However, an IRAP picture is constrained to contain only intra prediction or inter-layer prediction from pictures with the same POC. This means that at a switching point 2, pictures 12, 22 must exist in both layers 10, 20, as shown in FIG. 2, in order not to restrict the IRAP picture to intra-only coding. When SHVC is used for adaptive resolution change, the decoder is typically intended to output only one of these pictures 12, 22. However, a straightforward encoder implementation will not take this into consideration and will encode both pictures 12, 22 at the switching point 2 as efficiently as possible, with the result that bits are spent encoding details in a picture that will never be displayed. This redundant coding increases the size of the bitstream and adds to both encoding and decoding complexity.