The High Efficiency Video Coding (HEVC) video coding standard [1] is currently in the process of being extended to support scalable layers. The plans include spatial, quality, and multiview scalable extensions in which bitstreams contain a base layer and a variable number of enhancement layers. The most recent draft version of the HEVC scalable extension was JCTVC-N1008_v3 [2].
The base layer in a scalable stream complies with the first version of HEVC, which does not include any scalable extensions. The enhancement layers in the scalable stream are marked as extension data using types that were defined as reserved in the first version of HEVC. In the HEVC specification the type is indicated using high level signaling and is carried in the Network Abstraction Layer (NAL) unit type. The first version specified that first version decoders, i.e. decoders compliant with the first version of HEVC, will discard data marked with extension types. This enables legacy or first version decoders to correctly decode the base layer in scalable bitstreams while ignoring the enhancement layers. Decoders compliant with an extension of HEVC will recognize the extension data types and be able to also decode the enhancement layers in the scalable bitstream.
Example picture structures using scalability are often illustrated as shown in FIG. 1. The figure illustrates spatial scalability with a lower resolution base layer 10 with three pictures 14, 16, 18 and a higher resolution enhancement layer 20 with three pictures 24, 26, 28. The arrows in the figure show how the pictures reference each other. As can be seen, the enhancement layer pictures 24, 26, 28 use the base layer pictures 14, 16, 18 for reference. Since no enhancement layer picture 24, 26, 28 is used for reference by a base layer picture 14, 16, 18, the enhancement layer 20 can be removed, which would result in a first version bitstream (version one bitstream) that would be decodable by a first version decoder (version one decoder). FIG. 1 contains three time instances, each with two pictures 14, 24; 16, 26; 18, 28: one base layer picture and one enhancement layer picture. The pictures of each time instance are grouped into what is called an access unit (AU) 30. There are three access units 30 in FIG. 1, of which one is marked with reference signs in order to simplify the figure.
The first version of HEVC is actually somewhat scalable as well since it includes temporal layers. This is illustrated in FIG. 2, which shows three pictures 14, 16, 18 in temporal layer 0 (pictures 2, 4, and 6) 10 and two pictures 24, 26 in temporal layer 1 (pictures 3 and 5) 20. Similar to the spatial scalability example, temporal layer 1 20 can be discarded. In this example the result is a bitstream with half the frame rate. There are five access units 30 in FIG. 2, of which one is marked with a reference sign.
A bitstream in which one or more higher layers have been removed is called a subbitstream, regardless of whether the higher layer is a temporal layer or a regular layer.
An important rule in HEVC regarding all types of scalability is that no picture of a lower layer is allowed to use a picture of a higher layer for reference. This rule is important as it preserves decodability for bitstreams in which higher layers have been removed. Another important rule is that if an encoder outputs a scalable bitstream, the encoder is responsible for ensuring that every possible subbitstream is a compliant bitstream. This means that a network node or any other entity can blindly discard any higher layer combination from a scalable bitstream and be assured that the output is a compliant bitstream. This rule is generally denoted the compliance rule.
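The two rules can be illustrated with a minimal Python sketch. The Picture class, the function names, and the exact reference arrows between the pictures of FIG. 1 are assumptions made for illustration only; the sketch models layers as integers and is not a bitstream extractor.

```python
from dataclasses import dataclass, field

@dataclass
class Picture:
    name: str
    layer: int                                 # 0 = base layer, higher = enhancement
    refs: list = field(default_factory=list)   # names of pictures used for reference

def obeys_layer_reference_rule(pictures):
    """True if no picture references a picture of a higher layer."""
    layer_of = {p.name: p.layer for p in pictures}
    return all(layer_of[r] <= p.layer for p in pictures for r in p.refs)

def extract_sub_bitstream(pictures, max_layer):
    """Blindly discard every picture that belongs to a layer above max_layer."""
    return [p for p in pictures if p.layer <= max_layer]

def all_references_present(pictures):
    """True if every referenced picture is still part of the stream."""
    names = {p.name for p in pictures}
    return all(r in names for p in pictures for r in p.refs)

# Hypothetical reference structure loosely modeled on FIG. 1.
stream = [
    Picture("14", 0), Picture("24", 1, ["14"]),
    Picture("16", 0, ["14"]), Picture("26", 1, ["16", "24"]),
    Picture("18", 0, ["16"]), Picture("28", 1, ["18", "26"]),
]
base_only = extract_sub_bitstream(stream, max_layer=0)
```

Because the hypothetical stream obeys the reference rule, discarding layer 1 leaves a base layer whose references are all still present.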
Every picture in HEVC has a picture order count (POC) value assigned to it. This value defines the order in which pictures are output; pictures are always output in increasing POC order. Output of pictures from a decoded picture buffer (DPB) is typically output for display on a screen. However, output as used herein also encompasses output for purposes other than display including, but not limited to, output for storage on a memory, output for transcoding, etc.
The range of allowed POC values is from −2^31 to 2^31−1, so in order to save bits in the slice header, only the least significant bits (LSB) of the POC (POC lsb) are signaled. This is done by the slice_pic_order_cnt_lsb codeword in HEVC. The number of bits to use for POC lsb is signaled in the sequence parameter set (SPS) using the codeword log2_max_pic_order_cnt_lsb_minus4 and can be between 4 and 16. Since only the POC lsb is signaled in the slice header, the most significant bits (MSB) of the POC (POC msb) for the current picture are derived from the POC of previous pictures and the POC lsb of the current picture. HEVC defines the derivation of the most significant POC bits in equation 8-1 in section 8.3.1 of the HEVC standard [1].
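The derivation can be sketched in Python as follows. The function name derive_poc and its parameters are chosen for illustration, and using a single previous POC value as the anchor is a simplification; the standard [1] defines precisely which previous picture serves as the anchor.

```python
def derive_poc(poc_lsb, prev_poc, log2_max_pic_order_cnt_lsb_minus4):
    """Derive the full POC from the signaled POC lsb and a previous POC."""
    max_lsb = 1 << (log2_max_pic_order_cnt_lsb_minus4 + 4)
    prev_lsb = prev_poc % max_lsb      # non-negative in Python, also for negative POC
    prev_msb = prev_poc - prev_lsb
    if poc_lsb < prev_lsb and (prev_lsb - poc_lsb) >= max_lsb // 2:
        poc_msb = prev_msb + max_lsb   # the lsb wrapped around upwards
    elif poc_lsb > prev_lsb and (poc_lsb - prev_lsb) > max_lsb // 2:
        poc_msb = prev_msb - max_lsb   # the lsb wrapped around downwards
    else:
        poc_msb = prev_msb
    return poc_msb + poc_lsb

# 8 lsb bits (codeword value 4): previous POC 1032, signaled lsb 10
poc = derive_poc(10, 1032, 4)   # 1024 + 10 = 1034
```

With 8 bits for POC lsb, a previous POC of 255 followed by a signaled lsb of 1 gives 257, since the lsb is taken to have wrapped around.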
The HEVC standard specifies a number of picture types that have different characteristics. An important set of picture types are the Intra random access point (IRAP) picture types. Those are intra coded pictures that provide random access points in a bitstream.
IRAP pictures in a scalable bitstream can either be aligned or non-aligned. This can be signaled in JCTVC-N1008_v3 [2] using the video parameter set (VPS) codeword cross_layer_irap_aligned_flag, see section F.7.4.3.1.1 in [2]. If this flag is 1 and one of the pictures in an access unit is an IRAP picture, all pictures in that access unit must be IRAP pictures of the same IRAP type. If cross_layer_irap_aligned_flag is set to 0 in the VPS, IRAP pictures are not required to be aligned and the picture structure shown in FIG. 3 is allowed.
An IRAP picture in the base layer provides a point where it is possible to start decoding. An IRAP picture in an enhancement layer provides a point where it is possible to start decoding that layer given that the layer(s) below is(are) being decoded. If IRAP pictures are aligned, the access unit containing IRAP pictures in all layers provides a point where it is possible to start decoding any number of layers.
If IRAP pictures are non-aligned, such as the pictures in FIG. 3, the process from making a random access in the base layer until a certain number of enhancement layers are being decoded is called the layer-wise startup process. The process can be summarized like this:
    1. Before decoding, no layer is considered initialized.
    2. When an IRAP picture is encountered in a base layer, that layer is immediately considered initialized.
    3. When an IRAP picture is encountered in an enhancement layer, that layer is considered initialized if all layers that the enhancement layer is dependent on are initialized.
    4. When a non-IRAP picture is encountered, that picture is decoded if it belongs to an initialized layer and discarded if it belongs to an uninitialized layer.
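The four steps can be sketched in Python as follows. The function name, the tuple layout, and the example stream are hypothetical and do not reproduce FIG. 3. The sketch also assumes that an enhancement layer IRAP picture is discarded when its layer cannot yet be initialized, which the summary above does not state explicitly.

```python
def layerwise_startup(pictures, depends_on):
    """Return the names of the pictures that would be decoded.

    pictures:   (name, layer, is_irap) tuples in decoding order
    depends_on: maps a layer to the layers it depends on (base layer: none)
    """
    initialized = set()                      # step 1: nothing initialized
    decoded = []
    for name, layer, is_irap in pictures:
        if is_irap and all(d in initialized for d in depends_on.get(layer, [])):
            initialized.add(layer)           # steps 2 and 3
        if layer in initialized:             # step 4 (also covers IRAP pictures)
            decoded.append(name)
    return decoded

# Hypothetical two-layer stream with non-aligned IRAP pictures (B1 and E2).
stream = [
    ("B0", 0, False), ("E0", 1, False),
    ("B1", 0, True),  ("E1", 1, False),
    ("B2", 0, False), ("E2", 1, True),
    ("B3", 0, False), ("E3", 1, False),
]
decoded = layerwise_startup(stream, {0: [], 1: [0]})
# decoded == ["B1", "B2", "E2", "B3", "E3"]
```

Decoding of the base layer starts at B1, while the enhancement layer is skipped until its own IRAP picture E2 arrives.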
The layer-wise startup process would decode the grey pictures in FIG. 3. Note that IRAP pictures in enhancement layers may reference pictures in lower layers while IRAP pictures in the base layer are not allowed to reference any other picture.
There are three types of IRAP pictures in HEVC, the instantaneous decoding refresh (IDR), clean random access (CRA), and broken link access (BLA) types.
An IDR picture in the base layer is an intra picture that refreshes the decoder. This means that neither the IDR picture nor any base layer picture that follows the IDR picture in the bitstream can have any dependency on any picture that precedes the IDR picture in the bitstream. There is no codeword (POC lsb) signaled for base layer IDR pictures in HEVC; instead the POC is set equal to 0.
The base layer CRA picture is an intra picture that in contrast to the IDR picture does not refresh the decoder. This allows for base layer pictures later in the bitstream to depend on base layer pictures before the CRA picture in the bitstream. These pictures that precede the CRA picture in output order are called leading pictures and are not allowed to be displayed after the CRA picture, only before. All leading pictures in HEVC must be identifiable by using special picture types.
Besides using a base layer CRA picture for random access it is also possible to use the CRA picture for splicing video streams where a particular base layer IRAP picture in the middle of a bitstream and all subsequent pictures are replaced by a base layer CRA picture with its subsequent pictures from another bitstream, see FIG. 4. Since the base layer CRA picture may have leading pictures, which become undecodable when the CRA picture is used for splicing, the HEVC standard has defined the BLA picture type to use for CRA splicing. During splicing the CRA picture is simply re-typed as a BLA picture, see FIG. 4. The BLA picture type instructs the decoder to discard the leading pictures, and this makes splicing work. Another alternative would have been to remove the leading pictures during splicing but then system buffer parameters would need to be recalculated since data has been removed.
POC lsb is signaled for every picture in HEVC except pictures of the IDR type; for this IRAP picture type the POC is set to 0. The POC msb is set to 0 for BLA pictures, but the POC lsb is present in the BLA slice header and the POC of the BLA picture is set equal to the POC lsb. The POC for CRA pictures is calculated as described above using POC lsb and POC msb.
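As a compact summary, the per-type handling can be written as a small Python function. The function name and the string labels for the picture types are illustrative assumptions, and the POC msb is taken as an already derived input.

```python
def poc_for_picture(pic_type, poc_lsb=0, derived_poc_msb=0):
    """POC of a base layer picture, depending on its picture type."""
    if pic_type == "IDR":
        return 0                      # no POC lsb signaled, POC is 0
    if pic_type == "BLA":
        return poc_lsb                # POC msb is set to 0
    return derived_poc_msb + poc_lsb  # CRA and all other pictures

idr_poc = poc_for_picture("IDR")
bla_poc = poc_for_picture("BLA", poc_lsb=8, derived_poc_msb=1024)   # 8
cra_poc = poc_for_picture("CRA", poc_lsb=8, derived_poc_msb=1024)   # 1032
```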
A very important rule in HEVC is that the POC of every picture in the same access unit must be identical. This rule makes it easy to detect access unit boundaries, since two pictures with different POC values belong to different access units, and it also makes it easier to detect picture losses. This rule is generally referred to as the POC alignment rule.
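A minimal Python sketch of boundary detection based on this rule is given below. The function name and the (name, POC) tuple layout are assumptions for illustration; the sketch only groups consecutive pictures and does not itself detect losses.

```python
def split_into_access_units(pictures):
    """Group (name, poc) tuples, given in decoding order, into access units.

    A new access unit starts as soon as the POC differs from the POC of
    the previous picture.
    """
    access_units = []
    for name, poc in pictures:
        if access_units and access_units[-1][0] == poc:
            access_units[-1][1].append(name)     # same POC: same access unit
        else:
            access_units.append((poc, [name]))   # POC changed: new access unit
    return access_units

aus = split_into_access_units(
    [("B0", 0), ("E0", 0), ("B1", 1), ("E1", 1), ("B2", 2)]
)
# aus == [(0, ["B0", "E0"]), (1, ["B1", "E1"]), (2, ["B2"])]
```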
Another rule regarding BLA and IDR pictures is that they reset the decoder. Among other things this flushes the decoder and forces all pictures that have not yet been output to be output. Note that CRA pictures do not flush the decoder. This rule is generally referred to as the IRAP output rule.
Another rule regarding POC values is that the current picture and all its short-term reference pictures must be within a certain POC range, which is half of the maximum value of POC lsb. If POC lsb is coded using 8 bits, the maximum value is 2^8−1=255 and the allowed POC range is 255/2=127 (using integer division). This rule is generally referred to as the POC range rule.
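The arithmetic can be sketched in Python as follows. The function names are illustrative, and treating the limit as a strict inequality is an assumption: the text does not say whether a distance of exactly 127 is allowed.

```python
def poc_range(num_lsb_bits):
    """Half of the maximum POC lsb value, using integer division."""
    max_lsb_value = (1 << num_lsb_bits) - 1
    return max_lsb_value // 2

def within_poc_range(current_poc, ref_poc, num_lsb_bits):
    """Check one short-term reference picture against the POC range rule.

    Assumption: the POC distance must be strictly below the range.
    """
    return abs(current_poc - ref_poc) < poc_range(num_lsb_bits)

allowed = poc_range(8)                     # (2^8 - 1) // 2 = 127
near = within_poc_range(1034, 1032, 8)     # distance 2: within range
far = within_poc_range(0, 1032, 8)         # distance 1032: out of range
```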
The POC alignment rule and the rule that IDR pictures always have POC equal to zero pose a problem that is solved by the poc_reset_flag in JCTVC-N1008_v3, see section F.7.4.7.1 in [2]. The problem is illustrated in FIG. 5, where the POC value of each picture 12, 14, 16, 22, 24, 26 is shown inside each picture (the following figures in this document also show their POC values inside each picture). Only one of the pictures, the IDR picture 12 in FIG. 5, is an IRAP picture; all other pictures 14, 16, 22, 24, 26 are non-IRAP pictures. Assume now that the encoder has encoded a lot of pictures so that the last POC is 1032. Remember that only the POC lsb is signaled in the slice header. If we assume that 8 bits are used to signal POC lsb, the last POC was signaled by the value 1032%256=8, while the POC msb of 1032−8=1024 was derived. Now the encoder wants to put an IDR picture 12 in the base layer 10 without making the enhancement layer picture 22 in the same access unit 30 an IRAP picture.
The IDR picture 12 sets the POC to 0, and the POC alignment rule says that all pictures 12, 22 in the same access unit 30 must have the same POC value. This means that both pictures 12, 22 in the access unit 30 containing the IDR picture 12 must have POC equal to 0. One problem here is that the enhancement layer picture 22 uses the picture 24 in the enhancement layer 20 with POC equal to 1032 for reference and this would violate the POC range rule given that picture 24 with POC 1032 is a short-term picture (1032−0>127). Another problem is that if the POC is set to 0 for the enhancement layer picture 22, that picture 22 gets a lower POC than the picture 24 with POC 1032, which would indicate that the enhancement layer picture 22 should be output before the picture 24 with POC 1032. This is not a problem for the base layer 10 due to the IRAP output rule but it is a problem for the enhancement layer picture 22.
The poc_reset_flag in JCTVC-N1008_v3 [2] solves this problem. The poc_reset_flag is a flag in the slice header of enhancement layer pictures that has the following effect on the decoder:
    a) The POC is derived normally, deriving the POC msb and using the POC lsb as signaled in the slice header.
    b) Then, the POC of every reference picture is decremented by the derived POC value.
    c) Finally, the POC of the enhancement layer picture is set equal to 0.
Signaling a POC lsb of 10 (POC=POC msb+POC lsb=1024+10=1034) for the enhancement layer picture 22 of the access unit 30 containing the IDR picture 12 would result in the POC numbers shown in FIG. 6 after step a has taken place and FIG. 7 shows the POC numbers after steps a, b, and c have taken place. In practice not all pictures shown with negative POC values are still reference pictures at the time steps a, b, and c are carried out in this example, but here we assume that they are.
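Steps a, b, and c can be sketched in Python using the numbers of this example. The function name is illustrative, the POC msb is passed in as an already derived value, and the reference picture with POC 1030 is inferred from the value −4 that results after the reset.

```python
def decode_with_poc_reset(poc_msb, poc_lsb, reference_pocs):
    """Apply the poc_reset_flag steps to one enhancement layer picture.

    Returns the POC of the current picture and the updated POC values of
    its reference pictures.
    """
    derived_poc = poc_msb + poc_lsb                          # step a
    shifted = [poc - derived_poc for poc in reference_pocs]  # step b
    current_poc = 0                                          # step c
    return current_poc, shifted

# POC msb 1024 and signaled POC lsb 10 give a derived POC of 1034.
current_poc, ref_pocs = decode_with_poc_reset(1024, 10, [1030, 1032])
# current_poc == 0, ref_pocs == [-4, -2]
```

The relative distances between the pictures are unchanged by the reset; only the origin of the POC axis moves to the current picture.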
As can be seen, FIG. 7 shows that the problems of POC range and output order are both solved by the poc_reset_flag. The correct output order is maintained: first a picture 16, 26 with POC=−4 is output, then a picture 14, 24 with POC=−2, and finally a picture 12, 22 with POC=0. The POC range is also correct since 0−(−2)=−2−(−4)=2<127. The poc_reset_flag is sent in the slice header extension part of HEVC. This is a part of the HEVC slice header that will be ignored by legacy or first version HEVC decoders.
There are, however, a number of problems with the existing solution, referred to herein as the error resilience problem, the non-existing picture problem, the temporal layer problem, and the bit cost problem.
The Error Resilience Problem
Consider the video stream shown in FIG. 8. If picture X 22 is lost during transmission, chances are that the POC of picture Y 21 is calculated based on the picture 24 with POC 1032. It would then get POC equal to 1026 (POC msb=1024 and POC lsb=2). Since the poc_reset_flag is carried in picture X 22, which is lost, no recalculation of POC values is done. The loss of picture X 22 therefore causes the following problems:
    - Picture Y 21 gets POC 1026, which violates the POC alignment rule (1026≠2).
    - Picture Y 21 gets POC 1026, which is lower than 1032, and the relative output order is incorrect.
    - If the poc_reset_flag had been received, POC 1032 would have been recalculated to −2. The POC of picture Y 21 would be 2 and therefore the reference picture set (RPS) of picture Y 21 would use a delta POC of −4 to indicate the reference picture. But the POC of picture Y 21 is equal to 1026, which would specify a reference picture with POC 1022, which is incorrect.
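The numbers above can be reproduced with a short Python sketch. The function derive_poc is an illustrative rendering of the POC msb derivation with 8 bits for POC lsb, and using a single previous POC as the anchor is a simplifying assumption.

```python
MAX_LSB = 256   # 8 bits are used for POC lsb in this example

def derive_poc(poc_lsb, prev_poc):
    """Derive the POC from the signaled lsb and the POC of the previous picture."""
    prev_lsb = prev_poc % MAX_LSB
    prev_msb = prev_poc - prev_lsb
    if poc_lsb < prev_lsb and prev_lsb - poc_lsb >= MAX_LSB // 2:
        prev_msb += MAX_LSB
    elif poc_lsb > prev_lsb and poc_lsb - prev_lsb > MAX_LSB // 2:
        prev_msb -= MAX_LSB
    return prev_msb + poc_lsb

DELTA_POC = -4  # RPS delta POC used by picture Y

# Picture X received: X has POC 0 after the reset, and 1032 has become -2.
poc_y_ok = derive_poc(2, prev_poc=0)         # 2
ref_ok = poc_y_ok + DELTA_POC                # -2, the intended reference picture

# Picture X lost: no reset happens, so Y is derived from POC 1032.
poc_y_lost = derive_poc(2, prev_poc=1032)    # 1026
ref_lost = poc_y_lost + DELTA_POC            # 1022, not the intended picture
```

In the loss case the intended reference picture still carries POC 1032, so the RPS of picture Y points at POC 1022 instead.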
The Non-Existing Picture Problem
Consider the video stream shown in FIG. 9. If the enhancement layer 20 has half the frame rate of the base layer 10, there is no enhancement layer picture corresponding to the IDR picture 12. This means that the POC cannot be reset in the enhancement layer 20 and the picture structure as shown in FIG. 9 is not possible.
The reason for not having an enhancement layer picture at the IDR picture is that a smaller maximum size of any access unit is desirable and the IDR access unit becomes smaller when there are no enhancement layer pictures there. A solution that would support this structure would be better than to forbid the shown picture structure.
The Temporal Layer Problem
Consider the video stream shown in FIG. 10. Here the enhancement layer consists of two temporal layers 20A, 20B to provide temporal scalability and picture X 42 is put in the higher temporal layer 20B. This works well if all pictures are received. However, if a node removes temporal layer 1 20B in the enhancement layer, the decoder faces the non-existing picture problem. Therefore, the picture structure in FIG. 10 is actually forbidden according to the prior art, which limits the encoder flexibility.
The Bit Cost Problem
Further, the current approach mandates that one bit in each slice header is used to signal poc_reset_flag. A more cost-effective solution is desired.
Thus, there is room for improvement with regard to scalable bitstreams, in which a video stream comprises pictures in multiple so-called layers.