H.264 Video Compression
H.264 (Moving Picture Experts Group-4 Advanced Video Coding (MPEG-4 AVC)) is the state-of-the-art video coding standard. It consists of a block-based hybrid video coding scheme that exploits temporal and spatial redundancies. The H.264/AVC standard is defined in a specification text that contains many decoding processes that have to be executed in the specified sequence in order for a decoder to be compliant with the standard. There are no requirements on the encoder, but it is often the case that the encoder also executes most of the processes in order to achieve good compression efficiency.
H.264/AVC defines a decoded picture buffer (DPB) that stores decoded pictures after they have been decoded. This means that the decoder is required to use a defined amount of memory in order to decode a sequence. The DPB contains pictures that are used for reference during decoding of future pictures. “Used for reference” here means that a particular picture is used for prediction when another picture is decoded. Pixel values of the picture that is used for reference may then be used to predict the pixel values of the picture that is currently decoded. This is also referred to as Inter prediction. The DPB additionally contains pictures that are waiting for output. “Output” here means the function where a decoder outputs a picture outside the decoder. The H.264 specification describes how a bitstream is converted into decoded pictures that are then output, see FIG. 1. The output pictures may e.g. be displayed or written to disk.
One common reason for a picture in the DPB to be waiting for output is that there is a picture that has not been decoded yet that will be output before the picture.
FIG. 2 shows an example of three pictures: A, B, and C. The decoding order is the order in which the pictures in compressed format are fed into the decoder. This is typically the same order in which the pictures are encoded by the encoder. FIG. 2 shows that the decoding order in this example is A, B and C. The output order is the order in which the decoded pictures are output. The output order does not have to be the same as the decoding order, as is illustrated in the example in FIG. 2 where the output order is A, C, B. The arrows in the figure show which pictures are used for reference for each picture: picture A is used for reference for both picture B and picture C.
In FIG. 2, picture C is decoded after B but output before it. When picture B has been decoded, it cannot be output immediately since picture C has not been decoded yet and has to be output before picture B. Therefore, picture B has to be stored in the DPB after it has been decoded even if it is not used for reference by any other picture. When decoding picture C, picture A must also be present in the DPB since picture C uses picture A for reference.
Output order is controlled by signaling a PictureOrderCount (POC) value. There are syntax elements in the bitstream to convey the POC of every picture and these values are used in order to define the output order of pictures.
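As an illustration of how POC values determine the output order, the following minimal Python sketch uses the three pictures of FIG. 2. The POC values 0, 2 and 1 are assumptions chosen only so that the output order becomes A, C, B, and the function names are hypothetical and not part of the standard. The second function looks ahead at pictures that have not been decoded yet, which a real decoder cannot do; it serves only to show why picture B must remain in the DPB.

```python
# Decoding order of FIG. 2 with assumed (hypothetical) POC values.
decoding_order = [("A", 0), ("B", 2), ("C", 1)]


def output_order(pictures):
    """Return the picture names sorted by increasing POC."""
    return [name for name, poc in sorted(pictures, key=lambda p: p[1])]


def pictures_waiting_after(pictures, index):
    """Names of pictures decoded up to and including `index` that must
    stay buffered because a not-yet-decoded picture has a lower POC."""
    decoded = pictures[: index + 1]
    pending = pictures[index + 1:]
    return [name for name, poc in decoded
            if any(pending_poc < poc for _, pending_poc in pending)]


print(output_order(decoding_order))               # output order by POC
print(pictures_waiting_after(decoding_order, 1))  # state after decoding B
```

In this sketch, picture B is the only picture that has to wait after it has been decoded, since picture C has a lower POC and is still to come.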
To keep track of the DPB, H.264/AVC contains three processes that take place after a picture has been decoded: the picture marking process, the picture output process and the free-up process.
The picture marking process marks pictures as either “used for reference” or “unused for reference”. A picture marked as “used for reference” is available for reference which means that a subsequent picture in decoding order may use the picture for reference in its decoding processes. A picture marked as “unused for reference” cannot be used for reference by subsequent pictures. This process is controlled by the encoder through the bitstream. There is optional syntax in the H.264/AVC bitstream that when present indicates what pictures to mark as “unused for reference”. This operation is often referred to as the memory management control operation (MMCO). If there is no optional MMCO syntax, a first-in, first-out mechanism is defined, called the “sliding window” process. The sliding window process means that when the last decoded picture would result in too many pictures in the DPB, the oldest picture in decoding order is automatically marked as “unused for reference”.
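The first-in, first-out behavior of the sliding window process may be sketched as follows. This is a simplified illustration only: the parameter max_refs is a hypothetical stand-in for the allowed number of reference pictures, long-term reference pictures and MMCO commands are ignored, and the function name is not taken from the standard.

```python
from collections import deque


def sliding_window_mark(reference_pics, new_pic, max_refs):
    """Add `new_pic` to the pictures marked "used for reference" and mark
    the oldest pictures in decoding order as "unused for reference" until
    at most `max_refs` reference pictures remain.

    Returns (pictures still used for reference, pictures newly unmarked).
    """
    refs = deque(reference_pics)  # oldest picture in decoding order first
    refs.append(new_pic)
    unmarked = []
    while len(refs) > max_refs:
        unmarked.append(refs.popleft())  # first in, first out
    return list(refs), unmarked


still_used, now_unused = sliding_window_mark(["A", "B"], "C", max_refs=2)
print(still_used, now_unused)
```

With two reference slots already holding A and B, decoding picture C pushes the oldest picture, A, out of the window.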
The picture output process, which is done after the picture marking process, marks pictures as either “needed for output” or “not needed for output”. A picture marked as “needed for output” has not been output yet while a picture marked as “not needed for output” has been output and is no longer waiting for output. The picture output process also outputs pictures. This means that the process selects pictures that are marked as “needed for output”, outputs them and thereafter marks them as “not needed for output”. The picture output process determines in which order pictures are output. Note that the picture output process may output and mark zero, one or many pictures after one particular picture has been decoded.
After these two processes have been invoked by the decoder, the free-up process is invoked. Pictures that are marked both as “unused for reference” and “not needed for output” are emptied and removed from the DPB. This is sometimes described as freeing one of the DPB picture slots.
The size of the DPB in H.264/AVC is limited. This means that the number of pictures that can be stored because they are waiting for output or made available for reference is limited. The variable max_dec_frame_buffering denotes the size of the DPB, sometimes referred to as the number of picture slots there are in the DPB. The encoder has to ensure that the DPB never overflows.
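The free-up rule can be expressed compactly in code. The sketch below is a minimal illustration assuming a hypothetical Picture record that carries the two markings described above; it is not the data structure of any particular decoder.

```python
from dataclasses import dataclass


@dataclass
class Picture:
    name: str
    used_for_reference: bool  # result of the picture marking process
    needed_for_output: bool   # result of the picture output process


def free_up(dpb):
    """Keep only pictures that are still used for reference or still
    waiting for output; all other picture slots are freed."""
    return [p for p in dpb if p.used_for_reference or p.needed_for_output]


dpb = [
    Picture("A", used_for_reference=True, needed_for_output=False),
    Picture("B", used_for_reference=False, needed_for_output=True),
    Picture("C", used_for_reference=False, needed_for_output=False),
]
print([p.name for p in free_up(dpb)])
```

Only picture C carries both the “unused for reference” and the “not needed for output” marking, so only its slot is freed.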
The three processes are described in the standard. This means that the decoder is controlled by the encoder and therefore the decoder does not have any freedom regarding output order. It is all determined by the picture output process and the related elements in the bitstream sent by the encoder. A simplified flow chart for the decoding steps of H.264/AVC is shown in FIG. 3.
The picture output process in H.264 defines the order in which pictures shall be output. A decoder that outputs pictures in the correct order is output order compliant. A decoder may follow the picture output process described in H.264 but it is sometimes possible to use the variable num_reorder_frames to output pictures earlier than what is given by the picture output process. num_reorder_frames indicates the maximum number of pictures that precede any picture in decoding order and follow it in output order.
FIG. 4 shows an example where picture B has just been decoded. Picture B cannot, however, be output since it is not known whether picture C is to be output before or after picture B. If the encoder has decided that the output order is the same as the decoding order, it can indicate a num_reorder_frames value of 0 to the decoder. The encoder has thereby promised that picture C in the example will be output after picture B, and a decoder can output picture B immediately when it has been decoded. In this case, when num_reorder_frames is 0, there is no additional reordering delay in the decoder. If num_reorder_frames in the example is set to 1, it is possible that picture C is to be output before picture B. With num_reorder_frames equal to 1, there is an additional reordering delay of 1 picture; with num_reorder_frames equal to 2, the reordering delay is 2 pictures, and so on.
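One way a decoder could exploit num_reorder_frames is sketched below. This is an illustrative assumption, not the picture output process of the standard: whenever more pictures are waiting for output than num_reorder_frames allows, the waiting picture with the lowest POC is output. The function name and the POC values are hypothetical.

```python
def simulate_output(decoding_order, num_reorder_frames):
    """Return, for each decoded picture, the list of pictures output
    right after it has been decoded. The final entry holds the pictures
    that are flushed at the end of the sequence."""
    waiting = []   # (poc, name) pairs still waiting for output
    timeline = []
    for name, poc in decoding_order:
        waiting.append((poc, name))
        emitted = []
        # More pictures are waiting than may be reordered past a later
        # picture, so the one with the lowest POC can safely be output.
        while len(waiting) > num_reorder_frames:
            waiting.sort()
            emitted.append(waiting.pop(0)[1])
        timeline.append(emitted)
    timeline.append([name for _, name in sorted(waiting)])
    return timeline


# Output order equals decoding order: no reordering delay.
print(simulate_output([("A", 0), ("B", 1), ("C", 2)], 0))
# Picture C is output before picture B: a delay of one picture.
print(simulate_output([("A", 0), ("B", 2), ("C", 1)], 1))
```

The early output is safe because, if a later picture had a lower POC than all waiting pictures, more than num_reorder_frames pictures would precede it in decoding order and follow it in output order, which the encoder has promised not to happen.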
HEVC Video Compression
High Efficiency Video Coding (HEVC), also referred to as H.265, is a video coding standard developed in Joint Collaborative Team-Video Coding (JCT-VC). JCT-VC is a collaborative project between MPEG and International Telecommunication Union Telecommunication Standardization Sector (ITU-T). HEVC includes a number of new tools and is considerably more efficient than H.264/AVC. HEVC also defines a temporal_id for each picture, corresponding to the temporal layer the picture belongs to. The temporal layers are ordered and have the property that a lower temporal layer never depends on a higher temporal layer. Thus, higher temporal layers can be removed without affecting the lower temporal layers. The removal of temporal layers can be referred to as temporal scaling. An HEVC bitstream contains a syntax element, max_sub_layers_minus1, which specifies the maximum number of temporal layers that may be present in the bitstream. A decoder may decode all temporal layers or only decode a subset of the temporal layers. The highest temporal layer that the decoder actually decodes is referred to as the highest temporal sub-layer and may be set equal to or lower than the maximum number of layers as specified by max_sub_layers_minus1. The decoder then decodes all layers that are equal to or lower than the highest temporal sub-layer. The highest temporal sub-layer may be set by external means.
Note that the description above is not specific for temporal layers, but also holds for other types of layers such as spatial layers and quality layers, etc. The temporal layer that the decoder then decodes is referred to as the highest decoded layer.
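Temporal scaling amounts to discarding every picture whose temporal_id exceeds the highest temporal sub-layer. A minimal sketch follows; the picture names, the temporal_id assignment and the function name are assumptions made for illustration only.

```python
def temporal_scale(pictures, highest_temporal_sub_layer):
    """Keep only the pictures whose temporal_id does not exceed the
    highest temporal sub-layer selected for decoding."""
    return [(name, tid) for name, tid in pictures
            if tid <= highest_temporal_sub_layer]


# Hypothetical stream in decoding order: (picture name, temporal_id).
stream = [("A", 0), ("B", 0), ("C", 1), ("D", 2), ("E", 2)]
print(temporal_scale(stream, 1))
```

Because a lower temporal layer never depends on a higher one, the pictures that remain after the removal can still be decoded.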
The decoding flow of HEVC is slightly different from that of H.264/AVC. HEVC has a DPB, a picture marking process that marks pictures as “used for reference” and “unused for reference”, a picture output process that marks pictures as “needed for output” and “not needed for output”, and a free-up process. Like H.264/AVC, HEVC also uses POC values to define the picture output order. In HEVC, a POC value is represented by the variable PicOrderCntVal, and pictures are output in increasing PicOrderCntVal order.
HEVC, however, has neither MMCO nor a sliding window process. Instead, HEVC specifies that a list of the pictures that are marked as “used for reference” is explicitly sent in each slice header. The picture marking in HEVC uses this list and ensures that all pictures in the DPB that are listed are marked as “used for reference” and that all pictures in the DPB that are not listed are marked as “unused for reference”. The list is called the reference picture set (RPS), and sending one in each slice header means that the state of the reference marking in the DPB is explicit and repeated in each slice, which is not the case in H.264/AVC.
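The RPS-based marking may be illustrated by the following sketch, in which the pictures in the DPB are identified by hypothetical POC values and the RPS is modeled as a plain set of POC values. The actual RPS syntax is more elaborate; the function name and the values are assumptions.

```python
def apply_rps(dpb_marking, rps):
    """Return the new reference marking of the DPB: a picture is marked
    "used for reference" (True) if and only if it is listed in the RPS."""
    return {poc: (poc in rps) for poc in dpb_marking}


# DPB before parsing the slice header: POC -> currently used for reference.
dpb_marking = {0: True, 2: True, 4: True}
# The RPS in the slice header lists the pictures with POC 0 and POC 4.
print(apply_rps(dpb_marking, {0, 4}))
```

Note that the result depends only on the RPS and on which pictures are present in the DPB, not on the previous marking, which is why the marking state is explicit in every slice.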
Since RPSs are used in HEVC, the picture marking process, the picture output process and the free-up process are all done after the parsing of the first slice header of a picture, see FIG. 5.
The num_reorder_frames functionality as described for H.264/AVC is also present in HEVC. An HEVC bitstream contains a syntax element for each temporal layer, denoted max_num_reorder_pics[i], where i is the temporal layer. The function of max_num_reorder_pics[i] is the same as that of num_reorder_frames, but each codeword here indicates the maximum allowed number of pictures in the same or a lower temporal layer that precede a picture in decoding order and succeed that picture in output order.
Consider the example in FIG. 6 where the decoding order is A, B, C, D, E and the output order is A, D, C, E, B. This is a structure of pictures that uses temporal layers where pictures A and B belong to the lowest temporal layer (layer 0), picture C belongs to a middle temporal layer (layer 1) and pictures D and E belong to the highest temporal layer (layer 2). The arrows in the figure show which pictures are used for reference by other pictures. For example, picture A is used for reference by picture B since there is an arrow from picture A to picture B. The best use of max_num_reorder_pics in HEVC is to set it as low as possible to reduce the output delay as much as possible. The lowest possible values of max_num_reorder_pics for each temporal layer are shown in FIG. 6. The reason it is 0 for the lowest layer is that there is no picture in layer 0 that precedes any picture in decoding order but follows it in output order. For layer 1, picture B precedes picture C in decoding order but follows it in output order, and for layer 2, pictures B and C both precede picture D in decoding order but follow it in output order.
If a decoder knows that it will only decode temporal layer 0, it could potentially output picture B as soon as it has been decoded, but if the decoder decodes all layers it cannot. It could then have to wait until there are two decoded pictures that follow B in output order.
JCTVC-K0030_v3, Proposed Editorial Improvement for High efficiency video coding (HEVC) Text Specification Draft 8, B. Bross et al., JCT-VC of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11 11th Meeting, Shanghai, 10-19 Oct. 2012 as published on 12 Sep. 2012 discusses usage of max_num_reorder_pics in section 7.4.2.1 on page 62 and section 7.4.2.2 on page 64.
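The lowest possible max_num_reorder_pics values for the structure of FIG. 6 can be checked with a short computation. The sketch below assumes that, for temporal layer i, only pictures in layer i or lower are considered, and counts for each such picture how many of them precede it in decoding order but follow it in output order. The function name is hypothetical.

```python
def min_reorder_per_layer(decoding_order, output_order, temporal_id):
    """For each temporal layer i, return the largest number of pictures
    (in layer i or lower) that precede some picture in decoding order
    and follow that picture in output order."""
    out_pos = {name: i for i, name in enumerate(output_order)}
    result = []
    for layer in range(max(temporal_id.values()) + 1):
        kept = [p for p in decoding_order if temporal_id[p] <= layer]
        worst = 0
        for i, pic in enumerate(kept):
            # Pictures decoded earlier than `pic` but output later.
            count = sum(1 for earlier in kept[:i]
                        if out_pos[earlier] > out_pos[pic])
            worst = max(worst, count)
        result.append(worst)
    return result


decoding = ["A", "B", "C", "D", "E"]
output = ["A", "D", "C", "E", "B"]
layers = {"A": 0, "B": 0, "C": 1, "D": 2, "E": 2}
print(min_reorder_per_layer(decoding, output, layers))
```

For layer 2, the worst case is picture D, which is preceded in decoding order by pictures B and C that both follow it in output order.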
No_Output_of_Prior_Pics_Flag
Both the H.264 and the HEVC bitstream syntax specify a flag called no_output_of_prior_pics_flag. This flag is present in the slice header of random access pictures (RAP). Random access pictures are pictures from which it is possible to tune into a stream. They guarantee that decoding of future pictures can be done correctly if a decoder starts decoding from the random access point. The decoder does not have to be fed any data containing pictures that precede the random access picture in decoding order for tune-in to work.
The no_output_of_prior_pics_flag specifies how the previously-decoded pictures in the decoded picture buffer are treated after decoding of a random access picture. In short, if no_output_of_prior_pics_flag is equal to 1, no pictures in the DPB that are marked as “needed for output” should be output, but if no_output_of_prior_pics_flag is equal to 0 they should be output.
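The effect of the flag on the pictures waiting for output may be sketched as follows. This is a simplified illustration that assumes the waiting pictures are given as hypothetical (POC, name) pairs and that ignores the conditions under which a decoder may infer the flag value; the function name is not from either standard.

```python
def handle_random_access_picture(waiting, no_output_of_prior_pics_flag):
    """Return the names of the prior pictures that are output when a
    random access picture is encountered.

    `waiting` holds (poc, name) pairs of pictures in the DPB that are
    marked "needed for output"."""
    if no_output_of_prior_pics_flag == 1:
        return []  # prior pictures are discarded without being output
    # Flag equal to 0: the prior pictures are output in increasing POC order.
    return [name for _, name in sorted(waiting)]


waiting = [(2, "B"), (1, "A")]
print(handle_random_access_picture(waiting, 1))
print(handle_random_access_picture(waiting, 0))
```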
Consider FIG. 7, which shows an example where max_num_reorder_pics is 0 and picture C is a random access picture with no_output_of_prior_pics_flag equal to 1. In H.264, it would be possible to output picture B immediately after it has been decoded. This is not the case in the current HEVC specification, since the decoder does not know immediately after picture B has been decoded whether or not picture C is a RAP with no_output_of_prior_pics_flag equal to 1. If picture C is not such a picture, picture B could be output immediately after it has been decoded. But if picture C is indeed a RAP with no_output_of_prior_pics_flag equal to 1, picture B should not be output, since picture B is marked as “needed for output” when the slice header of picture C is decoded.
Since the picture output process in HEVC is done when the slice header is parsed and no_output_of_prior_pics_flag is an important feature, there is a higher output delay in the current HEVC standard than in H.264/AVC.
Information on the usage of no_output_of_prior_pics_flag is disclosed in section 7.4.7.1 on page 75 and section C.5.2 on page 26 in JCTVC-K0030_v3.
An advantage of using RPSs in HEVC is that the scheme is much more error resilient than the H.264/AVC method. Also, temporal scalability is more straightforward. A problem with the HEVC solution is that it introduces additional delay regarding picture output compared to H.264/AVC. In H.264/AVC, pictures can be output after a picture has been decoded. In HEVC, the decoder has to wait for the slice header of the next picture to be parsed before pictures can be output. This causes a delay.
Hence, there is a need to solve the shortcomings of the prior art video coding and in particular delay problems that may occur in the video coding of the prior art.