Video coding and decoding using inter-picture prediction with motion compensation has been previously used. Uncompressed digital video can consist of a series of pictures, each picture having a spatial dimension of, for example, 1920×1080 luminance samples and associated chrominance samples. The series of pictures can have a fixed or variable picture rate (informally also known as frame rate) of, for example, 60 pictures per second or 60 Hz. Uncompressed video has significant bitrate requirements. For example, 1080p60 4:2:0 video at 8 bit per sample (1920×1080 luminance sample resolution at 60 Hz frame rate) requires close to 1.5 Gbit/s bandwidth. An hour of such video requires more than 600 GByte of storage space.
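The bandwidth and storage figures for uncompressed video can be checked with a short calculation. The following is a minimal sketch, assuming 1920×1080 luminance samples, 8 bits per sample, 60 pictures per second, and 4:2:0 sampling (two chrominance planes at quarter resolution, i.e. 0.5 chrominance samples per luminance sample). The function name `raw_bitrate_bps` and its parameters are illustrative only.

```python
def raw_bitrate_bps(width, height, fps, bit_depth, chroma_per_luma):
    """Bits per second of uncompressed video.

    chroma_per_luma is the number of chrominance samples per luminance
    sample: 0.5 for 4:2:0, 1.0 for 4:2:2, 2.0 for 4:4:4.
    """
    samples_per_picture = width * height * (1 + chroma_per_luma)
    return samples_per_picture * bit_depth * fps


bitrate = raw_bitrate_bps(1920, 1080, fps=60, bit_depth=8, chroma_per_luma=0.5)
bytes_per_hour = bitrate * 3600 / 8

print(f"{bitrate / 1e9:.2f} Gbit/s")            # 1.49 Gbit/s
print(f"{bytes_per_hour / 1e9:.0f} GByte/hour")  # 672 GByte/hour
```

Under these assumptions the result is about 1.49 Gbit/s and about 672 GByte per hour, consistent with "close to 1.5 Gbit/s" and "more than 600 GByte".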
One purpose of video coding and decoding can be the reduction of redundancy in the input video signal, through compression. Compression can help reduce the aforementioned bandwidth or storage space requirements, in some cases by two orders of magnitude or more. Both lossless and lossy compression, as well as a combination thereof, can be employed. Lossless compression refers to techniques where an exact copy of the original signal can be reconstructed from the compressed original signal. When using lossy compression, the reconstructed signal may not be identical to the original signal, but the distortion between the original and reconstructed signals is small enough to make the reconstructed signal useful for the intended application. In the case of video, lossy compression is widely employed. The amount of distortion tolerated depends on the application; for example, users of certain consumer streaming applications may tolerate higher distortion than users of television contribution applications. The compression ratio achievable can reflect that: higher allowable/tolerable distortion can yield higher compression ratios.
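The distinction between lossless and lossy compression can be illustrated with a toy example. The sketch below is not a video codec: it uses a synthetic 64×64 gradient as a stand-in picture, `zlib` as a generic lossless coder, and uniform scalar quantization as the only lossy step. The helper names `quantize`, `dequantize`, and `mse` are hypothetical.

```python
import zlib

# Synthetic 64x64 8-bit "picture" (a diagonal gradient), not real video data.
samples = bytes((2 * x + 3 * y) % 256 for y in range(64) for x in range(64))


def quantize(data, step):
    """Lossy step: map each sample to the index of its quantization bin."""
    return bytes(min(255 // step, (v + step // 2) // step) for v in data)


def dequantize(indices, step):
    """Reconstruct approximate sample values from quantization indices."""
    return bytes(min(255, i * step) for i in indices)


def mse(a, b):
    """Mean squared error between two equally long sample sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


# Lossless: decompression reproduces the input exactly.
assert zlib.decompress(zlib.compress(samples)) == samples

# Lossy: a larger quantization step gives a larger distortion.
for step in (4, 16):
    indices = quantize(samples, step)
    coded_size = len(zlib.compress(indices))
    distortion = mse(samples, dequantize(indices, step))
    print(f"step={step} coded_bytes={coded_size} mse={distortion:.2f}")
```

The printed coded sizes depend on the input and the coder, so no particular ratio is claimed here; the sketch only shows that the lossless round trip is exact while the quantized round trip trades distortion for a coarser representation.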
A video encoder and decoder can utilize techniques from several broad categories, including, for example, motion compensation, transform, quantization, and entropy coding, some of which will be introduced below.
Certain video codecs before H.264 such as, for example, MPEG-2 Visual used a hierarchy of transient headers, including a sequence header, group of pictures (GOP) header, picture header, and slice header. Syntax elements included in each header pertain to all underlying syntax structures. For example, syntax elements of the sequence header pertain to all GOPs included in the sequence, all pictures included in those GOPs, and all slices included in those pictures. Syntax elements of the GOP header pertain to all pictures included in the GOP, and all slices in the pictures. Such a hierarchical structure can lead to efficient coding but suboptimal error resilience properties. For example, if the vital information of a sequence header is lost in transmission, none of the GOPs, pictures, or slices of the sequence can be decoded.
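The error resilience consequence of such a header hierarchy can be sketched as a simple dependency check. The following is a simplified model, not MPEG-2 syntax: the `PARENT` table and the `decodable` function are hypothetical names, and each level is assumed to depend on exactly one enclosing header.

```python
# Each header level and the enclosing level whose syntax elements it inherits.
PARENT = {"slice": "picture", "picture": "gop", "gop": "sequence", "sequence": None}


def decodable(level, received):
    """Return True only if the header at this level and all enclosing
    headers are present in the set of received headers."""
    while level is not None:
        if level not in received:
            return False
        level = PARENT[level]
    return True


print(decodable("slice", {"sequence", "gop", "picture", "slice"}))  # True
print(decodable("slice", {"gop", "picture", "slice"}))              # False
```

In the second call the sequence header is missing, so the slice is reported as not decodable even though its own header and the GOP and picture headers arrived.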
Certain ITU and MPEG video codecs from 2003 onwards, namely H.264 and H.265, do not use transient headers above the slice header. Instead, they rely on parameter sets. On each syntactical level, such as sequence or picture level, one or more parameter sets may be received by the decoder from the bitstream or by external means. Which of these (potentially many) parameter sets of the same type is used for the decoding of a given sequence or picture depends on the reference coded in, for example, the slice header (for the picture parameter set, PPS) or the PPS (for the sequence parameter set, SPS). This architecture can have the advantage that the relevant parameter sets can be reliably sent even if the bitstream itself is sent over a lossy channel, or that the likelihood of their reception can be increased through the sending of redundant copies, potentially well in advance of their first use. One disadvantage can be that the sending of a parameter set can be more costly, in terms of bits required for the same number and types of syntax elements, than the sending of MPEG-2 style headers. Further, certain syntax elements that change frequently from picture to picture but stay constant within a given picture may, under this architecture, be included in the form of multiple redundant copies in each slice header. While doing so can make the slices independently decodable (at least from a parsing dependency and entropy decoding viewpoint), it can cost further bits.
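The reference chain from slice header to PPS to SPS can be sketched as a lookup in tables indexed by parameter set identifier. The following is a minimal illustration, not the H.264 or H.265 syntax: the classes, the field names (`sps_id`, `pps_id`, `init_qp`), and the `ParameterSetStore` helper are all hypothetical and carry only a few example fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SPS:
    sps_id: int
    width: int
    height: int


@dataclass(frozen=True)
class PPS:
    pps_id: int
    sps_id: int   # reference to the SPS this PPS relies on
    init_qp: int


@dataclass(frozen=True)
class SliceHeader:
    pps_id: int   # reference to the PPS this slice relies on


class ParameterSetStore:
    """Holds received parameter sets, indexed by their identifier."""

    def __init__(self):
        self.sps = {}
        self.pps = {}

    def receive(self, ps):
        # A redundant copy with the same ID simply overwrites the stored one.
        if isinstance(ps, SPS):
            self.sps[ps.sps_id] = ps
        else:
            self.pps[ps.pps_id] = ps

    def activate(self, slice_header):
        # Slice header -> PPS -> SPS; raises KeyError if either is missing.
        pps = self.pps[slice_header.pps_id]
        return self.sps[pps.sps_id], pps


store = ParameterSetStore()
for ps in (SPS(0, 1920, 1080), SPS(1, 1280, 720),
           PPS(0, 0, 26), PPS(1, 1, 30),
           SPS(1, 1280, 720)):  # redundant copy, sent again for robustness
    store.receive(ps)

sps, pps = store.activate(SliceHeader(pps_id=1))
print(sps.width, sps.height, pps.init_qp)  # 1280 720 30
```

Because the parameter sets are stored independently of the slices that refer to them, they can arrive by any means and at any time before first use, and a lost copy can be covered by a redundant one.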
During the design of H.264, the independent decodability of slices was considered a major design goal, for error resilience reasons. Since 2003, however, improvements in the network architectures used for conveying coded video, as well as advances in the prediction mechanism, have made the independent decodability of slices considerably less attractive, as the concealment of a lost slice has become less and less effective.