Scalable video coding refers to a coding structure in which one bitstream can contain multiple representations of the content at different bitrates, resolutions or frame rates. A scalable bitstream typically consists of a “base layer” providing the lowest quality video available and one or more enhancement layers that enhance the video quality when received and decoded together with the lower layers. To improve coding efficiency for an enhancement layer, the coded representation of that layer typically depends on the lower layers.
A scalable video codec for quality scalability (also known as signal-to-noise ratio or SNR scalability) and/or spatial scalability may be implemented as follows. For the base layer, a conventional non-scalable video encoder and decoder are used. The reconstructed/decoded pictures of the base layer are included in the reference picture buffer for an enhancement layer. In codecs using reference picture list(s) for inter prediction, the decoded base-layer pictures may be inserted into the reference picture list(s) for coding/decoding of an enhancement layer picture, in the same way as the decoded reference pictures of the enhancement layer. Consequently, the encoder may choose a base-layer reference picture as the inter prediction reference and indicate its use, typically with a reference picture index in the coded bitstream. The decoder determines from the bitstream, for example from a reference picture index, that a base-layer picture is used as the inter prediction reference for the enhancement layer.
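The reference picture list mechanism described above can be illustrated with a minimal Python sketch. The names `Picture`, `build_ref_list`, `signal_reference` and `resolve_reference` are hypothetical and do not come from any codec specification. The ordering by picture order count distance is an assumed simplification, and for spatial scalability the base-layer picture would normally be upsampled first, which is omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Picture:
    layer_id: int  # 0 = base layer, 1 or higher = enhancement layer
    poc: int       # picture order count (output order), assumed field


def build_ref_list(el_refs, bl_picture, current_poc):
    """Build a reference picture list for one enhancement-layer picture.

    Assumption: same-layer references are ordered by temporal distance,
    and the decoded base-layer picture is appended as one more entry.
    """
    ordered = sorted(el_refs, key=lambda p: abs(p.poc - current_poc))
    return ordered + [bl_picture]


def signal_reference(ref_list, chosen):
    """Encoder side: the index written to the bitstream for the chosen reference."""
    return ref_list.index(chosen)


def resolve_reference(ref_list, ref_idx):
    """Decoder side: map the decoded index back to a picture."""
    return ref_list[ref_idx]


# Enhancement-layer picture at POC 8 with two same-layer references
# and the base-layer picture of the same time instant.
el_refs = [Picture(layer_id=1, poc=0), Picture(layer_id=1, poc=4)]
bl_picture = Picture(layer_id=0, poc=8)

ref_list = build_ref_list(el_refs, bl_picture, current_poc=8)
ref_idx = signal_reference(ref_list, bl_picture)
decoded_ref = resolve_reference(ref_list, ref_idx)
```

Because the encoder and decoder build the list by the same rule, the index alone is enough for the decoder to recognise that a base-layer picture serves as the inter prediction reference.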
In addition to quality scalability, scalability can be achieved through spatial scalability, where base layer pictures are coded at a lower resolution than enhancement layer pictures; bit-depth scalability, where base layer pictures are coded at a lower bit-depth (e.g. 8 bits) than enhancement layer pictures (e.g. 10 or 12 bits); and chroma format scalability, where base layer pictures provide lower fidelity in chroma (e.g. coded in 4:2:0 chroma format) than enhancement layer pictures (e.g. 4:4:4 format).
In certain cases, it would be desirable to enhance only an area within the picture instead of an entire enhancement layer picture. However, if implemented in current scalable video coding solutions, such scalability would either incur excessive complexity overhead or suffer from reduced coding efficiency. For example, in bit-depth scalability where only an area within the video picture is targeted to be coded at a higher bit-depth, current scalable coding solutions nevertheless require the entire picture to be coded at the high bit-depth, thus drastically increasing the complexity. In the case of chroma format scalability, the reference memory for the entire picture would have to be in 4:4:4 format even if only a certain region of the image is enhanced, thus increasing the memory requirement.
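The memory argument can be made concrete with a rough calculation. The following Python sketch counts raw sample storage only. The helper `picture_bytes`, the 1920x1080 picture size and the 640x360 region are illustrative assumptions, as is storing samples deeper than 8 bits in 16-bit words. Padding and alignment are ignored.

```python
# Chroma samples (both chroma planes together) per 4 luma samples.
CHROMA_PER_4_LUMA = {"4:0:0": 0, "4:2:0": 2, "4:2:2": 4, "4:4:4": 8}


def picture_bytes(width, height, chroma_format, bit_depth):
    """Approximate raw storage for one reference picture or region.

    Assumptions: 1 byte per sample up to 8 bits, 2 bytes per sample
    above that, and no padding or alignment overhead.
    """
    bytes_per_sample = 1 if bit_depth <= 8 else 2
    samples = width * height * (4 + CHROMA_PER_4_LUMA[chroma_format]) // 4
    return samples * bytes_per_sample


full_420 = picture_bytes(1920, 1080, "4:2:0", 8)    # whole picture, base format
full_444 = picture_bytes(1920, 1080, "4:4:4", 8)    # whole picture, enhanced format
region_444 = picture_bytes(640, 360, "4:4:4", 8)    # enhanced region only
full_420_10bit = picture_bytes(1920, 1080, "4:2:0", 10)
```

Under these assumptions a full 4:4:4 reference picture takes 6,220,800 bytes, twice the 3,110,400 bytes of the 4:2:0 picture, whereas the 640x360 region on its own would take only 691,200 bytes in 4:4:4. Raising the bit-depth of the whole picture from 8 to 10 bits likewise doubles the storage in this model.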
It has been proposed to use a supplemental enhancement information (SEI) message to indicate restricted encoding for a set of tiles in a picture, where the motion compensation of the tiles is restricted so that samples outside the set of tiles are not utilized and the set of tiles represents an independently decodable region. While providing improved coding efficiency for enhancing only an area within a picture, the motion-constrained tile sets SEI message is limited to defining only intra-layer prediction dependencies.
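The motion restriction signalled for such a tile set can be sketched as a simple containment test. The following Python example is an illustration under simplifying assumptions, not the normative constraint: the tile set is modelled as a single rectangle, motion vectors are in whole samples, and the extra margin needed by fractional-sample interpolation filters is ignored. The names `Rect` and `mv_stays_inside` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x0: int  # left edge, inclusive
    y0: int  # top edge, inclusive
    x1: int  # right edge, exclusive
    y1: int  # bottom edge, exclusive


def mv_stays_inside(tile_set, block_x, block_y, block_w, block_h, mv_x, mv_y):
    """Return True if the referenced block lies entirely inside the tile set.

    Assumption: integer-sample motion vectors, so the reference block is
    the current block shifted by (mv_x, mv_y).
    """
    ref_x0 = block_x + mv_x
    ref_y0 = block_y + mv_y
    ref_x1 = ref_x0 + block_w
    ref_y1 = ref_y0 + block_h
    return (tile_set.x0 <= ref_x0 and tile_set.y0 <= ref_y0
            and ref_x1 <= tile_set.x1 and ref_y1 <= tile_set.y1)


# A 256x256 tile set and a 16x16 block in its bottom-right corner.
tile_set = Rect(0, 0, 256, 256)
zero_mv_ok = mv_stays_inside(tile_set, 240, 240, 16, 16, 0, 0)
inward_mv_ok = mv_stays_inside(tile_set, 240, 240, 16, 16, -8, -8)
outward_mv_ok = mv_stays_inside(tile_set, 240, 240, 16, 16, 1, 0)
```

An encoder honouring the restriction would only select motion vectors for which this kind of test passes, which is what allows the tile set to be decoded without samples from the rest of the picture. The test concerns references within the same layer only.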