Video coding refers to techniques in which a series of uncompressed pictures is converted into a compressed video bitstream. Video decoding refers to the inverse process. Many standards exist that specify techniques for image and video decoding operations, such as ITU-T Rec. H.264 “Advanced video coding for generic audiovisual services”, March 2010, available from the International Telecommunication Union (“ITU”), Place des Nations, CH-1211 Geneva 20, Switzerland or http://www.itu.int/rec/T-REC-H.264, and incorporated herein by reference in its entirety, or High Efficiency Video Coding (HEVC) (B. Bross et al., “High Efficiency Video Coding (HEVC) text specification draft 9”, December 2012, available from http://phenix.int-evry.fr/jct/doc_end_user/documents/11_Shanghai/wg11/JCTVC-K1003-v13.zip, referred to as “WD9” henceforth, which is incorporated herein by reference in its entirety).
Layered video coding, also known as scalable video coding, refers to video coding techniques in which the video bitstream can be separated into two or more sub-bitstreams, called layers. Layers can form a layer hierarchy, whereby a base layer can be decoded independently, and enhancement layers can be decoded in conjunction with the base layer and/or lower enhancement layers.
Some video decoding standards, such as H.264 or HEVC, utilize a profile/level system to signal in the bitstream the capabilities a decoder must possess to decode the bitstream. Profiles typically refer to a selection of coding technologies (known as “tools”) specified in the video coding standards, whereas levels typically refer to a requirement of decoding a certain number of pixels, blocks, macroblocks, treeblocks, coding units, or similar units, per second. Therefore, levels can express the capability of a decoder to decode a bitstream up to a given (uncoded) picture size at a certain frame rate. Profiles and levels can be specified in a video coding standard such as H.264 or HEVC, in application standards, or can be agreed upon by vendors outside a standards process.
H.264 includes in its Annex G an extension to support layered coding, known as Scalable Video Coding or SVC. Annex H includes a multiview extension henceforth referred to as Multiview Video Coding or MVC. H.264 without enabled annexes G or H is referred to as AVC.
In SVC, multiple spatial, quality, or temporal layers may be coded, and a layer may be coded dependently upon another layer. The base layer is independent of any other layers, and is backwards compatible with AVC. SVC can use single-loop decoding for inter coded macroblocks, and multi-loop decoding for intra coded macroblocks.
In MVC, multiple views may be coded, and a view may be coded dependently upon another view. The base view is independent of any other view, and is backwards compatible with AVC. MVC uses multi-loop decoding, where if view A is a reference for view B, both view A and view B must be decoded in order to output view B.
H.264 includes sequence parameter sets, which contain information related to all of the coded pictures in a video sequence. Within the sequence parameter set are syntax elements for profile and level indicators. Similarly, in SVC and MVC, the subset sequence parameter set has syntax elements for profile and level indicators. Subset sequence parameter sets are used in non-base layers or views, while sequence parameter sets are used in the base layer or view.
The SVC and MVC extensions provide mechanisms for sub-bitstream extraction of a target layer representation or view representation, whose output is a valid coded video bitstream including the NAL units associated with the target layer representation itself, as well as all layers with lower or equal values of the target dependency_id, quality_id, temporal_id, and priority_id.
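The extraction rule described above can be illustrated with a minimal sketch. It assumes that each NAL unit has already been parsed into a record carrying its four identifiers; the names NalUnit and extract_sub_bitstream are hypothetical, and the sketch omits the further rules of the extraction process specified in H.264.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NalUnit:
    """Hypothetical, already-parsed NAL unit header fields (not an H.264 API)."""
    dependency_id: int
    quality_id: int
    temporal_id: int
    priority_id: int


def extract_sub_bitstream(nal_units, target):
    """Keep NAL units whose identifiers are all lower than or equal to the target's.

    Simplified illustration only: a real extractor must follow the complete
    sub-bitstream extraction process of the standard.
    """
    return [
        nal for nal in nal_units
        if nal.dependency_id <= target.dependency_id
        and nal.quality_id <= target.quality_id
        and nal.temporal_id <= target.temporal_id
        and nal.priority_id <= target.priority_id
    ]


# Usage: three NAL units, extraction targets dependency_id 1, temporal_id 0.
stream = [NalUnit(0, 0, 0, 0), NalUnit(1, 0, 0, 0), NalUnit(1, 0, 1, 0)]
kept = extract_sub_bitstream(stream, NalUnit(1, 0, 0, 0))
```

In this usage, the third NAL unit is dropped because its temporal_id exceeds the target value.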
In H.264, in each coded slice header there is a picture parameter set id syntax element, which refers to the picture parameter set (PPS). The PPS contains parameters which stay constant for the whole coded picture, but may change between two pictures. One syntax element in the PPS is an index to the sequence parameter set id, which refers to a sequence parameter set (SPS). All coded slices in the same layer in SVC or same view in MVC, throughout the coded video sequence, refer to the same SPS or subset sequence parameter set.
The sequence parameter sets can contain information about image resolution, video usability information, etc., as well as profile and level indicators. It is allowable for more than one view in MVC to refer to the same sequence parameter set. Similarly, it is allowable for more than one SVC layer to refer to the same sequence parameter set.
H.264 places various restrictions on compliant coded bitstreams through its profile and level indicators. Profile and level indicators can specify a conformance point, and the presence of profile and level information in a bitstream can allow a decoder or Media Aware Network Element (MANE) to determine if it has the capability to decode or otherwise process a particular bitstream. Profiles generally specify the set of supported coding tools, while levels generally specify constraints that impact computational demands.
With respect to levels, H.264 provides a table mapping each allowable level_id value to constraints on parameters, such as maximum picture size, bitrate, and macroblock throughput. In particular, the macroblock throughput limit restricts the maximum number of macroblocks per second, or MaxMBPS. As the size of a macroblock is 16×16 samples, MaxMBPS is closely related to the pixel rate per second, except that the MaxMBPS calculation considers that each coded picture must contain an integer number of macroblocks, and hence the vertical and horizontal resolutions must be rounded up to the nearest macroblock size.
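The rounding described above can be sketched as follows. The function name pic_size_in_mbs and its parameters are chosen for illustration only; the sketch merely counts the 16×16 macroblocks needed to cover a picture of a given resolution.

```python
def pic_size_in_mbs(width, height, mb_size=16):
    """Number of macroblocks covering a width x height picture.

    Each dimension is rounded up to a whole number of macroblocks,
    using integer ceiling division.
    """
    mbs_wide = -(-width // mb_size)
    mbs_high = -(-height // mb_size)
    return mbs_wide * mbs_high


# Usage: 1080 is not a multiple of 16, so the height rounds up to 68 macroblocks.
mbs_1080p = pic_size_in_mbs(1920, 1080)   # 120 * 68 macroblocks
samples_covered = mbs_1080p * 16 * 16     # exceeds 1920 * 1080
```

For a 1920×1080 picture, the macroblock count therefore corresponds to slightly more samples than the nominal pixel count, which is why MaxMBPS is closely related, but not identical, to the pixel rate.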
For the (single layer/view) AVC profiles, including those used as an SVC base layer or MVC base view, assuming a fixed frame rate, FrameRate, the maximum MB throughput is restricted such that level limit MaxMBPS>=PicSizeInMbs*FrameRate. Note that the description of the constraint of the level limit in the standards document does not assume a fixed frame rate, and is expressed as a limit on the minimum output time between frames. The above equation is a simplification (assuming fixed frame rates) of H.264's description (which allows for variable frame rates).
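Under the fixed frame rate assumption, the simplified constraint can be checked as in the following sketch. The function name and the numeric limit used in the usage lines are illustrative assumptions; actual limits are to be taken from the level table of H.264.

```python
def satisfies_mb_throughput(max_mbps, pic_size_in_mbs, frame_rate):
    """Simplified single-layer check: MaxMBPS >= PicSizeInMbs * FrameRate.

    Assumes a fixed frame rate; H.264 itself expresses the limit in terms
    of the minimum output time between frames.
    """
    return max_mbps >= pic_size_in_mbs * frame_rate


# Usage with an illustrative limit (assumed value, see the H.264 level table):
ILLUSTRATIVE_MAX_MBPS = 108000
PIC_SIZE_IN_MBS_720P = 80 * 45  # 1280x720 picture, 16x16 macroblocks

ok_at_30 = satisfies_mb_throughput(ILLUSTRATIVE_MAX_MBPS, PIC_SIZE_IN_MBS_720P, 30)
ok_at_60 = satisfies_mb_throughput(ILLUSTRATIVE_MAX_MBPS, PIC_SIZE_IN_MBS_720P, 60)
```

With these assumed numbers, the constraint holds at 30 frames per second and is violated at 60 frames per second.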
For the profiles associated with SVC and MVC, where multiple scalable layers or views are to be decoded, the interpretation of the max MB per second throughput is modified, based upon the number of layers or views, as described below.
In the SVC extension, in a subset sequence parameter set for a non-base layer, the level limit expresses a constraint of the maximum MB throughput MaxMBPS>=svcPicSizeInMbs*FrameRate, where the value of svcPicSizeInMbs is based on the number of layers and the picture size of the active layer and its reference layers, again under the assumption of a fixed frame rate.
Referring to FIG. 1, shown is a layer hierarchy with a base layer (101), two spatial or SNR enhancement layers (102) and (103) that use the base layer (101) as their reference layer, and a third spatial or SNR enhancement layer (104) that uses the base layer (101) and enhancement layer (102) as its reference layers. Each of the layers (101) through (104), according to H.264, has an associated level (105-108, respectively) that is coded as the level_id field in the sequence parameter set. The level (105) associated with the base layer (101) can indicate the computational demand of the base layer in isolation, expressed by referring to the level table specified in H.264. In particular, according to H.264, the level_id coded for the base layer (101) can be chosen by the encoder such that all coding parameters associated with that level (such as maximum picture size, macroblock per second throughput, and so on) are larger than or equal to the requirements for decoding the bitstream according to that level.
The levels (106-108) for the enhancement layers (102-104), according to H.264, can be coded such that the computational requirements associated with the coded level are larger than the computational requirements for decoding the enhancement layer (102-104) in question, and all its reference layers, in combination. For example, the level indicator (108) for enhancement layer (104) is chosen such that, for all computational requirements indicated through the level indicator (108), the respective computational complexity is larger than the computational complexity required to decode layer (104) and its reference layers (102) and (101) in combination. In FIG. 1, this is shown by the dashed line (110) surrounding layers (101), (102), and (104). Similarly, level indicator (107), coded in the sequence parameter set of enhancement layer (103), can be chosen such that the computational demands for decoding enhancement layer (103) and base layer (101) in combination are lower than what is indicated in the level indicator (107). This is shown by the punctuated line (111) around layers (101) and (103).
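The grouping of layers indicated by lines (110) and (111) of FIG. 1 can be reproduced with a short sketch that collects a target layer together with its direct and indirect reference layers. The per-layer demand figures are hypothetical, and summing them is a simplifying assumption made for illustration; H.264 derives the combined requirement through svcPicSizeInMbs, whose derivation is not reproduced here.

```python
# Reference structure of FIG. 1: layer -> layers it directly references.
REFERENCE_LAYERS = {101: [], 102: [101], 103: [101], 104: [101, 102]}


def layers_needed(target, reference_layers):
    """Return the target layer plus all direct and indirect reference layers."""
    needed = set()
    pending = [target]
    while pending:
        layer = pending.pop()
        if layer not in needed:
            needed.add(layer)
            pending.extend(reference_layers[layer])
    return needed


def combined_demand(target, reference_layers, demand_per_layer):
    """ASSUMPTION: per-layer demands simply add up (illustration only)."""
    return sum(demand_per_layer[layer]
               for layer in layers_needed(target, reference_layers))


# Hypothetical per-layer demands in macroblocks per second.
DEMAND = {101: 11880, 102: 47520, 103: 47520, 104: 108000}

group_110 = layers_needed(104, REFERENCE_LAYERS)   # layers inside line (110)
group_111 = layers_needed(103, REFERENCE_LAYERS)   # layers inside line (111)
demand_104 = combined_demand(104, REFERENCE_LAYERS, DEMAND)
```

Under these assumptions, the level coded for layer (104) would have to cover the combined demand of layers (101), (102), and (104), whereas the level coded for layer (103) would only have to cover layers (101) and (103).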
In the MVC extension, in a subset sequence parameter set for a non-base view, the level limit expresses a constraint of the maximum MB throughput MaxMBPS>=(NumViews/2)*PicSizeInMbs*FrameRate, where NumViews refers to the number of views required for decoding the target output view, once more under the assumption of a fixed frame rate.
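The MVC constraint can be evaluated as in the following sketch, again under the assumption of a fixed frame rate. The function name is chosen for illustration, and the picture size and frame rate in the usage lines are example values only.

```python
def required_mvc_throughput(num_views, pic_size_in_mbs, frame_rate):
    """Right-hand side of MaxMBPS >= (NumViews/2) * PicSizeInMbs * FrameRate."""
    return num_views * pic_size_in_mbs * frame_rate / 2


# Usage: identical resolution (3600 macroblocks) and frame rate (30),
# differing only in the number of views needed for the target output view.
need_two_views = required_mvc_throughput(2, 3600, 30)
need_four_views = required_mvc_throughput(4, 3600, 30)
```

Because the required throughput grows with the number of views, views with otherwise identical picture size and frame rate can require different level indicators.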
In MVC, because the MB throughput limit is based upon the number of views, multiple views whose SPS parameter values are otherwise identical, e.g. with the same image resolution and VUI data, can differ in their level indicator value. If they do, they can refer to different SPS ids, because, in H.264, each SPS can contain only a single level indicator. The alternative, referring to the same SPS with a level indicator sufficiently high to indicate computational resources sufficient to decode all views (even if, for example, only the base view is to be decoded), may be problematic. For example, if one were coding a level indicator higher than necessary for the base view, that base view (which, coincidentally, can be a fully conformant AVC bitstream suitable for non-multiview decoding) potentially might not be decoded on devices that have sufficient computational resources for it, because the level indicator indicates a higher level (to accommodate multiple views) than necessary for single view decoding.
For both SVC and MVC, using one SPS for each layer or view can be inefficient for several reasons. To explain those reasons, the parameter set referencing mechanism of H.264 is first briefly described. Referring to FIG. 2, shown is the relationship between the slice header, PPS, and SPS. The slice header (201) can contain a variable length (Exp-Golomb) coded field (202) that indicates the PPS (203) to be used. For a PPS (203) with the ID 0, that field (202) is 1 bit in length. For values 1 or 2, the field is 3 bits in length. For values of 3 and larger, it is at least 5 bits in length. Within the PPS, there can be an indication (204) indicating an SPS (205). Inside the SPS (205), there can be a self-reference (206) (which can be used to identify the SPS during its transmission). Note that a video stream can contain many SPSs and PPSs, and, on a per picture or per video sequence basis, the encoder can switch them by coding the appropriate PPS ID (202) in the slice header (201).
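The codeword lengths stated above follow from the structure of unsigned Exp-Golomb codes, in which a value v is written as the binary representation of v+1 preceded by as many zero bits as that representation has bits after its leading one. A minimal sketch, with function names chosen for illustration:

```python
def ue_encode(value):
    """Unsigned Exp-Golomb codeword for a non-negative integer, as a bit string."""
    if value < 0:
        raise ValueError("value must be non-negative")
    info = bin(value + 1)[2:]              # binary representation of value + 1
    return "0" * (len(info) - 1) + info    # zero prefix, then the info bits


def ue_length(value):
    """Length in bits of the unsigned Exp-Golomb codeword for value."""
    return len(ue_encode(value))


# Usage: codeword length for parameter set IDs 0 through 7.
lengths = [ue_length(pps_id) for pps_id in range(8)]
```

The lengths grow in steps (1 bit for ID 0, 3 bits for IDs 1 and 2, 5 bits for IDs 3 through 6, and so on), so a bitstream that uses many distinct parameter set IDs spends more bits per reference on average.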
As to the first reason, in H.264 it may be required to include multiple SPSs (207) in the bitstream (or send them out of band) that may differ only by the level indicator (208). However, many other syntax elements of the SPS can also be necessary to comply with the standard. This can result in redundant transmission of potentially many SPS syntax elements, with the resulting impact on coding efficiency, to enable signaling different levels for different layers or views. Second, the SPS is not directly referred to from the slice header (or other coded picture high level syntax elements), but rather through one level of indirection: the slice header refers to a PPS, and the PPS refers to the SPS. In order to reference the appropriate SPS, there should be at least one PPS including the reference to the respective SPS. That PPS, in turn, may also differ from other PPSs (209) only by the PPS ID, which may be different because there is a need for a different SPS to signal a different level, as described above. As a result, there may not only be a need for multiple SPSs containing potentially many redundant values as described above, but also for many PPSs, likewise with many redundant parameters. Third, in order to signal different PPSs in the slice header, the average length of the (variable length) codeword used for signaling the PPS ID (202) can be longer when more different PPS IDs need to be signaled. Fourth, the codeword in the PPS referring to the SPS (204) is also variable length coded and can be longer when many SPSs are needed. And fifth, the same applies for the self references (206) and (210) inside the SPS (205) and PPS (203), respectively.
SVC includes a scalability information SEI message (SSEI message). According to H.264 or HEVC, a decoder is not required to decode and act upon substantially all SEI messages, including the SSEI message, though not decoding and acting on an SEI message may negatively impact user experience. However, a decoder may use information available in an SSEI message found in the bitstream for mechanisms such as resource management, and can rely on the values included in the SEI message as being correct. The SSEI message provides, among other things, information about the number of layers present in the coded video sequence. For each of those layers, it can provide, directly or indirectly, a mapping of a layer id value to the priority, dependency, quality and temporal id values, which, in combination, can describe the position of the layer in the hierarchy of layers, as well as many other parameters describing each layer. Some of these additional parameters are optional even within the SEI message, including profile and level information, and average bitrate. The profile and level information in the SSEI message can indicate decoding capability for the sub-bitstream associated with the target layer representation identified by the layer_id value. The level limits can be interpreted in the same manner as if the same level indicator value were included in a sequence parameter set. While, when used in this way, the SSEI message includes enough information to allow a decoder to obtain profile and level information for each scalable layer, the aforementioned potentially redundant copies of PPS and SPS may still be necessitated at the decoder (with the resulting negative effects for coding efficiency) for compliance with H.264.
Similarly, in MVC, the view scalability information SEI message provides information about the number of views present in the coded video sequence, and optionally provides profile and level information for the sub-bitstream associated with a target view representation.
Similar to H.264, HEVC has profile and level indicator syntax elements in the sequence parameter set. The level limits are based directly on the pixel rates (in contrast to H.264's MB rates) but otherwise the functionality is comparable. Table 1 shows the maximum pixel rate and picture size for levels according to HEVC. Again assuming a fixed frame rate, there is a restriction on pixel throughput, such that the level limit pixel throughput MaxLumaPR>=PicSizeLuma*FrameRate, where PicSizeLuma refers to the size of the luma component of the picture in pixels:
TABLE 1
Level    Pixel rate       Picture size
1        552,960          36,864
2        3,686,400        122,880
3        13,762,560       458,752
3.1      33,177,600       983,040
4        62,668,800       2,088,960
4.1      62,668,800       2,088,960
4.2      133,693,440      2,228,224
4.3      133,693,440      2,228,224
5        267,386,880      8,912,896
5.1      267,386,880      8,912,896
5.2      534,773,760      8,912,896
6        1,002,700,800    33,423,360
6.1      2,005,401,600    33,423,360
6.2      4,010,803,200    33,423,360
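As an illustration of how the limits of Table 1 can be applied, the following sketch selects the lowest level whose maximum pixel rate and maximum picture size both accommodate a given picture size and fixed frame rate. The limit values are copied from Table 1; the function name and the restriction to these two limits are simplifying assumptions, as a level can impose further constraints.

```python
# (level, maximum pixel rate, maximum picture size), copied from Table 1.
HEVC_LEVELS = [
    ("1", 552_960, 36_864), ("2", 3_686_400, 122_880),
    ("3", 13_762_560, 458_752), ("3.1", 33_177_600, 983_040),
    ("4", 62_668_800, 2_088_960), ("4.1", 62_668_800, 2_088_960),
    ("4.2", 133_693_440, 2_228_224), ("4.3", 133_693_440, 2_228_224),
    ("5", 267_386_880, 8_912_896), ("5.1", 267_386_880, 8_912_896),
    ("5.2", 534_773_760, 8_912_896), ("6", 1_002_700_800, 33_423_360),
    ("6.1", 2_005_401_600, 33_423_360), ("6.2", 4_010_803_200, 33_423_360),
]


def lowest_level(pic_size_luma, frame_rate):
    """Lowest level satisfying MaxLumaPR >= PicSizeLuma * FrameRate and the
    picture size limit; None if no level in the table is sufficient.

    Simplification: only the two limits listed in Table 1 are checked.
    """
    for level, max_pixel_rate, max_picture_size in HEVC_LEVELS:
        if (pic_size_luma <= max_picture_size
                and pic_size_luma * frame_rate <= max_pixel_rate):
            return level
    return None


# Usage: a 1920x1080 luma picture at two different fixed frame rates.
level_1080p30 = lowest_level(1920 * 1080, 30)
level_1080p60 = lowest_level(1920 * 1080, 60)
```

Doubling the frame rate leaves the picture size unchanged but doubles the pixel rate, which moves the example from level 4 to level 4.2 under the limits of Table 1.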
For scalable and multiview extensions in HEVC, J. Boyce et al., “High level syntax hooks for future extensions”, January 2012, JCTVC-H0388, available from http://phenix_it-sudparis.eu/jct/doc_end_user/current_document.php?id=4691 and incorporated herein by reference in its entirety, discloses treating scalable layers and multiview views similarly; that is, as layers. A layer can be identified by its layer_id value. The slice header of slices belonging to a given layer can contain a syntax element to reference a picture parameter set id. Included in the picture parameter set is a syntax element serving as a reference to a sequence parameter set.
In J. Boyce et al., “Extensible High Layer Syntax for Scalability”, JCTVC-E279, March 2011, available from http://phenix.int-evry.fr/jct/doc_end_user/current_document.php?id=2209 and incorporated herein by reference in its entirety, the sequence parameter set can include a dependency layer parameter set applicable to all layers of a coded video sequence, which can contain additional information about each layer. Such information can include an indication of the reference layer(s) required to decode a particular layer. Temporal layers are considered to be sub-layers and not (full) layers in the aforementioned sense. The result of a sub-bitstream extraction for a target layer can be a sub-bitstream called a layer set, which can contain the NAL units associated with the particular (target) layer as well as the NAL units of all reference layers required for decoding. The extracted layer set is itself a compliant bitstream. Sub-bitstream extraction can consider both a target layer and a target temporal id, resulting in a sub-layer set, which is itself also a compliant bitstream.
A straightforward integration of the aforementioned proposals into HEVC can lead to deficits similar to the ones mentioned in the context of MVC. Specifically, the coding of a level indicator in the SPS can lead to coding inefficiency due to the need for many SPSs differing only slightly (for example, only by the level ID syntax element) from each other, which may lead to a need for many PPSs, unnecessarily long variable length codewords in the slice header to reference those multiple PPSs, and resulting coding inefficiency.
A need therefore exists for improved techniques for level signaling in layered coding.