High Efficiency Video Coding (HEVC) is a next-generation video coding standard that is currently under standardization. HEVC aims to substantially improve coding efficiency compared to state-of-the-art video coding, such as H.264/AVC (also known as MPEG-4 AVC), in particular for high-resolution video content.
The initial focus of the HEVC standardization is on mono video, i.e., one camera view. However, given the relevance of multi-resolution and multi-view 3D representations, extensions towards scalable coding and multi-view video as well as depth-map coding are planned or ongoing. Those extensions require multi-layer support.
An HEVC bitstream without extensions can be considered a single-layer bitstream, i.e., it represents the video in a single representation, e.g., as a single view with a single resolution and a single quality. In multi-layer extensions, an HEVC single-layer bitstream is typically included as a “base layer”. In multi-view 3D extensions, additional layers may represent additional video views captured from different camera positions, depth information, or other information. In scalability extensions, additional layers may represent the video at additional, higher video picture resolutions, with higher pixel fidelity, in alternative color spaces, or the like, providing improved video quality in comparison to the base layer.
HEVC uses a video packetization concept referred to as the Network Abstraction Layer (NAL) unit concept. A compressed video bitstream consists of a sequence of NAL units representing a coded video sequence. Each NAL unit can carry coded video data, so-called Video Coding Layer (VCL) data, or parameter data needed for decoding, so-called Parameter Sets (PS), or supplementary data, so-called Supplementary Enhancement Information (SEI). Each NAL unit consists of a NAL unit header and a NAL unit payload. The NAL unit header consists of a set of identifiers that can be used by networks to manage the compressed bitstreams. For example, in order to reduce the transmission bitrate of a video when bandwidth in the network is limited, some NAL units can be discarded based on information carried in the NAL unit headers, in such a way that the quality degradation caused by discarding is minimized. This process is referred to as “bitstream thinning”.
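Bitstream thinning as described above can be illustrated with a minimal Python sketch. The `NalUnit` class, its `kind` and `layer_id` fields, and the function `thin_bitstream` are hypothetical names chosen for illustration and are not HEVC syntax elements; the sketch assumes that the header carries a layer identifier, with 0 denoting the base layer, and that a network node may drop all units above a chosen layer by looking at the header only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NalUnit:
    kind: str       # "VCL", "PS", or "SEI" (type of data carried in the payload)
    layer_id: int   # hypothetical header identifier; 0 is assumed to be the base layer
    payload: bytes  # opaque to the network node, never inspected here


def thin_bitstream(units: List[NalUnit], max_layer_id: int) -> List[NalUnit]:
    """Keep only NAL units whose header layer_id does not exceed max_layer_id."""
    return [u for u in units if u.layer_id <= max_layer_id]


bitstream = [
    NalUnit("PS", 0, b"base parameter set"),
    NalUnit("VCL", 0, b"base slice"),
    NalUnit("PS", 1, b"enhancement parameter set"),
    NalUnit("VCL", 1, b"enhancement slice"),
]

# Under limited bandwidth, forward the base layer only.
thinned = thin_bitstream(bitstream, max_layer_id=0)
```

The decision uses the header field alone, which is the property that lets a network element thin a bitstream without decoding any payload.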
Parameter sets are syntax structures containing parameters needed in the decoding process, such as the decoder profile (i.e., the mode of operation specifying the supported decoding algorithms) and level (specifying implementation limits such as maximum supported picture size, frame rate, and bit rate), the video picture dimensions (width and height of the video picture), and parameters related to configuration of algorithms and settings which are necessary for decoding the compressed bitstream. Several different types of parameter sets exist, in particular Sequence Parameter Sets (SPS), Picture Parameter Sets (PPS), and Adaptation Parameter Sets (APS). Introduction of a further parameter set, the Video Parameter Set (VPS), has been considered.
The SPS contains parameters that change very infrequently and are typically valid for a complete video sequence. The PPS contains parameters that may change more frequently than SPS parameters, but typically not very frequently. The APS contains information that typically changes frequently, e.g., with every coded picture. In the envisioned scalability and 3D extensions to HEVC, it is likely that these PS concepts will be re-used, and PSs will be present in different layers. In that context, the VPS has been proposed to contain information which applies identically to several or all layers and changes infrequently. Parameter sets typically have an identifier (PS ID) by which they can be referred to. Further parameter sets, such as Group Parameter Sets (GPS), are under discussion.
In the HEVC decoding process, PSs are “activated” when they are referred to by NAL units that contain coded slices, i.e., coded video data. When a PS is active, the values of its syntax elements, i.e., the parameters comprised in the PS, are accessible by the decoder in the decoding process. The known activation mechanisms for PSs are outlined in the following:
A PPS which is referenced in a slice header, i.e., by a parameter field in a coded slice, is activated when the coded slice is decoded. Zero or one PPS can be active at a time.
SPSs are referenced by PPSs. When a PPS is activated, the referenced SPS is activated, too. Zero or one SPS can be active at a time.
APSs which are referenced in a slice header are activated when the slice is decoded, similar to PPSs.
A VPS (not in the current HEVC draft, but under discussion) is activated when an SPS comprising a reference to the VPS is activated.
Alternatively, a GPS, which has been proposed, would replace the activation processes for APS, PPS, and SPS. A GPS would be activated if a slice having a reference to the GPS in its header is decoded. The GPS may include references to a PPS, an SPS, zero, one, or several APSs, and potentially a VPS. When the GPS is activated, other PSs referenced in the GPS may be activated, too.
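The chained activation of PPS, SPS, and VPS outlined above can be sketched in Python as follows. This is a simplified illustration, not the normative decoding process: the classes `Pps`, `Sps`, and `Decoder`, and all field and method names, are hypothetical. APS and GPS handling is omitted, and the VPS is represented only by its identifier.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Sps:
    sps_id: int
    vps_id: Optional[int] = None  # reference to a VPS, if one is used


@dataclass
class Pps:
    pps_id: int
    sps_id: int  # each PPS references exactly one SPS


class Decoder:
    def __init__(self) -> None:
        self.sps_store: Dict[int, Sps] = {}
        self.pps_store: Dict[int, Pps] = {}
        # Zero or one parameter set of each type is active at a time.
        self.active_pps: Optional[Pps] = None
        self.active_sps: Optional[Sps] = None
        self.active_vps_id: Optional[int] = None

    def store_sps(self, sps: Sps) -> None:
        # Receiving a parameter set stores it but does not activate it.
        self.sps_store[sps.sps_id] = sps

    def store_pps(self, pps: Pps) -> None:
        self.pps_store[pps.pps_id] = pps

    def decode_slice(self, pps_id: int) -> None:
        """Activate the PPS named in the slice header, then follow the references."""
        self.active_pps = self.pps_store[pps_id]
        self.active_sps = self.sps_store[self.active_pps.sps_id]
        self.active_vps_id = self.active_sps.vps_id


decoder = Decoder()
decoder.store_sps(Sps(sps_id=0, vps_id=0))
decoder.store_pps(Pps(pps_id=3, sps_id=0))
decoder.decode_slice(pps_id=3)  # the slice header references PPS 3
```

In this sketch, nothing is active until a slice is decoded; the slice header names only the PPS, and the SPS and VPS become active indirectly through the reference chain.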
In the HEVC 3D extension test model under consideration in the Moving Picture Experts Group (MPEG), several video and depth views can be included in a coded video sequence, or bitstream, where each video view and each depth view is associated with a separate SPS. Thus, for the case of a 2-view (3-view) video and depth representation, a total of four (six) SPSs need to be sent for each random access point, i.e., a point in the bitstream where decoding typically starts. The SPSs associated with the different video and depth views are highly similar, since the video dimensions are identical and typically the same or almost the same set of coding algorithms is used across views. Duplicating this information in several SPSs introduces unnecessary redundancy, typically amounting to around one hundred to several hundred bits per SPS.
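The SPS counts given above follow from one SPS per video view plus one per depth view. The short Python sketch below reproduces that arithmetic and gives a rough upper bound on the duplicated bits per random access point. The function names are hypothetical, the value of 100 bits per SPS is only the lower end of the range mentioned above, and the bound assumes, for illustration only, that all SPSs are fully identical so that all but one of them are redundant.

```python
def sps_count(num_views: int) -> int:
    """One SPS for each video view and one for each depth view."""
    return 2 * num_views


def duplicated_bits_upper_bound(num_views: int, bits_per_sps: int) -> int:
    """Bits spent on SPSs beyond the first one, assuming all SPSs are identical."""
    return (sps_count(num_views) - 1) * bits_per_sps


two_view_sps = sps_count(2)    # four SPSs for 2-view video and depth
three_view_sps = sps_count(3)  # six SPSs for 3-view video and depth

# Illustrative bound with an assumed 100 bits per SPS.
two_view_bound = duplicated_bits_upper_bound(2, bits_per_sps=100)
three_view_bound = duplicated_bits_upper_bound(3, bits_per_sps=100)
```

With the assumed 100 bits per SPS, the bound is 300 bits for two views and 500 bits for three views per random access point; the actual redundancy is lower whenever some syntax elements differ between SPSs.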
One approach for reducing redundancy, and thereby overhead, in signaling parameter sets is to re-use parameter sets, such as SPSs, in several layers. The effectiveness of such an approach is limited, however, since, even though many SPS syntax elements typically have identical values across layers, some SPS syntax elements still have different values, e.g., the syntax elements profile_idc and level_idc, which indicate the decoder profile and level, respectively. For a 2-view video and depth representation, it is likely that at least three different profiles will be signaled, e.g., an HEVC main profile associated with the base video view (i.e., the base layer), a stereoscopic 3D profile associated with the base view and enhancement video views, and a 2-view video and depth 3D profile, each of them being associated with a different value of profile_idc. Similarly, for 3-view video and depth representations, it may be desirable to include the above-mentioned three profiles in order to support mono-capable (i.e., single-layer video) decoders and stereo-capable (video only, as well as video and depth) decoders, and to add at least one additional profile for 3-view video and depth. Thus, a total of at least four different values of profile_idc have to be signaled. Even if it is not necessary to signal a new profile with each layer, it may be desirable to signal different level requirements for the different layers. For the same reasons, it is likely that spatial and/or fidelity scalability extensions will associate different profiles and/or levels with the different layers, with many SPS syntax elements being identical across layers while others are not. Thus, even when re-using SPSs across layers, redundancy will remain.
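Why whole-SPS re-use falls short can be seen by comparing the syntax elements of several layers. The Python sketch below is purely illustrative: each SPS is modeled as a dictionary, `profile_idc` and `level_idc` are the syntax elements named above, while `pic_width`, `pic_height`, the function `split_shared_fields`, and all numeric values are hypothetical placeholders rather than values taken from any HEVC draft.

```python
from typing import Dict, List, Tuple


def split_shared_fields(
    sps_list: List[Dict[str, int]],
) -> Tuple[Dict[str, int], List[Dict[str, int]]]:
    """Separate fields with identical values in all SPSs from those that differ."""
    first = sps_list[0]
    shared = {
        key: value
        for key, value in first.items()
        if all(sps.get(key) == value for sps in sps_list)
    }
    per_layer = [
        {key: value for key, value in sps.items() if key not in shared}
        for sps in sps_list
    ]
    return shared, per_layer


# Hypothetical SPS contents for three layers (all values are placeholders).
layers = [
    {"profile_idc": 1, "level_idc": 40, "pic_width": 1920, "pic_height": 1080},
    {"profile_idc": 2, "level_idc": 40, "pic_width": 1920, "pic_height": 1080},
    {"profile_idc": 3, "level_idc": 41, "pic_width": 1920, "pic_height": 1080},
]

shared, per_layer = split_shared_fields(layers)
```

In this example the picture dimensions are common to all layers, but because `profile_idc` and `level_idc` differ, no single SPS can serve every layer, and the common fields are still repeated in each one.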