H.264, also known as MPEG-4/advanced video coding (AVC), is the state-of-the-art video coding standard. It is a hybrid codec that eliminates redundancy both within each video frame and between frames. The output of the encoding process is video coding layer (VCL) data, which is further encapsulated into network abstraction layer (NAL) units prior to transmission or storage. Apart from video data, NAL units can carry parameter sets, such as sequence parameter sets (SPS) and picture parameter sets (PPS), which carry data that is essential for decoding the VCL data, such as the video resolution or the required decoder capabilities. NAL units can also carry supplemental enhancement information (SEI), which can be useful for decoders or network elements but is not essential for decoding the VCL data.
The NAL is designed to enable simple, effective, and flexible use of the VCL across a broad variety of systems for transport and storage of video data, such as transmission over real-time transport protocol (RTP) or hypertext transfer protocol (HTTP), or storage in ISO file formats. The NAL unit concept is intended to provide a means for networks, i.e., transmission and storage systems, to access, group, and manipulate compressed bit streams by splitting the bit streams into logical units. For instance, a unit corresponding to one compressed picture is augmented with high-level information indicating to the network whether the coded picture can be used as a random access point to start decoding of the compressed video.
A NAL unit is the minimum-size functional unit for H.264/AVC video. It can be subdivided into a NAL unit header and a NAL unit payload. The NAL unit header consists of a set of identifiers that can be used by networks to manage the compressed bit streams. For example, in order to reduce the transmission bit rate of a video in case of limited bandwidth, some NAL units can be discarded based on information carried in the NAL unit headers, so as to minimize the quality degradation caused by discarding video data. This process is denoted as “bit stream thinning”.
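As a minimal sketch of how a network element might act on the header alone, the following Python fragment splits the one-byte AVC NAL unit header into its three fields (forbidden_zero_bit, nal_ref_idc, and the 5-bit nal_unit_type) and thins a bit stream. The thinning rule used here, dropping non-IDR coded slices (type 1) whose nal_ref_idc is zero, is an assumption chosen for illustration, as are the function names and the representation of the bit stream as a list of byte strings with start codes already removed.

```python
from typing import Iterable, List, NamedTuple


class NalHeader(NamedTuple):
    forbidden_zero_bit: int  # 1 bit, expected to be 0
    nal_ref_idc: int         # 2 bits, 0 means "not used for reference"
    nal_unit_type: int       # 5 bits, e.g. 1 = non-IDR slice, 5 = IDR slice


def parse_nal_header(first_byte: int) -> NalHeader:
    """Split the first byte of an AVC NAL unit into its header fields."""
    return NalHeader(
        forbidden_zero_bit=(first_byte >> 7) & 0x01,
        nal_ref_idc=(first_byte >> 5) & 0x03,
        nal_unit_type=first_byte & 0x1F,
    )


def thin_bitstream(nal_units: Iterable[bytes]) -> List[bytes]:
    """Illustrative thinning rule (an assumption, not a normative one):
    drop non-IDR coded slices that no other picture references."""
    kept = []
    for nal in nal_units:
        if not nal:
            continue  # skip empty entries rather than fail on nal[0]
        header = parse_nal_header(nal[0])
        if header.nal_unit_type == 1 and header.nal_ref_idc == 0:
            continue  # discardable without breaking prediction chains
        kept.append(nal)
    return kept
```

Note that the decision uses only the first byte of each unit; the payload is never inspected, which is what allows such thinning to run in network elements that do not implement a decoder.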
While traditional video services provide video in a single representation, i.e., using fixed camera position and spatial resolution, multi-resolution and multi-view video representations have recently gained importance. A multi-resolution representation represents the video in several different spatial resolutions, so as to serve target devices with different display resolutions. A multi-view representation represents the content from different camera perspectives, a particular case being the stereoscopic video case, where the scene is captured by two cameras with a distance similar to that of the human eye. Using suitable display technologies, perception of depth can be provided to a viewer.
Multi-resolution and multi-view video representations are often referred to as hierarchical or layered representations, where a base layer represents a basic quality of the video, and successive enhancement layers amend the representations towards higher qualities.
Scalable video coding (SVC) and multi-view video coding (MVC) are video coding standards that can be used to compress multi-resolution and multi-view video representations, respectively, where high compression efficiency is achieved by eliminating redundant information between different layers. SVC and MVC are based on the AVC standard, and included as Annexes G and H in the later editions of AVC, and consequently share most of the AVC structure.
The hierarchical dependencies inherent to SVC and MVC bit streams require additional information fields in the NAL unit headers, such as decoding dependencies and view identifiers. However, in order to retain compatibility with existing AVC implementations, the basic AVC NAL unit header was not changed. Instead, the extra information, such as dependencies and view identifiers, was incorporated by introducing two new types of NAL units, namely a prefix NAL unit (type 14) and a coded slice extension NAL unit (type 20), that are defined as “unused” in AVC and thus ignored by AVC decoders which do not support Annex G or H of the specification.
A prefix NAL unit can be associated with a VCL AVC NAL unit which is supposed to follow immediately after the prefix NAL unit in the bit stream, conveying additional information pertaining to the base layer. AVC decoders will ignore the prefix NAL units and can thus decode the base layer.
A coded slice extension NAL unit is used only in SVC or MVC enhancement layers. It represents enhancement information relative to the base layer or other enhancement layers. Besides conveying dependencies and view identifiers as in the prefix NAL unit, a coded slice extension NAL unit comprises both an SVC or MVC NAL unit header and the corresponding VCL data. It is thus a combination of a prefix NAL unit and a VCL AVC NAL unit. SVC and MVC enhancement layer NAL units are ignored by AVC decoders.
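The backward compatibility described above can be sketched as a simple filter. The fragment below is an illustration only: it assumes the bit stream is available as a list of byte strings, one per NAL unit with start codes removed, and the function name is hypothetical. It shows what a legacy AVC decoder effectively sees once prefix NAL units (type 14) and coded slice extension NAL units (type 20) are ignored.

```python
from typing import Iterable, List

PREFIX_NAL_UNIT = 14          # carries SVC/MVC header info for the base layer
CODED_SLICE_EXTENSION = 20    # carries enhancement-layer header and VCL data


def legacy_avc_view(nal_units: Iterable[bytes]) -> List[bytes]:
    """Return the NAL units a decoder without Annex G/H support would process.

    The NAL unit type is the low 5 bits of the first byte of each unit.
    """
    ignored = {PREFIX_NAL_UNIT, CODED_SLICE_EXTENSION}
    return [
        nal for nal in nal_units
        if nal and (nal[0] & 0x1F) not in ignored
    ]
```

What remains after filtering is the base layer together with its parameter sets, which is why the base layer stays decodable by existing AVC implementations.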
The SVC and MVC extensions of AVC are defined in a similar way. Their use is mutually exclusive, i.e., the syntax and semantics defined in the standard are partly conflicting and do not allow SVC and MVC elements to be used simultaneously. Combining features from SVC and MVC would require changes to the standard, and in particular to the definition of the NAL unit header.
High efficiency video coding (HEVC) is a next-generation video coding standard that is currently undergoing standardization. HEVC aims to substantially improve coding efficiency compared to AVC, especially for high-resolution video sequences.
In terms of high-level syntax design, the most straightforward method is to adopt the concept of AVC high-level syntax, in particular the AVC NAL unit concept. However, this may suffer from the following problems.
According to the state of the art, SVC and MVC are built up from AVC in a backward compatible manner. The new NAL unit type 20 is designed with a header extension that can be used for any enhancement layer. To solve legacy AVC decoder issues, the old NAL units (type 1, type 5, and other types) are kept and a prefix NAL unit association method is used for each normal AVC VCL NAL unit (type 1 and type 5). While this approach could in principle be taken for HEVC and its later extensions, it has the following problems associated with it.
The introduction of new features or functionality requires the definition of new NAL unit types, e.g., coded slice extension NAL units. This may be undesirable since the maximum number of NAL unit types is typically limited, e.g., by the defined length of the NAL unit type field.
In order to take legacy decoders into consideration, a base layer must be created with a legacy NAL unit type together with a prefix NAL unit, which means that a second new NAL unit type has to be designed, further increasing the number of NAL unit types.
The signaling of base layer and enhancement layers is not uniform and requires special treatment through the network for each layer, leading to complex implementations. The use of prefix NAL units is unnatural and provides only a weak link between the necessary header information and the corresponding VCL data. This link may easily break down if, e.g., one of the NAL units is lost in transmission.
In case of future extensions, nesting of prefix NAL units is complicated.
By extending the high-level interface through additional NAL unit headers, network functionalities that are supposed to process NAL units based on the information conveyed in the NAL unit headers have to be updated each time the NAL unit headers are extended.
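The weakness of the positional link between a prefix NAL unit and its VCL data can be illustrated with a short sketch. The code below is a hypothetical association routine, not taken from any decoder: it assumes a list of byte strings, one per NAL unit, and pairs each AVC VCL NAL unit (type 1 or type 5) with the prefix NAL unit (type 14) that immediately precedes it, if any.

```python
from typing import Iterable, List, Optional, Tuple

PREFIX_NAL_UNIT = 14
AVC_VCL_TYPES = {1, 5}  # non-IDR and IDR coded slices


def nal_unit_type(nal: bytes) -> int:
    """The NAL unit type is the low 5 bits of the first header byte."""
    return nal[0] & 0x1F


def associate_prefixes(
    nal_units: Iterable[bytes],
) -> Tuple[List[Tuple[Optional[bytes], bytes]], List[bytes]]:
    """Pair each AVC VCL unit with the prefix unit directly before it.

    Returns (pairs, orphans): pairs holds (prefix_or_None, vcl_unit);
    orphans holds prefix units that were not followed by a VCL unit.
    """
    pairs: List[Tuple[Optional[bytes], bytes]] = []
    orphans: List[bytes] = []
    pending: Optional[bytes] = None
    for nal in nal_units:
        kind = nal_unit_type(nal)
        if kind == PREFIX_NAL_UNIT:
            if pending is not None:
                orphans.append(pending)  # its VCL unit never arrived
            pending = nal
        elif kind in AVC_VCL_TYPES:
            pairs.append((pending, nal))
            pending = None
        else:
            if pending is not None:
                orphans.append(pending)  # link broken by another unit
                pending = None
    if pending is not None:
        orphans.append(pending)
    return pairs, orphans
```

Because the association relies purely on adjacency, a single lost unit changes the outcome: if a VCL NAL unit is dropped, its prefix is orphaned, and if a prefix is dropped, the following VCL NAL unit is left without its extension information.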
Further problems associated with the state-of-the-art AVC concept are related to the layered representation. Currently, in SVC and MVC, all the flags related to layer properties, such as view_id, dependency_id, and quality_id, are simply put into NAL unit headers without any deliberate selection or categorization. This requires a client that is receiving the bit stream to have detailed knowledge about the definition of the flags, e.g., if the client wants to prune or manipulate the bit stream. Basically, the client is required to fully understand the meaning of each flag and how the flags interrelate. Erroneous actions may easily be taken. For example, when one view is to be extracted from a multi-view bit stream, a client that considers only the view_id flag may fail to include the views on which that view depends, or may select a low-quality version. Even with some assistance from SEI elements, there may be cases where it is very complex for the network to find and understand all the information that is needed to extract a certain video representation from the layered bit stream.
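The extraction pitfall can be made concrete with a small sketch. It assumes, purely for illustration, that the NAL unit headers have already been parsed into dictionaries with a "view_id" key and that the inter-view dependencies are available as a mapping from a view to the views it directly depends on; neither structure is defined by the standard, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def required_views(target: int, depends_on: Dict[int, Iterable[int]]) -> Set[int]:
    """Collect the target view and every view it depends on, transitively."""
    needed: Set[int] = set()
    stack = [target]
    while stack:
        view = stack.pop()
        if view in needed:
            continue
        needed.add(view)
        stack.extend(depends_on.get(view, ()))
    return needed


def extract_naive(units: List[dict], target: int) -> List[dict]:
    """Erroneous extraction: keeps only units whose view_id matches."""
    return [u for u in units if u["view_id"] == target]


def extract_with_dependencies(
    units: List[dict], target: int, depends_on: Dict[int, Iterable[int]]
) -> List[dict]:
    """Extraction that also keeps the views needed to decode the target."""
    needed = required_views(target, depends_on)
    return [u for u in units if u["view_id"] in needed]
```

With a hypothetical dependency map such as {2: [1], 1: [0]}, extracting view 2 by view_id alone keeps only the units of view 2, which cannot be decoded without views 0 and 1; the second function keeps all three views.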
Further, with more and more applications and standards covering 3D, new data elements, such as depth maps and occlusion maps, will be transmitted together with texture, allowing for more flexible rendering of output views at the receiving end. Since such elements form layered representations together with the (multi-view or scalable) “texture” video, it may be desirable to transmit them all in the same bit stream. Such bundling of different data elements may alternatively be achieved through signaling on higher system levels, such as the transport protocol or file format. However, since software and hardware implementations of such higher-level protocols are often separated from implementations of the video decompression, the exact temporal synchronization of different data elements, such as synchronization of texture with depth, may be very complex if not supported on the bit stream level. Note that the synchronization of different video data elements, such as texture and depth, must be much tighter than the synchronization of video and audio, since the different video elements must be frame aligned. Additionally, video elements, such as texture and depth, may be compressed together, e.g., by re-using motion information (“motion vectors”) among them, which requires tight coupling on the bit stream level.
The initial focus of the HEVC development is on mono video. However, later extensions towards scalable coding and/or multi-view coding are likely. It is also likely that a packetization concept similar to the NAL unit concept in AVC will be used. Thus, in the following, even though the presented methods are applicable primarily to future video coding standards such as HEVC, the term “NAL unit” will be used in the same sense as it is defined in AVC. Also other AVC concepts such as SPS, PPS, and SEI, are expected to be used in HEVC, and their AVC terminology is therefore used in the following, although they may be called differently in HEVC or any other future video coding standard.