Certain commercial video compression techniques can use video coding standards to allow for cross-vendor interoperability. The present disclosure can be used with such a video coding standard, specifically ITU-T Rec. H.264, “Advanced video coding for generic audiovisual services”, 03/2010, available from the International Telecommunication Union (“ITU”), Place des Nations, CH-1211 Geneva 20, Switzerland or http://www.itu.int/rec/T-REC-H.264, and incorporated herein by reference in its entirety.
An initial version of H.264 was ratified in 2003, and included coding tools, for example a flexible reference picture selection model, that allow for temporal scalability. A subsequent version, ratified in 2007, added in Annex G an extension towards scalable video coding (SVC), including techniques for spatial scalability and quality scalability, also known as signal-to-noise ratio (SNR) scalability. Yet another version, ratified in 2009, included in Annex H multi-view coding (MVC).
Earlier versions of H.264 were designed without special regard to the requirements of later versions. This has resulted in a number of architectural shortcomings, for example in the design of the Network Abstraction Layer (NAL) unit header, some of which are addressed by the disclosed subject matter. Co-pending U.S. application Ser. No. 13/343,266, filed Jan. 4, 2012, titled “High Layer Syntax for Temporal Scalability,” the disclosure of which is incorporated by reference herein in its entirety, addresses potential shortcomings at least in the signaling of temporal scalability, while co-pending U.S. provisional patent application Ser. No. 61/451,454, filed Mar. 10, 2012, titled “Dependency Parameter Set for Scalable Video Coding,” the disclosure of which is incorporated by reference herein in its entirety, addresses potential shortcomings at least related to the signaling of layer dependencies.
In H.264, a bitstream is logically subdivided into NAL units. Each coded picture is coded in one or more slice NAL units. Many other NAL unit categories are also defined for different types of data, such as, for example, parameter sets, SEI messages, and so forth. In some cases, a NAL unit can be “parsing-independent” in that the loss of a NAL unit may not prevent the meaningful decoding and use of other NAL units. Accordingly, NAL units can be placed into packets of packet networks subject to packet losses. This use case was one of the motivations for the introduction of the NAL unit concept over the bitstream concept known from earlier video compression standards such as MPEG-2 (ITU-T Rec. H.262 “Information technology - Generic coding of moving pictures and associated audio information: Video”, 02/2000, available from http://www.itu.int/rec/T-REC-H.262, which is also known as MPEG-2 video, incorporated herein by reference).
Throughout the disclosure, syntax table diagrams following the conventions specified in H.264 are used. To briefly summarize those conventions, a C-style notation is used. A boldface character string refers to a syntax element fetched from the bitstream (which can consist of NAL units separated by, for example, start codes or packet headers). The “Descriptor” column of the syntax diagram table provides information on the type of data. For example, u(2) refers to an unsigned integer of 2 bits length, and f(1) refers to a single bit of a predefined value.
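The descriptor conventions can be illustrated with a minimal sketch of a bit reader. The class name BitReader and its method names are hypothetical and chosen only to mirror the u(n) and f(n) descriptors; the sketch assumes that bits are consumed most significant bit first, and that the predefined value of an f(n) field is supplied by the caller.

```python
class BitReader:
    """Hypothetical helper: reads fixed-length fields MSB-first from bytes."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.pos = 0  # current position, counted in bits from the start

    def u(self, n: int) -> int:
        """u(n): unsigned integer using n bits, most significant bit first."""
        if self.pos + n > 8 * len(self.data):
            raise EOFError("not enough bits left in the buffer")
        value = 0
        for _ in range(n):
            byte = self.data[self.pos // 8]
            bit = (byte >> (7 - self.pos % 8)) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value

    def f(self, n: int, expected: int) -> int:
        """f(n): fixed-pattern field; 'expected' is the predefined value."""
        value = self.u(n)
        if value != expected:
            raise ValueError(f"fixed field is {value}, expected {expected}")
        return value


# One example octet, 0b0_11_00101, read as f(1), u(2), u(5).
reader = BitReader(bytes([0b01100101]))
fixed_bit = reader.f(1, 0)   # single bit with predefined value 0
two_bits = reader.u(2)       # 0b11
five_bits = reader.u(5)      # 0b00101
```

Reading the fields in the order in which they appear in a syntax table is all that is needed for fixed-length descriptors of this kind; variable-length descriptors are outside the scope of this sketch.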
FIG. 1 shows a NAL unit header of baseline H.264 and the SVC and MVC extensions. The baseline NAL unit header is part of the NAL unit syntax specification, which is shown (101) with certain parts omitted so as not to obscure the disclosure. Specifically, the NAL unit header includes a forbidden_zero_bit (102), two bits indicating the relative importance of the NAL unit for the decoding process (nal_ref_idc, 103), and five bits indicating the NAL unit type (104). For certain NAL unit types, namely types 14 and 20, which are defined as slice types for scalable and multiview coding, as indicated by the if( ) statement (105), a further svc_extension_flag bit (106) is included, followed by either a nal_unit_header_svc_extension( ) (108) or a nal_unit_header_mvc_extension( ) (109), as selected (107) by the svc_extension_flag.
The C-function-style references nal_unit_header_svc_extension( ) and nal_unit_header_mvc_extension( ) refer to the syntax tables shown as the SVC NAL unit header extension (110) and the MVC NAL unit header extension (120), respectively.
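The header layout described above can be sketched as follows. The names NalHeader and parse_nal_header are hypothetical; the sketch assumes that the input starts at the first octet of a NAL unit (any start code or packet header already removed) and that the svc_extension_flag is the most significant bit of the second octet.

```python
from typing import NamedTuple, Optional


class NalHeader(NamedTuple):
    forbidden_zero_bit: int
    nal_ref_idc: int
    nal_unit_type: int
    svc_extension_flag: Optional[int]  # None when no extension is present


def parse_nal_header(data: bytes) -> NalHeader:
    """Parse the first header octet and, for types 14 and 20, the flag bit."""
    if not data:
        raise ValueError("empty NAL unit")
    first = data[0]
    forbidden_zero_bit = first >> 7      # 1 bit
    nal_ref_idc = (first >> 5) & 0x03    # 2 bits
    nal_unit_type = first & 0x1F         # 5 bits
    svc_extension_flag = None
    if nal_unit_type in (14, 20):
        if len(data) < 2:
            raise ValueError("extension header is truncated")
        svc_extension_flag = data[1] >> 7  # selects SVC (1) or MVC (0)
    return NalHeader(forbidden_zero_bit, nal_ref_idc,
                     nal_unit_type, svc_extension_flag)


plain = parse_nal_header(bytes([0x65]))           # type 5, no extension
extended = parse_nal_header(bytes([0x74, 0x80]))  # type 20, flag set
```

A legacy decoder only needs the first octet; the branch on types 14 and 20 is where an extension-aware parser diverges.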
Of the SVC NAL unit header extension (110), of particular relevance in the context of this disclosure are the following fields:
The priority_id field (111) can be used to linearly signal the relative importance, as determined by the encoder, of a layer relative to other layers of the same scalable bitstream, where a layer can be any of a temporal, spatial, or SNR scalable layer. A dependent layer has a higher priority_id than the layer it depends on. Priority_id is not used by the H.264 decoding process definition, but can be used, for example by decoders or Media-Aware Network Elements (MANEs) to identify NAL units not required for the decoding of a certain layer (where that layer is lower in the layer hierarchy than the layer to which the NAL unit with the high priority_id value belongs). H.264 specifies certain constraints of its value based on the values of dependency_id, quality_id, and temporal_id.
The no_inter_layer_pred_flag (112) indicates that the layer the NAL unit belongs to does not refer to any other layer for prediction. If set for all NAL units of a given layer, this flag can indicate that the layer can be decoded without regard to any other layer, allowing for techniques such as simulcasting.
The dependency_id field (113) indicates the spatial layer or coarse grain SNR scalable layer the NAL unit belongs to—the higher the value, the higher the layer. Quality_id and temporal_id indicate similar properties for SNR scalable and temporal scalable layers.
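As an illustration of how these fields can be extracted, the following sketch unpacks the three octets that follow the first NAL unit header octet. The field order and bit widths in SVC_FIELDS are not given in this disclosure; they are stated here from recollection of Annex G of H.264 and should be checked against the specification. The function name parse_svc_extension is hypothetical.

```python
# Assumed layout (to be verified against H.264 Annex G): 24 bits in total,
# starting with the svc_extension_flag of the NAL unit header.
SVC_FIELDS = [
    ("svc_extension_flag", 1),
    ("idr_flag", 1),
    ("priority_id", 6),
    ("no_inter_layer_pred_flag", 1),
    ("dependency_id", 3),
    ("quality_id", 4),
    ("temporal_id", 3),
    ("use_ref_base_pic_flag", 1),
    ("discardable_flag", 1),
    ("output_flag", 1),
    ("reserved_three_2bits", 2),
]


def parse_svc_extension(ext: bytes) -> dict:
    """Unpack the three octets following the first NAL unit header octet."""
    if len(ext) != 3:
        raise ValueError("expected exactly three octets")
    word = int.from_bytes(ext, "big")
    remaining = 24  # bits not yet consumed, counted from the top of the word
    fields = {}
    for name, width in SVC_FIELDS:
        remaining -= width
        fields[name] = (word >> remaining) & ((1 << width) - 1)
    return fields


# Example octets encoding priority_id 5, dependency_id 2, quality_id 1,
# and temporal_id 3.
svc = parse_svc_extension(bytes([0x85, 0x21, 0x67]))
```

Under the assumed layout, a network element can read priority_id, dependency_id, quality_id, and temporal_id with a few shifts and masks, without parsing any slice data.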
The MVC NAL unit extension header (120) includes the following pertinent fields.
Priority_id (121) and temporal_id (123) have similar semantics as described above for the SVC header priority_id (111) and temporal_id (115) fields. View_id identifies one out of up to 1024 “views” of a multiview system, which can, for example, be coded signals from different cameras at different geometric positions capturing the same scene in 3D space. MVC allows for prediction across views, based on the observation that there can be redundancies between views that can be eliminated through prediction.
One goal during the specification of the scalable extension of H.264 has been to allow for the decoding of a scalable base layer by a legacy decoder that was designed before the ratification of the scalable extension, for example by a decoder conforming to any profile of the 2003 version of H.264. For this and other reasons, no backward incompatible changes have been introduced to the base layer syntax. However, there can be certain control information related to, or even affecting, base layer decoding in a scalable coding context (i.e. in conjunction with at least one enhancement layer) that may not be required in the context of decoding the base layer in isolation and, therefore, was not included in, for example, the 2003 version of H.264. Some of this information can also be relevant for MVC. Syntax for information of this category was added to the scalable extension of H.264 by, for example, the mechanisms described next.
A first mechanism is the use of different NAL unit types for slice data belonging to scalable or multiview coding, which can trigger the presence of additional fields in the NAL unit header, as already described.
A second mechanism is the introduction of a prefix NAL unit. It uses one of the previously reserved NAL unit types, which means that a legacy decoder that does not recognize the reserved type would ignore its content, whereas a scalable or multiview decoder can interpret its content. The syntax of the prefix NAL unit (201) is shown in FIG. 2. The NAL unit can include a store_ref_base_pic_flag (202) indicating, among other things and only if additional conditions are met, the presence of base picture marking information (203). Though the precise nature of such information may not be particularly relevant for this disclosure, its content is required for the decoding process in a scalable decoding situation.
A third mechanism is known as the scalability_info SEI message. Supplemental Enhancement Information (SEI) messages, as defined in H.264, should not include information required for the decoding process, but are intended for information helpful for a decoder, MANE, or other parts of the overall system layout such as rendering.
The scalability information SEI message can be viewed as a description of the scalable bitstream, including aspects such as description of its layers, inter-layer dependencies, and so forth. The syntax table of the SEI message in the H.264 specification is approximately two pages long. Some parts of it relevant for this disclosure are reproduced in FIG. 3. The scalability information SEI message (301) includes a number of flags concerned with the scalable bitstream (i.e. all layers), which is followed by an integer indicating the number of layers (302). For each of the layers, the following fields are available.
A layer_id (303) field provides for an identification of the layer. It can be used, for example, to cross-reference the layer with other layer descriptions that are located in parts of the SEI message not depicted (such as, for example, inter-layer dependency descriptions). For example, the binding between a dependent layer and the layer it depends on, within the SEI message, is established through layer_id.
The priority_id (304), dependency_id (305), quality_id (306), and temporal_id (307) fields have a meaning similar to what was already described in the context of the SVC NAL unit header fields with the same name.
All three mechanisms can be described as “bolt-on” to the non-scalable versions of H.264 (versions ratified before 2007). While preserving backward compatibility, the design is generally not characterized as elegant, can incur an unnecessarily high overhead for NAL units and pictures of the scalable extension, and can have error resilience issues.
As an example of unnecessarily high overhead, when using H.264's byte stream syntax, the overhead for a given NAL unit in the bitstream is at least four octets for the start code. Similarly, when using an IP network and placing a NAL unit into its own packet, the overhead can be 40 octets or more (20 octets for the IP header, 8 octets for the UDP header, and 12 octets for the RTP header). While aggregation techniques as well as header compression techniques can reduce that overhead to a certain extent, reducing overhead further and/or avoiding it altogether would be preferable.
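The arithmetic behind the 40 octet figure can be worked through in a short sketch. It assumes an IPv4 header without options (20 octets), a UDP header (8 octets), and a fixed RTP header without CSRC entries (12 octets); the function name overhead_fraction and the 60 octet payload are hypothetical and used only for illustration.

```python
# Assumed header sizes, in octets.
IPV4_HEADER = 20   # IPv4 header without options
UDP_HEADER = 8
RTP_HEADER = 12    # fixed RTP header, no CSRC list, no header extension

PACKET_OVERHEAD = IPV4_HEADER + UDP_HEADER + RTP_HEADER  # 40 octets


def overhead_fraction(payload_octets: int, header_octets: int) -> float:
    """Share of the transmitted octets that is header rather than payload."""
    return header_octets / (payload_octets + header_octets)


# Hypothetical small NAL unit of 60 octets sent in its own packet:
# 40 of the 100 transmitted octets are packet headers.
small_nal_fraction = overhead_fraction(60, PACKET_OVERHEAD)
```

The smaller the NAL unit, the larger this share becomes, which is why placing a few octets of control information into a NAL unit of its own can be costly.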
With respect to error resilience, FIG. 4 shows a simplified block diagram of a video conferencing system. An encoder (401) can produce a scalable bitstream (402) comprising NAL units belonging to more than one layer. Bitstream (402) is depicted as a bold line to indicate that it has a certain bitrate. The bitstream (402) can be forwarded over a network link to a media aware network element (MANE) (403). The function of the MANE (403) can be to “prune” the bitstream down to a certain bitrate provided by a second network link, for example by selectively removing those NAL units belonging to the highest layer. This is shown by the dotted line for the bitstream (404) sent from the MANE (403) to a decoder (405). If the scalable bitstream (402) contains only NAL units of a base layer and one enhancement layer, then, after pruning, bitstream (404) contains only NAL units of the base layer. The decoder (405) can receive the pruned bitstream (404) from the MANE (403), and decode and render it.
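The pruning operation can be sketched as a simple filter. The NalUnit record and the function prune_to_layer are hypothetical; the sketch assumes that each NAL unit has already been associated with a dependency_id (0 for the base layer) and considers only that one scalability dimension, whereas an actual MANE would also take quality_id, temporal_id, and the layer dependencies into account.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class NalUnit:
    """Hypothetical record: only the fields needed for this sketch."""
    dependency_id: int  # 0 is assumed to denote the base layer
    size_octets: int


def prune_to_layer(units: Iterable[NalUnit],
                   max_dependency_id: int) -> List[NalUnit]:
    """Keep NAL units at or below the given layer, preserving their order."""
    return [u for u in units if u.dependency_id <= max_dependency_id]


# A bitstream with a base layer and one enhancement layer.
bitstream_402 = [
    NalUnit(dependency_id=0, size_octets=100),
    NalUnit(dependency_id=1, size_octets=300),
    NalUnit(dependency_id=0, size_octets=120),
    NalUnit(dependency_id=1, size_octets=280),
]

# Removing the highest layer leaves only base layer NAL units.
bitstream_404 = prune_to_layer(bitstream_402, max_dependency_id=0)
```

The filter itself is trivial; the difficulty is that choosing max_dependency_id requires knowledge of the layer structure of the bitstream.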
In such an application, the potential unavailability of a scalability information SEI message at the MANE very early in the connection (ideally before any coded slice NAL unit of any layer is received by the MANE in bitstream (402)) can have negative consequences on the decoder behavior and/or incur unnecessarily high cost in the decoder implementation. For example, without knowing the scalability structure (for example: number of layers and their dependency), the MANE (403) may need to forward, and the decoder (405) may need to buffer and, to the extent possible, decode all NAL units it receives, even those it will not be using for rendering (for example because they belong to a spatial layer of a higher resolution than the display of a handheld device). Similarly, the MANE (403) can have great difficulty in deciding which NAL units to forward to a decoder of limited capability if it receives a scalable bitstream of high complexity (many layers), but is aware that a receiving decoder can process only a non-scalable bitstream or a scalable bitstream of low complexity (few layers). An SEI message, including the scalability information SEI message, can be unavailable because, for example, it was lost during the transmission of bitstream (402), or because the encoder (401) decides not to send the SEI message, for example to save the bits of the message (which conforms to the standard, although it is not advisable from an application design viewpoint).
A MANE also needs to maintain state, especially with respect to the content of the scalability information SEI message, so as to make informed decisions about pruning. Such state can be established only by intercepting and interpreting all such SEI messages. While most MANEs need to intercept and interpret some bitstream information, such as parameter set information, to make meaningful decisions, very few of the numerous SEI messages have any meaning to a MANE. Intercepting all SEI messages just to extract and interpret those few which are meaningful for the MANE can be an onerous and computationally expensive process.
In short, H.264's scalable and multiview syntax related to the NAL unit header contains several potential shortcomings as a result of the “bolt-on” design of the scalable and multiview extensions. First, the NAL unit header for NAL units of the extensions can be unnecessarily large. Second, information pertaining to picture buffer management is sent in its own NAL unit (the prefix NAL unit), which can incur unnecessarily high overhead. In addition, information important for certain applications (such as information carried in the scalability information SEI message) is carried in SEI messages, which (a) does not reflect its critical nature (SEI can be discarded), and (b) may require unnecessary duplication of some information in, for example, the NAL unit header (such as: dependency_id, quality_id, view_id).
Recently, a new standard, High Efficiency Video Coding (HEVC), has been considered for standardization. A working draft of HEVC can be found at (B. Bross et al., “WD4: Working Draft 4 of High-Efficiency Video Coding”, available from http://wftp3.itu.int/av-arch/jctvc-site/2011_07_F_Torino/), referred to as “WD4” henceforth, which is incorporated herein by reference. HEVC inherits many high level syntax features of H.264. It can be advantageous to the success of HEVC if the potential shortcomings of H.264 described above were addressed before the standard is ratified.