Commercial video compression techniques can use video coding standards to allow for cross-vendor interoperability. The present disclosure can be used with such a video coding standard, specifically ITU-T Rec. H.264, “Advanced video coding for generic audiovisual services”, March 2010, available from the International Telecommunication Union (“ITU”), Place des Nations, CH-1211 Geneva 20, Switzerland or http://www.itu.int/rec/T-REC-H.264, and incorporated herein by reference in its entirety.
H.264 allows for temporal scalability through a technique known as reference picture selection. Reference picture selection can allow, with a few limitations (such as: no references to pictures decoded before (in decoding order) the latest IDR picture), at the time of reconstruction of a given picture, inter picture prediction from any reference picture in storage at the decoder. The number of reference pictures stored in the decoder can be limited by profiles and levels of H.264. Further, the encoder during bitstream generation can explicitly signal, for each picture, whether it should be stored as a reference picture. In the absence of explicit signaling, some pictures are also stored implicitly. The combination of explicit signaling and implicit storage can allow for flexibility of reference picture management at low bitrate overhead.
In practice, however, certain H.264 encoders create bitstreams in which the referencing relationship of coded pictures follows certain “patterns”. One crude form of those patterns is the Group Of Pictures, or GOP, known from, for example, ITU-T Rec. H.262 “Information technology - Generic coding of moving pictures and associated audio information: Video”, February 2000, available from http://www.itu.int/rec/T-REC-H.262, which is also known as MPEG-2 video, and incorporated herein by reference. FIG. 2 shows two examples of patterns implementable with H.264; a more detailed description is provided later.
Within a pattern, the decoding of certain pictures can be more relevant than the decoding of others, from both a bitstream compliance and a user experience perspective. For example, the non-availability for decoding of an IDR picture, which is in some cases the first picture of a pattern, can have negative consequences for the decoding of the rest of the pattern. On the other hand, the non-availability of a picture that is not used for reference leads only to that very picture not being presented, which can be perceived by the user as a temporary drop in frame rate, and can in some cases be concealed. The consequences of not decoding pictures other than IDR pictures and non-reference pictures can be of moderate severity, as described later.
Referring to FIG. 1, shown is a simplified block diagram of an exemplary video conferencing system. An encoder (101) can produce a bitstream (102) including coded pictures with a pattern that allows, for example, for temporal scalability. Bitstream (102) is depicted as a bold line to indicate that it has a certain bitrate. The bitstream (102) can be forwarded over a network link to a media aware network element (MANE) (103). The MANE's (103) function can be to “prune” the bitstream down to a certain bitrate provided by a second network link, for example by selectively removing those pictures that have the least impact on user-perceived visual quality. This is shown by the hairline used for the bitstream (104) sent from the MANE (103) to a decoder (105). The decoder (105) can receive the pruned bitstream (104) from the MANE (103), and decode and render it. By pruning only those pictures that are either not used for reference at all, or used for reference only by a subset of the remaining pictures of the pattern (which advantageously are also removed), the visual quality can be kept high even considering the reduction of bitrate.
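The pruning decision of the MANE (103) can be illustrated by a minimal sketch, assuming that the bitstream is organized in layers, that each layer may be removed only together with all layers above it, and that an average bitrate per layer is known. The function name, the per-layer bitrates, and the budget value are hypothetical and serve only as an illustration.

```python
from typing import Sequence

def highest_layer_within_budget(layer_kbps: Sequence[float], budget_kbps: float) -> int:
    """Return the index of the highest layer that still fits the outgoing link.

    layer_kbps[i] is the assumed average bitrate contributed by layer i
    (layer 0 being the base layer). Returns -1 if even layer 0 does not fit.
    """
    total = 0.0
    highest = -1
    for layer, rate in enumerate(layer_kbps):
        if total + rate > budget_kbps:
            break  # this layer and all higher layers would be pruned
        total += rate
        highest = layer
    return highest

# Hypothetical numbers: three layers, outgoing link limited to 500 kbit/s.
print(highest_layer_within_budget([300, 150, 150], 500))
```

In this sketch the MANE would forward layers 0 and 1 and remove layer 2, since forwarding all three layers would exceed the assumed link budget.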
Bitstream pruning in the aforementioned sense is an operation that can be processed in the compressed domain. In contrast to transcoding (which involves at least partial bitstream reconstructions and encoding), bitstream pruning can be a computationally lightweight and virtually delay-neutral operation.
Bitstream pruning can occur in any of encoder (101), MANE (103), and decoder (105). The key use case for MANE (103) based pruning has already been described. In a decoder (105), pruning can be sensible when computational resources are not available to decode all layers received in bitstream (104), which can, for example, be the case when there is no call control protocol in which a decoder (105) can advise a MANE (103) or encoder (101) of its capabilities. Broadcast transmissions of multiple layers are one practical scenario. Bitstream pruning in the encoder (101) can occur, for example, when the signal processing entities of the encoder (101) are incapable of adapting to a network bitrate. That is, they always encode at a high bitrate with several layers, as dictated, for example, by an inflexible hardware architecture, but the network bitrate available for bitstream (102) changes to a value lower than required to transport all the bits, and the transport part of the encoder (101) becomes aware of this situation.
Even assuming that an encoder uses a certain pattern, the high number of potential patterns (limited only by constraints such as the maximum number of reference pictures) can create a difficulty for an encoder, decoder, or MANE when it needs to identify those pictures it needs to skip decoding, forwarding, or otherwise handling.
When H.264 and its scalable extension Annex G were designed, this problem was to some measure addressed by certain mechanisms described below.
In bitstreams compliant with H.264, a decoder or MANE can use a syntax element in the NAL unit header known as nal_ref_idc to identify a picture that is not used as a reference picture. Similarly, the nal_unit_type can indicate an IDR picture. These two signaling techniques cover the two most outlying cases: IDR pictures, which in most cases are required for the decoding of all other pictures of a pattern (highest importance), and non-reference pictures, which are not required for decoding of any other picture of the pattern (lowest importance). Both mechanisms are available with or without the use of Annex G.
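As a minimal sketch of this classification, assume the one-byte NAL unit header of H.264 (one forbidden_zero_bit, two bits of nal_ref_idc, five bits of nal_unit_type), with nal_unit_type 5 denoting a coded slice of an IDR picture and nal_unit_type 1 a coded slice of a non-IDR picture. The function name and the class labels are illustrative and are not taken from H.264.

```python
def classify_nal_unit(first_byte: int) -> str:
    """Coarse importance class derived from the first byte of a NAL unit."""
    nal_ref_idc = (first_byte >> 5) & 0x03  # 2 bits following the forbidden_zero_bit
    nal_unit_type = first_byte & 0x1F       # lowest 5 bits
    if nal_unit_type == 5:
        return "idr"            # highest importance
    if nal_unit_type == 1:
        # Coded slice of a non-IDR picture: nal_ref_idc tells whether it is referenced.
        return "non-reference" if nal_ref_idc == 0 else "reference"
    return "other"              # parameter sets, SEI, and so forth

print(classify_nal_unit(0x65))  # nal_ref_idc=3, nal_unit_type=5
print(classify_nal_unit(0x41))  # nal_ref_idc=2, nal_unit_type=1
print(classify_nal_unit(0x01))  # nal_ref_idc=0, nal_unit_type=1
```

Everything the sketch labels "reference" falls between the two outlying cases, which is the range where nal_ref_idc and nal_unit_type alone give no further guidance.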
The Scalable Video Coding (SVC) extension to H.264, specified in Annex G, provides further aid in identifying pictures of a pattern that can be pruned. Specifically Annex G introduces, among other things, the concept of a temporal layer. Referring to FIGS. 2a and 2b, shown are two different patterns implementing temporal scalability.
FIG. 2a shows a pattern (201) that includes three pictures (202-204). Picture (202) belongs to the base layer (206), and is predicted only from a previous base layer picture (205). Prediction relationships are shown by arrows. Two temporal enhancement layer (207) pictures (203, 204) are predicted from base layer picture (202) and from layer 1 picture (203), respectively. No base layer picture is predicted from enhancement layer pictures. Further, no prediction occurs between pictures in pattern (201), and other patterns, with the exception of the base layer prediction.
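The effect of losing a picture of pattern (201) can be illustrated with a small sketch that propagates the loss along the prediction relationships just described. The dictionary layout and the function name are assumptions made for illustration; only the reference relationships are taken from the description of FIG. 2a, and pictures of subsequent patterns are not modeled.

```python
from typing import Dict, List, Set

# Picture -> pictures it is predicted from, as described for pattern (201).
REFERENCES: Dict[int, List[int]] = {
    202: [205],  # base layer picture, predicted from the previous base layer picture
    203: [202],  # enhancement layer picture, predicted from (202)
    204: [203],  # enhancement layer picture, predicted from (203)
}

def undecodable_after_loss(lost: int, references: Dict[int, List[int]]) -> Set[int]:
    """Return the lost picture plus every picture that depends on it."""
    broken = {lost}
    changed = True
    while changed:  # repeat until no further picture is affected
        changed = False
        for picture, refs in references.items():
            if picture not in broken and any(r in broken for r in refs):
                broken.add(picture)
                changed = True
    return broken

for lost in (204, 203, 202):
    print(lost, sorted(undecodable_after_loss(lost, REFERENCES)))
```

Removing picture (204) affects only that picture, whereas removing picture (203) also makes (204) undecodable, and removing base layer picture (202) affects the whole pattern.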
FIG. 2b shows a pattern using three temporal layers: base layer (210), and a first (211) and second (212) temporal enhancement layer. Pattern (213) includes four pictures, of which picture (214) is a base layer picture, picture (215) belongs to the first enhancement layer (211), and pictures (216) and (217) belong to the second enhancement layer (212).
The prediction relationships shown in FIGS. 2a and 2b are those normally associated with predicted (P−) pictures, in contrast to Intra (I−) pictures or bi-predicted (B−) pictures. Further, no multi-prediction (in the sense that different blocks of a picture can have different temporal prediction relationships) is shown in the FIGs. All of the above options exist in at least some profiles of H.264. In the description below, those features are sometimes omitted so as not to obscure the more relevant aspects of the present disclosure. A person skilled in the art is able to generalize the description to different picture types and multi-prediction.
According to H.264 Annex G, there is a temporal_id field in the NAL unit header extension, which is present only for enhancement layer NAL units compliant with Annex G. The purpose of the temporal_id field is to indicate the temporal layer to which the NAL unit belongs. The presence of this information is required for the bitstream to be compliant, but it should not have any direct impact on the decoding process. In other words, at least NAL units belonging to an enhancement layer have information included that signals to the decoder the temporal layer the picture belongs to.
The SVC extension further includes a Scalability Information SEI message. The Scalability Information SEI message includes information about the scalability structure of the bitstream, which can also be viewed as a description of a pattern. The Scalability Information SEI message can be used to indicate, among other things, dependencies between temporal layers, which are defined by the temporal_id syntax element described above. In other words, by receiving and interpreting the Scalability Information SEI message, a decoder can learn how many temporal layers it can expect in the scalable bitstream.
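Reading temporal_id can be sketched as follows. The sketch assumes that the NAL unit header extension consists of three bytes following the one-byte NAL unit header, that it is present for nal_unit_type 14 and 20, that its first bit (svc_extension_flag) is set for Annex G, and that temporal_id occupies the three most significant bits of the last extension byte. This layout should be verified against H.264 Annex G before use. The function name is illustrative, and the input is assumed to be a single NAL unit without its start code.

```python
from typing import Optional

SVC_NAL_UNIT_TYPES = (14, 20)  # assumed: prefix NAL unit, coded slice in scalable extension

def read_temporal_id(nal_unit: bytes) -> Optional[int]:
    """Return temporal_id, or None if the NAL unit carries no SVC header extension."""
    if len(nal_unit) < 4:
        return None
    if (nal_unit[0] & 0x1F) not in SVC_NAL_UNIT_TYPES:
        return None              # plain H.264 NAL unit: no temporal_id available
    if not nal_unit[1] & 0x80:
        return None              # svc_extension_flag not set
    return (nal_unit[3] >> 5) & 0x07

# Hand-built example header: nal_unit_type 20, temporal_id 2.
print(read_temporal_id(bytes([0x74, 0x80, 0x00, 0x40])))
# A NAL unit with nal_unit_type 1 has no header extension.
print(read_temporal_id(bytes([0x41, 0x9A, 0x00, 0x00])))
```

The second call returns None, which reflects the point made above: NAL units without the header extension carry no temporal layer information in their header.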
This provides for information that can be useful, but is not strictly required, for the decoding process (but may be critical for other mechanisms not defined in H.264, such as rendering, bitstream pruning, selective decoding, and so forth).
The Scalability Information SEI message further includes a temporal_id_nesting_flag. Informally put, the temporal_id_nesting_flag, when set, indicates that there is no prediction relationship between pictures of a higher temporal layer “across” a picture of a lower temporal layer. For the full definition, see H.264 Annex G. The patterns of FIG. 2a and FIG. 2b fulfill this condition, as do all four patterns (also known as coding structures) of FIGS. 3a-d. FIG. 3a shows a traditional IPPP coding structure, with only a single temporal layer. Since there are no temporal enhancement layers, the value of temporal_id_nesting_flag is irrelevant. FIG. 3b depicts an IBBP structure as commonly used in MPEG-2 based broadcast environments. The pictures of the temporal enhancement layer 1 (B pictures) use only the I pictures and P pictures of the base layer for reference. FIG. 3c shows a three layer coding structure using B pictures for the enhancement layers. Such a coding structure is implementable using H.264. FIG. 3d shows a hierarchical three layer P picture based coding structure as used in some video conferencing systems.
Encoders, decoders and MANEs can use the information in the Scalability Information SEI message to determine the presence of temporal layers in the bitstream, and to determine to what extent those layers are properly “nested” in each other, in that no picture of a higher layer is used as a reference by a picture of a lower layer. This information can be used for bitstream pruning without deep analysis of the bitstream. For example, if the temporal_id_nesting_flag is set, and the Scalability Information SEI message indicates that temporal_id=2 is the highest temporal layer, a MANE or a decoder can safely remove from a pattern all NAL units with temporal_id equal to 2 without breaking any prediction in layers 0 and 1, or it can remove all NAL units with temporal_id equal to 2 or 1, without breaking any prediction in layer 0.
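The pruning operation of this example can be sketched in a few lines, assuming that the temporal_id of each NAL unit has already been determined and that the layers are properly nested. The tuple representation of a NAL unit, the payload placeholders, and the function name are hypothetical.

```python
from typing import List, Tuple

NalUnit = Tuple[int, bytes]  # (temporal_id, payload); illustrative representation only

def prune_temporal_layers(nal_units: List[NalUnit], max_temporal_id: int) -> List[NalUnit]:
    """Keep only NAL units whose temporal_id does not exceed max_temporal_id."""
    return [nal for nal in nal_units if nal[0] <= max_temporal_id]

# One pattern with three temporal layers, in an assumed decoding order.
pattern = [(0, b"base"), (2, b"enh2-a"), (1, b"enh1"), (2, b"enh2-b")]

print(prune_temporal_layers(pattern, 1))  # layer 2 removed
print(prune_temporal_layers(pattern, 0))  # layers 1 and 2 removed
```

The filter operates purely on header-level information, so no slice data needs to be parsed or re-encoded.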
The SVC extension further provides a tl_switching_point SEI message, with a delta_frame_num syntax element, to provide information about the relative position in frames when a switching point will be present. If the SEI message is used, the bitstream is restricted such that a particular temporal layer may not use any previously coded higher temporal layer for decoding.
The presence of the information in this SEI message can enable the decoder to switch how many temporal layers to decode, in particular to begin decoding additional temporal layers at switching points.
The aforementioned mechanisms allow for efficient bitstream pruning of higher temporal layers, and layer switching between temporal layers if, and only if, the Scalability Information and tl_switching_point SEI messages are available in all MANEs involved in the transmission and/or pruning of the bitstream, and at the decoder. However, SEI message NAL units have the nal_ref_idc syntax element set to 0, indicating that a MANE or a decoder can ignore such information without violating standards compliance. Accordingly, a MANE not specifically concerned with scalable bitstreams (for example because it is a legacy device that was designed before the scalable extension of H.264 was standardized), but in need of “pruning” a bitstream (for example because of insufficient bandwidth on its outgoing link), is likely to remove the SEI message along with other NAL units with nal_ref_idc set to 0, such as non-reference pictures. As a result, other MANEs or decoders further downstream may not easily (without deep bitstream inspection) remove temporal layers.
A MANE also may be required to maintain state, especially with respect to the content of the Scalability Information and tl_switching_point SEI messages, so as to make informed decisions about pruning. Establishing such state can require intercepting and interpreting all, or substantially all, such SEI messages. While most MANEs need to intercept and interpret parameter set information to make meaningful decisions, very few of the numerous SEI messages have any meaning to a MANE. Intercepting all SEI messages just to extract and interpret those few which are meaningful for the MANE can be an onerous and computationally expensive process.
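The behavior of such a legacy MANE can be sketched as follows, assuming the one-byte NAL unit header of H.264 and nal_unit_type 6 for SEI NAL units. The pruning policy (drop every NAL unit with nal_ref_idc equal to 0), the function name, and the hand-built example bytes are illustrative assumptions and do not describe any particular device.

```python
from typing import List

SEI_NAL_UNIT_TYPE = 6

def legacy_prune(nal_units: List[bytes]) -> List[bytes]:
    """Drop every NAL unit whose nal_ref_idc is 0, as a scalability-unaware MANE might."""
    return [nal for nal in nal_units if nal and (nal[0] >> 5) & 0x03 != 0]

stream = [
    bytes([0x06, 0x18]),  # SEI NAL unit (nal_ref_idc=0); payload byte is a placeholder
    bytes([0x65, 0x88]),  # IDR slice (nal_ref_idc=3)
    bytes([0x41, 0x9A]),  # reference P slice (nal_ref_idc=2)
    bytes([0x01, 0x9E]),  # non-reference slice (nal_ref_idc=0)
]

pruned = legacy_prune(stream)
sei_survived = any((nal[0] & 0x1F) == SEI_NAL_UNIT_TYPE for nal in pruned)
print(len(pruned), sei_survived)
```

After pruning, the non-reference slice and the SEI NAL unit are both gone, so a downstream device no longer receives the scalability description it would need for layer-aware pruning.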
Further, temporal scalability (in contrast to other forms of scalability) can be implemented using the pre-Annex G version of H.264 (profiles such as baseline, main, or high profile). However such profiles can lack the functionality of the aforementioned SEI messages.
Accordingly, one shortcoming of Annex G of H.264 can be that the information mentioned is carried in SEI messages, whereas it should be available in syntax elements that are less readily discarded, and less obscured by other information.
Currently in the process of standardization is High Efficiency Video Coding (HEVC). A working draft of HEVC can be found at (B. Bross et al., “WD4: Working Draft 4 of High-Efficiency Video Coding”, available from http://wftp3.itu.int/av-arch/jctvc-site/2011_07_F_Torino/), referred to as “WD4” henceforth, which is incorporated herein by reference. HEVC inherits many high level syntax features of H.264. It can be advantageous to the success of HEVC if the shortcomings of H.264 described above are addressed before the standard is ratified.
There is a need for techniques that allow for signaling of information related to temporal scalability in a manner that makes intentional removal of scalability information (such as an SEI message) by legacy MANEs, baseline H.264 decoders, and HEVC capable MANEs and decoders difficult or impossible without losing conformance with the video coding standard, while still maintaining overall design integrity.