Commercial video compression techniques can use video coding standards to allow for cross-vendor interoperability. One such video coding standard is ITU-T Rec. H.264, “Advanced video coding for generic audiovisual services”, March 2010, available from the International Telecommunication Union (“ITU”), Place des Nations, CH-1211 Geneva 20, Switzerland or http://www.itu.int/rec/T-REC-H.264, and incorporated herein by reference in its entirety.
H.264 allows for temporal scalability through a technique known as reference picture selection. With a few limitations (for example, no references to pictures that precede the latest IDR picture in decoding order), reference picture selection can allow inter picture prediction, at the time a given picture is reconstructed, from any reference picture stored at the decoder. The number of reference pictures stored in the decoder can be limited by the profiles and levels of H.264. Further, during bitstream generation the encoder can explicitly signal, for each picture, whether it should be stored as a reference picture. In the absence of explicit signaling, some pictures are also stored implicitly. The combination of explicit signaling and implicit storage can allow for flexible reference picture management at low bitrate overhead.
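By way of illustration only, the reference picture storage described above can be modeled as a bounded store. The following Python sketch is a simplification and is not the marking process specified by H.264: the class name `ReferenceStore`, its parameters, and the first-in-first-out eviction rule are assumptions chosen for brevity.

```python
from collections import deque


class ReferenceStore:
    """Toy model of decoder-side reference picture storage (hypothetical)."""

    def __init__(self, max_refs):
        # max_refs stands in for the limit imposed by profile and level.
        # A deque with maxlen silently drops its oldest entry when full.
        self.refs = deque(maxlen=max_refs)

    def on_decoded(self, pic_id, is_idr=False, is_reference=True):
        # An IDR picture prevents references to earlier pictures,
        # so every stored reference is discarded first.
        if is_idr:
            self.refs.clear()
        # The encoder signals per picture whether it is kept as a reference.
        if is_reference:
            self.refs.append(pic_id)

    def available(self):
        # Pictures that a newly decoded picture could predict from.
        return list(self.refs)


store = ReferenceStore(max_refs=2)
store.on_decoded(0, is_idr=True)
store.on_decoded(1)
store.on_decoded(2, is_reference=False)  # decoded but never stored
store.on_decoded(3)                      # store is full, picture 0 is evicted
refs_before_idr = store.available()
store.on_decoded(4, is_idr=True)         # storage is flushed at the IDR
refs_after_idr = store.available()
```

In this model, pictures 1 and 3 are available as references just before the second IDR picture, and only picture 4 is available after it.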
In practice, however, certain H.264 encoders create bitstreams in which the referencing relationships of coded pictures, also known as temporal picture coding structures, do not exercise the full flexibility H.264 allows, but instead follow certain “patterns”. One crude form of those patterns was known as a Group Of Pictures, or GOP, as known from, for example, ITU-T Rec. H.262 “Information technology - Generic coding of moving pictures and associated audio information: Video”, February 2000, available from http://www.itu.int/rec/T-REC-H.262, which is also known as MPEG-2 video, and incorporated herein by reference. FIG. 1 shows such a pattern, known as the IBBP pattern, that is deployed in MPEG-2 and H.264 based broadcasting systems. A temporal base layer (101) includes Intra/IDR (I-) pictures (103) (104) and Predictively coded (P-) pictures (105). The I picture frequency (the inverse of the temporal distance between two I pictures such as pictures (103) and (104)) can be set by the encoder based on application demands (for example, tune-in time for broadcast), and the distance between I pictures is often in the sub-second range. A temporal enhancement layer (102) can consist entirely of bi-predicted (B-) pictures (106) (107) with prediction relationships to the temporally closest I- or P-pictures. Prediction relationships are shown by arrows (108); each arrow originates from a picture that is being predicted and points to the picture from which prediction information is taken.
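By way of illustration only, an IBBP pattern of the kind described above can be sketched in Python as follows. The names `Picture`, `build_ibbp`, and `coding_order`, and the default values for the I picture period and the number of consecutive B pictures, are hypothetical and are not taken from FIG. 1.

```python
from dataclasses import dataclass, field


@dataclass
class Picture:
    display_idx: int
    ptype: str                                # "I", "P", or "B"
    layer: int                                # 0 = base, 1 = enhancement
    refs: list = field(default_factory=list)  # display indices predicted from


def build_ibbp(num_pictures, i_period=12, b_run=2):
    """Build an IBBP pattern in display order (illustrative model only)."""
    anchor_step = b_run + 1
    if i_period % anchor_step:
        raise ValueError("i_period must be a multiple of b_run + 1")
    pics = []
    for n in range(num_pictures):
        if n % i_period == 0:
            pics.append(Picture(n, "I", 0))
        elif n % anchor_step == 0:
            # P pictures predict from the previous I or P picture.
            pics.append(Picture(n, "P", 0, [n - anchor_step]))
        else:
            # B pictures predict from the closest I/P picture on each side.
            prev_anchor = n - n % anchor_step
            next_anchor = prev_anchor + anchor_step
            # A trailing B picture may have no following anchor in this model.
            refs = [r for r in (prev_anchor, next_anchor) if r < num_pictures]
            pics.append(Picture(n, "B", 1, refs))
    return pics


def coding_order(pics):
    """Order pictures so that every reference precedes its dependants."""
    def key(p):
        last_needed = max(p.refs + [p.display_idx])
        # Anchors sort before the B pictures that wait on them.
        return (last_needed, p.ptype == "B", p.display_idx)
    return sorted(pics, key=key)


pattern = build_ibbp(7)
display_types = "".join(p.ptype for p in pattern)
coded = [f"{p.ptype}{p.display_idx}" for p in coding_order(pattern)]
```

For seven pictures the display order is I B B P B B P, while the derived coding order is I0 P3 B1 B2 P6 B4 B5. No base layer picture references a B picture, which is why the enhancement layer can be discarded without affecting the base layer.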
In MPEG-2, no multiple reference picture prediction mechanisms beyond the constrained mechanism of B frames were available, which limited the possible prediction relationships. In H.264, in contrast, prediction relationships can be more complex in two dimensions. First, inter picture prediction can be possible not only from the temporally closest I or P picture (in the case of a P picture that is predicting), or from the two temporally closest I or P pictures, in the past or in the future (in the case of B pictures), but also from temporally distant pictures of any type. Second, while a given macroblock within a predicted (P-) or bi-predicted (B-) slice can reference only content of one or two different pictures, for P or B coded macroblocks respectively, different macroblocks can reference different pictures even if they are located in the same slice.
FIG. 2 shows two patterns (201) (202) possible in H.264. Each of the two different patterns uses three different layers and each uses only I and P pictures.
The H.264/AVC JM reference software, as described in, for example, A. M. Tourapis, K. Sühring and G. Sullivan, “H.264/14496-10 AVC Reference Software Manual (revised for JM17.1),” JVT-AE010 revised, JVT-Manual, London, UK, June 2009, available from http://wftp3.itu.int/av-arch/jvt-site/2009_06_London/JVT-AE010.zip, which is incorporated herein by reference, provides a mechanism for describing, in the encoder configuration file, the temporal picture coding structure for the encoder to generate. Many coding structures can be described using the “ExplicitHierarchyFormat” parameter. The encoder uses these configuration parameters during its encoding, but does not explicitly encode the parameters inside, or along with, the generated bitstream. A decoder, therefore, may not have a mechanism available to obtain the coding structure, without deriving it from the bitstream through deep bitstream inspection.
The sub-sequence information, sub-sequence layer characteristics, and sub-sequence characteristics SEI messages in the H.264/AVC standard provide some information about the coding structure, with fields to provide the average frame rate and average bit rate for each sub-sequence layer. However, they have no explicit notion of temporal layering, and the coding and display order of each picture is not specified. Additionally, the SEI messages have to be sent frequently, for each picture or for each repeating structure.
The scalability information SEI message in the H.264 SVC extension provides some means to describe the coding structure using layer_dependency_info_present_flag and associated syntax elements. It also includes frame rate and bit rate information. However, the scalability information SEI message does not have enough information to fully identify temporal coding structures. For example, referring to FIG. 2, the scalability information SEI message cannot distinguish between the two coding structures depicted (201) (202).
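By way of illustration only, the following Python sketch shows how a description kept at the level of layers can be identical for two different picture-level structures. The two structures, the layer assignment, and the function `layer_summary` are hypothetical; they are not the patterns (201) (202) of FIG. 2, and the summary does not model the actual syntax of any SEI message.

```python
from collections import Counter

# Picture index within a four-picture period -> temporal layer (assumed).
LAYER = {0: 0, 1: 2, 2: 1, 3: 2}

# Picture index -> reference picture indices. Picture 0 is intra coded and
# the others are P pictures. Both structures decode in the order 0, 2, 1, 3.
STRUCTURE_A = {0: [], 1: [0], 2: [0], 3: [2]}
STRUCTURE_B = {0: [], 1: [2], 2: [0], 3: [0]}


def layer_summary(structure, layer=LAYER):
    """Reduce a picture-level structure to layer-level information."""
    # Which layers predict from which other layers.
    deps = frozenset(
        (layer[pic], layer[ref])
        for pic, refs in structure.items()
        for ref in refs
    )
    # Pictures per layer per period, a stand-in for per-layer frame rate.
    counts = tuple(sorted(Counter(layer[p] for p in structure).items()))
    return deps, counts


same_summary = layer_summary(STRUCTURE_A) == layer_summary(STRUCTURE_B)
same_structure = STRUCTURE_A == STRUCTURE_B
```

Here `same_summary` is true while `same_structure` is false: both structures have the same layer dependencies and the same number of pictures per layer, yet pictures 1 and 3 predict from different pictures.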
As described, none of the aforementioned SEI messages, alone or in combination, describes a temporal coding structure fully in such a way that a decoder can use it for, for example, resource allocation purposes. Further, even if a Media-Aware Network Element (“MANE”) such as a bitstream extractor or transrator were to intercept all SEI messages, it would not have all the information needed to meaningfully identify pictures (more precisely: NAL units belonging to pictures) that can be removed when pruning a scalable bitstream. Additional details regarding the bitstream extractor and transrator are described later.
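By way of illustration only, the pruning task mentioned above can be sketched in Python as follows, under the assumption that the temporal layer and the reference pictures of every picture are already known to the extractor. The function `prune` and the dictionary keys `id`, `layer`, and `refs` are hypothetical.

```python
def prune(pictures, max_layer):
    """Keep pictures at or below max_layer and check that they still decode.

    pictures: list of dicts with keys 'id', 'layer', and 'refs'.
    """
    kept = [p for p in pictures if p["layer"] <= max_layer]
    kept_ids = {p["id"] for p in kept}
    for p in kept:
        # A retained picture must not predict from a removed picture.
        missing = [r for r in p["refs"] if r not in kept_ids]
        if missing:
            raise ValueError(f"picture {p['id']} needs removed {missing}")
    return kept


# A small three-layer example structure (assumed for this sketch).
stream = [
    {"id": 0, "layer": 0, "refs": []},
    {"id": 1, "layer": 2, "refs": [0]},
    {"id": 2, "layer": 1, "refs": [0]},
    {"id": 3, "layer": 2, "refs": [2]},
    {"id": 4, "layer": 0, "refs": [0]},
]

base_only = [p["id"] for p in prune(stream, max_layer=0)]
two_layers = [p["id"] for p in prune(stream, max_layer=1)]
```

The sketch works only because the per-picture layer and reference information is given as input; without such information in the bitstream, an extractor would need to derive it through deep bitstream inspection.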
A working draft of the High Efficiency Video Coding (“HEVC”) standard can be found in B. Bross et al., “WD4: Working Draft 4 of High-Efficiency Video Coding”, available from http://wftp3.itu.int/av-arch/jctvc-site/2011_07_F_Torino/, referred to as “WD4” henceforth, which is incorporated herein by reference. HEVC inherits many high level syntax features of H.264. It can be advantageous to the success of HEVC if the potential shortcomings of H.264 described above are addressed before the standard is ratified.
A mechanism is therefore required that enables an encoder to place into a video bitstream a representation of a temporal picture coding structure or pattern, such that a decoder or a MANE can easily intercept and decode the representation and use it for, for example, transrating or bitstream extraction in MANEs, or resource management in decoders.