Scalable video coding refers to techniques where a base layer is augmented by one or more enhancement layers. When base and enhancement layer(s) are reconstructed jointly, the reproduced video quality can be higher than if the base layer is reconstructed in isolation.
Multiview coding refers to techniques where each of two or more “views” is coded in its own video sequence, and the combined decoding of the sequences (in conjunction with appropriate rendering) can offer a stereoscopic or other 3D-type viewing effect.
There can be other forms of coding techniques where the association and/or relationship of several video sequences is important for joint decoding and/or rendering, for example multiple description coding.
In the following, the description refers to scalable coding for convenience.
In scalable video coding, many forms of enhancement layer types have been reported, including temporal enhancement layers (that increase the frame rate), spatial enhancement layers (that increase the spatial resolution), and SNR enhancement layers (that increase the fidelity, which can be measured as a signal-to-noise ratio, SNR).
Referring to FIG. 1, in scalable video coding, the relationship of layers can be depicted in the form of a directed graph. In the example presented, a base layer (101) (that can, for example, be in CIF format at 15 fps) can be augmented by a temporal enhancement layer (102) (that can, for example, increase the frame rate to 30 fps). Also available can be a spatial enhancement layer (103) that increases the spatial resolution from CIF to 4CIF. Based on this spatial enhancement layer (103), a second temporal enhancement layer (104) can increase the frame rate to 30 fps.
In order to reconstruct a 4CIF, 30 fps signal, the base layer (101), the spatial enhancement layer (103), and the second temporal enhancement layer (104) should all be present. Other combinations are also possible, as indicated in the graph.
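The dependency relationship described for FIG. 1 can be sketched as a small directed graph. The following is a minimal illustration only: the dictionary name, the function name, and the assumption that the spatial enhancement layer (103) runs at 15 fps are hypothetical and are not taken from any standard or from the figure itself.

```python
# Hypothetical encoding of the FIG. 1 layer graph: each layer number maps to
# the layers it directly depends on. Resolutions and frame rates in the
# comments follow the example in the text (103 at 15 fps is an assumption).
LAYER_DEPENDENCIES = {
    101: [],      # base layer: CIF, 15 fps
    102: [101],   # temporal enhancement of 101: CIF, 30 fps
    103: [101],   # spatial enhancement of 101: 4CIF, 15 fps (assumed)
    104: [103],   # temporal enhancement of 103: 4CIF, 30 fps
}


def required_layers(target, dependencies=LAYER_DEPENDENCIES):
    """Return, in ascending order, every layer needed to reconstruct `target`."""
    needed = set()
    stack = [target]
    while stack:
        layer = stack.pop()
        if layer not in needed:
            needed.add(layer)
            # Follow the directed edges toward the base layer.
            stack.extend(dependencies[layer])
    return sorted(needed)


print(required_layers(104))  # [101, 103, 104]
print(required_layers(102))  # [101, 102]
```

Walking the graph from the target layer toward the base layer yields exactly the set of layers that should be present for that operating point.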
Layering structure information can be useful in conjunction with network elements that remove certain layers in response to network conditions. Referring to FIG. 2, shown is a sending endpoint (201) which sends a scalable video stream (that may have a structure as described before) to an application layer router (202). The application layer router can omit forwarding certain layers to endpoints (203), (204), based on its knowledge of the endpoints' capabilities, network conditions, and so on. U.S. Pat. No. 7,593,032, incorporated herein by reference in its entirety, describes exemplary techniques that can be used for the router.
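As a rough sketch of the forwarding decision such a router might make, the example below selects layers per endpoint from a known layering structure. Everything here is an assumption made for illustration: the `Endpoint` fields, the `LAYERS` table (which reuses the CIF/4CIF example values), and the rule of forwarding every layer the endpoint can use. It does not describe the techniques of the referenced patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical endpoint capabilities known to the router."""
    name: str
    max_width: int
    max_height: int
    max_fps: int


# Hypothetical layer table: id -> (width, height, fps, direct dependencies).
# CIF is 352x288 and 4CIF is 704x576.
LAYERS = {
    101: (352, 288, 15, ()),
    102: (352, 288, 30, (101,)),
    103: (704, 576, 15, (101,)),
    104: (704, 576, 30, (103,)),
}


def usable(layer_id, endpoint):
    """A layer is usable if it fits the endpoint and all its dependencies do."""
    width, height, fps, deps = LAYERS[layer_id]
    fits = (width <= endpoint.max_width
            and height <= endpoint.max_height
            and fps <= endpoint.max_fps)
    return fits and all(usable(dep, endpoint) for dep in deps)


def layers_to_forward(endpoint):
    """Return the sorted layer ids the router would forward to `endpoint`."""
    return sorted(layer_id for layer_id in LAYERS if usable(layer_id, endpoint))


small_screen = Endpoint("203", max_width=352, max_height=288, max_fps=30)
slow_decoder = Endpoint("204", max_width=704, max_height=576, max_fps=15)
print(layers_to_forward(small_screen))  # [101, 102]
print(layers_to_forward(slow_decoder))  # [101, 103]
```

A practical router would likely also weigh network conditions and pick a single operating point rather than forwarding every usable layer; the sketch only shows why the decision depends on knowing the layering structure.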
The information in each layer can be coded according to ITU-T Rec. H.264, “Advanced video coding for generic audiovisual services”, March 2010, available from the International Telecommunication Union (“ITU”), Place des Nations, CH-1211 Geneva 20, Switzerland or http://www.itu.int/rec/T-REC-H.264, and incorporated herein by reference in its entirety, and, more specifically, to H.264's scalable video coding (SVC) extension, or to other video coding technology supporting scalability, such as, for example, the forthcoming scalable extensions to “High Efficiency Video Coding” (HEVC), which is at the time of writing in the process of being standardized. At the time of this writing, the current working draft of HEVC can be found in Bross et al., “High Efficiency Video Coding (HEVC) text specification draft 6”, February 2012, available from http://phenix.it-sudparis.eu/jct/doc_end_user/documents/8_San%20Jose/wg11/JCTVC-H1003-v21.zip.
According to H.264, the bits representing each layer are encapsulated in one or more Network Abstraction Layer units (NAL units). Each NAL unit can contain a header that can indicate the layer the NAL unit belongs to.
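To illustrate how a NAL unit header can indicate the layer, the sketch below extracts layer identifiers from the first bytes of a NAL unit. The bit layout reflects this editor's understanding of the H.264 one-byte NAL unit header and the three-byte SVC extension carried by NAL unit types 14 and 20 (as also summarized in RFC 6190); the function name and the returned dictionary keys are illustrative, and the layout should be checked against the specification before any real use.

```python
def parse_nal_header(nal: bytes) -> dict:
    """Parse the NAL unit header and, if present, the SVC extension fields."""
    if len(nal) < 1:
        raise ValueError("empty NAL unit")
    first = nal[0]
    header = {
        "forbidden_zero_bit": first >> 7,      # 1 bit
        "nal_ref_idc": (first >> 5) & 0x03,    # 2 bits
        "nal_unit_type": first & 0x1F,         # 5 bits
    }
    # Types 14 (prefix NAL unit) and 20 (coded slice in scalable extension)
    # are assumed to carry three further header bytes with layer identifiers.
    if header["nal_unit_type"] in (14, 20):
        if len(nal) < 4:
            raise ValueError("truncated SVC NAL unit header")
        b1, b2, b3 = nal[1], nal[2], nal[3]
        header["priority_id"] = b1 & 0x3F            # low 6 bits of byte 1
        header["dependency_id"] = (b2 >> 4) & 0x07   # spatial/coarse-grain layer
        header["quality_id"] = b2 & 0x0F             # SNR (quality) layer
        header["temporal_id"] = (b3 >> 5) & 0x07     # temporal layer
    return header


# Hand-built example bytes, not taken from a real bitstream.
example = bytes([0x74, 0x80, 0x12, 0x2F])
print(parse_nal_header(example))
```

For the hand-built example bytes, the function reports NAL unit type 20 with `dependency_id` 1, `quality_id` 2, and `temporal_id` 1. Note that the header identifies only the layer of that one NAL unit; it says nothing about which other layers exist or how they depend on each other.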
However, without observing multiple NAL units belonging to each and every one of the layers, analyzing their content, and thereby building a “picture” of the layers available, a router lacks a mechanism to derive the layering structure as described above. Without knowledge of the layering structure, a router may not be able to make sensible choices for removing NAL units belonging to certain layers.
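The cost of learning the layers purely by observation can be sketched as follows. This is a hypothetical illustration: the input is assumed to be a sequence of (dependency_id, quality_id, temporal_id) tuples already extracted from NAL unit headers, and the function name is invented for this example.

```python
def observe_layers(nal_layer_ids):
    """Record the packet index at which each distinct layer is first seen.

    `nal_layer_ids` is an iterable of (dependency_id, quality_id, temporal_id)
    tuples, one per observed NAL unit (an assumed, simplified input).
    """
    seen = set()
    first_seen = []
    for index, ids in enumerate(nal_layer_ids):
        if ids not in seen:
            seen.add(ids)
            first_seen.append((index, ids))
    return first_seen


# Made-up observation order for a four-layer stream.
stream = [(0, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 0), (0, 0, 0), (1, 0, 1)]
for index, ids in observe_layers(stream):
    print(f"NAL unit {index}: first sighting of layer {ids}")
```

In this made-up stream the last layer only becomes visible at the sixth NAL unit, and even then the observer has learned which layer identifiers exist, not the resolution or frame rate each one represents.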
This situation was identified during the development of SVC, and an SEI message was introduced that describes the layering structure. SEI messages can have the disadvantage that network elements, according to H.264, have the freedom to remove them from the bitstream, as they are not required for the decoding process. If an intermediate network element (205), depicted here in dashed lines, were to remove the SEI messages, the router may not quickly obtain the layering structure and would have to fall back to observing all NAL units and their content.
Although not critical, the layering structure should preferably be known before the first bit containing video information arrives at the router. The RTP payload format for SVC (Wenger, Wang, Schierl, Eleftheriadis, “RTP Payload Format for Scalable Video Coding”, RFC 6190, available from http://tools.ietf.org/html/rfc6190), incorporated by reference herein in its entirety, includes a mechanism to integrate the SEI message containing the layering structure in the capability exchange messages, for example using the Session Initiation Protocol (Rosenberg et al., “SIP: Session Initiation Protocol”, RFC 3261, available from http://tools.ietf.org/html/rfc3261), incorporated by reference herein in its entirety. However, decoding an SEI message requires bit-oriented processing of video syntax, something a router is often not prepared to do. Further, intercepting the SEI message coded as part of the session signaling (in contrast to being coded in the bitstream) generally requires the router to be in the signaling path, which, for some routers, may not be a sensible, cost-effective option.
Accordingly, there is a need for a data structure that a) does not require difficult bit-oriented processing, b) is available, as part of the video bitstream, early in the bitstream transmission, and c) cannot be removed by an intermediary network element without making the video bitstream non-compliant.