This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
Video coding standards include ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual and ITU-T H.264 (also known as ISO/IEC MPEG-4 Advanced Video Coding (AVC)). In addition, efforts are currently underway to develop new video coding standards. One such standard under development is the scalable video coding (SVC) standard, which will become the scalable extension to H.264/AVC. Another such standard under development is the multi-view video coding (MVC) standard, which will become another extension to H.264/AVC.
The latest draft of the SVC standard, at the time of filing the priority patent application, the Joint Draft 10, is available in JVT-W201, “Joint Draft 10 of SVC Amendment”, 23rd JVT meeting, San Jose, USA, April 2007, available at ftp3.itu.ch/av-arch/jvt-site/2007_04_SanJose/JVT-W201.zip. The latest joint draft of MVC, at the time of filing the priority application, is available in JVT-W209, “Joint Draft 3.0 on Multiview Video Coding”, 23rd JVT meeting, San Jose, USA, April 2007, available from ftp3.itu.ch/av-arch/jvt-site/2007_04_SanJose/JVT-W209.zip.
The earliest type of scalability introduced to video coding standards was temporal scalability with B pictures in MPEG-1 Visual. In the B picture concept, a B picture is bi-predicted from two pictures, one preceding the B picture and one succeeding the B picture, both in display order. In addition, a B picture is a non-reference picture, i.e., it is not used for inter-picture prediction reference by other pictures. Consequently, the B pictures could be discarded to achieve a temporal scalability point with a lower frame rate. The same mechanism was retained in MPEG-2 Video, H.263 and MPEG-4 Visual.
In H.264/AVC, the concept of B pictures or B slices has been changed. The definition of a B slice is as follows: A slice that may be decoded using intra-prediction from decoded samples within the same slice or inter-prediction from previously decoded reference pictures, using at most two motion vectors and reference indices to predict the sample values of each block. Both the bi-directional prediction property and the non-reference picture property of the conventional B picture concept are no longer valid. A block in a B slice may be predicted from two reference pictures in the same direction in display order, and a picture consisting of B slices may be referred to by other pictures for inter-picture prediction.
In the previous video coding standards, the display order and the decoding order of the pictures were closely related, i.e., the display order was pre-determined for a decoding order. H.264/AVC, on the other hand, enables the explicit signaling of the output order of the pictures. A value of picture order count (POC) is derived from related syntax elements for each picture and is non-decreasing with increasing picture position in output order relative to the previous independent decoding refresh (IDR) picture or a picture containing a memory management control operation marking all pictures as “unused for reference.”
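The relationship between decoding order and POC-based output order can be illustrated with a minimal Python sketch. The function name and the picture labels are hypothetical, and the sketch assumes all pictures belong to a single IDR period; it ignores the buffering that a real decoder performs before output.

```python
def output_order(decoded):
    """Return picture names in output order.

    `decoded` is a list of (name, poc) tuples in decoding order, all assumed
    to lie within one IDR period so that POC values are directly comparable.
    """
    return [name for name, _poc in sorted(decoded, key=lambda item: item[1])]


# Hypothetical decoding order: the key picture with POC 8 is decoded before
# the pictures that are displayed ahead of it.
decoding_order = [("I0", 0), ("P8", 8), ("B4", 4), ("B2", 2), ("B6", 6)]
print(output_order(decoding_order))
```

Sorting by POC alone is enough here because the sketch outputs only after every picture has been decoded; an actual decoder outputs pictures incrementally.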
In H.264/AVC, SVC and MVC, temporal scalability can be achieved by using non-reference pictures and/or a hierarchical inter-picture prediction structure. By using only non-reference pictures, it is possible to achieve temporal scalability in a manner similar to using conventional B pictures in MPEG-1/2/4, by discarding non-reference pictures. A hierarchical coding structure can achieve more flexible temporal scalability.
FIG. 1 presents a typical hierarchical coding structure with four levels of temporal scalability. The display order is indicated by the values denoted as picture order count (POC). The I or P pictures, also referred to as key pictures, are coded as the first picture of a group of pictures (GOP) in decoding order. When a key picture is inter coded, the previous key pictures are used as reference for inter-picture prediction. These pictures correspond to the lowest temporal level (denoted as TL in the figure) in the temporal scalable structure and are associated with the lowest frame rate. Pictures of a higher temporal level may only use pictures of the same or lower temporal level for inter-picture prediction.
With such a hierarchical coding structure, different temporal scalability points corresponding to different frame rates can be achieved by discarding pictures of a certain temporal level value and beyond. In FIG. 1, for example, the pictures 0, 8 and 16 are of the lowest temporal level, while the pictures 1, 3, 5, 7, 9, 11, 13 and 15 are of the highest temporal level. Other pictures are assigned with other temporal levels hierarchically. These pictures of different temporal levels enable decoding of the bit stream at different frame rates. When decoding all of the temporal levels, a frame rate of 30 Hz can be obtained. Other frame rates can be obtained by discarding pictures of some temporal levels. The pictures of the lowest temporal level are associated with a frame rate of 3.75 Hz, since only every eighth picture is retained. A temporal scalable layer with a lower temporal level or a lower frame rate is referred to as a lower temporal layer.
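The assignment of temporal levels and the resulting frame rates can be worked through in a short Python sketch. It assumes a dyadic hierarchy with a constant GOP size of 8 and a full frame rate of 30 Hz, which matches the picture numbers given above; the function names are hypothetical and the sketch is not taken from any standard.

```python
def temporal_level(poc: int, gop_size: int = 8) -> int:
    """Temporal level of a picture in an assumed dyadic hierarchy.

    `gop_size` must be a power of two. Key pictures (POC a multiple of the
    GOP size) get level 0; odd POC values get the highest level.
    """
    highest_level = gop_size.bit_length() - 1  # log2(gop_size), 3 for a GOP of 8
    offset = poc % gop_size
    if offset == 0:
        return 0
    trailing_zeros = (offset & -offset).bit_length() - 1
    return highest_level - trailing_zeros


def frame_rate(max_level: int, full_rate_hz: float = 30.0, gop_size: int = 8) -> float:
    """Frame rate obtained when only levels 0..max_level are decoded."""
    highest_level = gop_size.bit_length() - 1
    return full_rate_hz / (2 ** (highest_level - max_level))


for level in range(4):
    kept = [poc for poc in range(17) if temporal_level(poc) <= level]
    print(level, frame_rate(level), kept)
```

Each additional temporal level doubles the frame rate, so the four levels correspond to 3.75, 7.5, 15 and 30 Hz under these assumptions.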
The above hierarchical B picture coding structure is the most typical coding structure for temporal scalability. However, it should be noted that much more flexible coding structures are possible. For example, the GOP size does not have to be constant over time. As another example, the temporal enhancement layer pictures do not have to be coded as B slices; they may also be coded as P slices.
Supplemental Enhancement Information (SEI) messages are syntax structures that can be included in H.264/AVC bit streams. SEI messages are not required for the decoding of the sample values in output pictures but assist in related processes, such as picture output timing, rendering, error detection, error concealment, and resource reservation. A number of SEI messages are specified in H.264/AVC, SVC, and MVC. The user data SEI messages enable organizations and companies to specify SEI messages for their own use. The H.264/AVC, SVC, or MVC standard contains the syntax and semantics for the specified SEI messages, but no process for handling the messages in the decoder is defined. Consequently, encoders are required to follow the standard when they create SEI messages, and decoders conforming to the standard are not required to process SEI messages for output order conformance.
The scalability structure in SVC is characterized by three syntax elements: temporal_id, dependency_id and quality_id. The syntax element temporal_id is used to indicate the temporal scalability hierarchy or, indirectly, the frame rate. A scalable layer representation comprising pictures of a smaller maximum temporal_id value has a smaller frame rate than a scalable layer representation comprising pictures of a greater maximum temporal_id. A given temporal layer typically depends on the lower temporal layers (i.e., the temporal layers with smaller temporal_id values) but never depends on any higher temporal layer. The syntax element dependency_id is used to indicate the coarse granular scalability (CGS) inter-layer coding dependency hierarchy (which includes both signal-to-noise and spatial scalability). At any temporal location, a picture of a smaller dependency_id value may be used for inter-layer prediction for coding of a picture with a greater dependency_id value. The syntax element quality_id is used to indicate the quality level hierarchy of a fine granular scalability (FGS) or medium granular scalability (MGS) layer. At any temporal location, and with an identical dependency_id value, a picture with quality_id equal to QL uses the picture with quality_id equal to QL-1 for inter-layer prediction. A coded slice with quality_id larger than 0 may be coded as either a truncatable FGS slice or a non-truncatable MGS slice. For simplicity, all of the data units (i.e., Network Abstraction Layer units or NAL units in the SVC context) in one access unit having an identical value of dependency_id are referred to as a dependency unit or a dependency representation. Within one dependency unit, all the data units having an identical value of quality_id are referred to as a quality unit or layer representation.
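The way the three syntax elements select a portion of the bit stream can be sketched in Python as follows. This is a simplified illustration, not the normative extraction process: the class `NalUnit` and the function `extract_operating_point` are hypothetical, and the quality limit is applied only at the target dependency_id, on the assumption that lower dependency layers should be kept whole so that inter-layer prediction is not broken.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NalUnit:
    """Hypothetical stand-in for a NAL unit header's scalability fields."""
    temporal_id: int
    dependency_id: int
    quality_id: int


def extract_operating_point(nal_units, max_temporal_id, max_dependency_id, max_quality_id):
    """Keep the NAL units needed for the requested operating point.

    A unit is dropped if it belongs to a higher temporal layer, to a higher
    dependency layer, or to a higher quality level of the target dependency
    layer. Everything else is retained (a conservative assumption).
    """
    kept = []
    for nal in nal_units:
        if nal.temporal_id > max_temporal_id:
            continue
        if nal.dependency_id > max_dependency_id:
            continue
        if nal.dependency_id == max_dependency_id and nal.quality_id > max_quality_id:
            continue
        kept.append(nal)
    return kept


stream = [
    NalUnit(0, 0, 0), NalUnit(0, 0, 1), NalUnit(0, 1, 0), NalUnit(0, 1, 1),
    NalUnit(1, 0, 0), NalUnit(1, 1, 0), NalUnit(2, 0, 0),
]
print(extract_operating_point(stream, max_temporal_id=1, max_dependency_id=1, max_quality_id=0))
```

Because a temporal layer never depends on a higher temporal layer, dropping every unit whose temporal_id exceeds the target is always safe under the rules described above.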
In H.264/AVC, the temporal level may be signaled by the sub-sequence layer number in the sub-sequence information SEI messages. The sub-sequence information SEI message maps a coded picture to a certain sub-sequence and sub-sequence layer. The sub-sequence information SEI message may also include a frame number that increments by one for each reference frame in the sub-sequence in decoding order. Furthermore, the sub-sequence information SEI message indicates whether a non-reference picture precedes the first reference picture of the sub-sequence, whether a reference picture is the first reference picture of the sub-sequence, and whether a picture is the last picture of the sub-sequence. The sub-sequence layer characteristics SEI message and the sub-sequence characteristics SEI message give statistical information, such as bit rate, on the indicated sub-sequence layer and sub-sequence, respectively. Furthermore, the dependencies between sub-sequences are indicated in the sub-sequence characteristics SEI message.
In SVC and MVC the temporal level is signaled in the Network Abstraction Layer unit header by the syntax element temporal_id. The bit rate and frame rate information for each temporal level is signaled in the scalability information SEI message.
In H.264/AVC, sub-sequence information SEI messages can be used to signal temporal scalable layers. Within one temporal layer (also referred to as a sub-sequence layer), the first picture in decoding order in a sub-sequence does not refer to any other picture in the same temporal layer. Therefore, if the decoding of the next lower layer was started from the beginning of the bit stream, the decoding can be switched to the current layer at the first picture in decoding order of any sub-sequence of the current layer. However, if the decoding of the next lower layer was not started from the beginning of the bit stream, it is possible that temporal layer switching cannot be performed at the first picture in decoding order of a sub-sequence. For example, when a first picture picA1 in decoding order of a sub-sequence of a temporal layer layerA uses a decoded picture picB1 in the next lower layer layerB for inter prediction reference, if the decoding of the next lower layer layerB is started after picture picB1 in decoding order, then switching to the temporal layer layerA cannot be performed at picA1, because picA1 cannot be correctly decoded.
layerA   . . .   picA1   picA2   picA3
layerB   picB1   picB2   . . .
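The picA1 and picB1 example can be expressed as a small Python check. The decoding order used below is a hypothetical one chosen for illustration, and the names `Picture` and `can_switch_at` are not taken from any standard. The check only looks at the direct references of the switching picture; it does not verify that the lower-layer pictures were themselves correctly decoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Picture:
    name: str
    layer: str
    refs: tuple = ()  # names of pictures used for inter prediction reference


def can_switch_at(stream, switch_name, lower_layer, start_index):
    """Return True if the switching picture's references are all available.

    `stream` lists pictures in decoding order. Decoding of `lower_layer` is
    assumed to begin at position `start_index`; earlier pictures are missing.
    """
    switch_index = next(i for i, p in enumerate(stream) if p.name == switch_name)
    decoded = {
        p.name
        for p in stream[start_index:switch_index]
        if p.layer == lower_layer
    }
    return all(ref in decoded for ref in stream[switch_index].refs)


# Hypothetical decoding order in which picA1 is predicted from picB1.
stream = [
    Picture("picB1", "layerB"),
    Picture("picB2", "layerB", refs=("picB1",)),
    Picture("picA1", "layerA", refs=("picB1",)),
    Picture("picA2", "layerA", refs=("picA1",)),
]
print(can_switch_at(stream, "picA1", "layerB", start_index=0))  # layerB decoded from the start
print(can_switch_at(stream, "picA1", "layerB", start_index=1))  # layerB decoding began after picB1
```

When decoding of layerB begins after picB1, the reference needed by picA1 is absent, so the check fails even though picA1 is the first picture of its sub-sequence.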
While a sub-sequence characteristics SEI message can be used to indicate the prediction relationship of the sub-sequences carrying pictures picA1 and picB1, its use may not be straightforward in bit stream manipulation, as it requires the constant book-keeping of sub-sequence dependencies and the mapping between pictures and sub-sequences. This is undesirable in, for example, gateways. Furthermore, the sub-sequence characteristics SEI message is not capable of indicating prediction dependencies of single pictures. Therefore, the concluded decoding starting position in the next lower temporal layer may be too conservatively selected based on the sub-sequence characteristics SEI message.
In SVC, the scalability information SEI message includes a syntax element temporal_id_nesting_flag. If temporal_id_nesting_flag is equal to 1, and if the decoder is currently decoding a temporal layer X, then the decoding can be switched from temporal_id X to temporal_id Y>X after any picture picX with temporal_id equal to X. This can be done by continuing decoding all pictures with temporal_id<=Y that follow the picture picX in decoding order. In other words, the switching of temporal layers to temporal_id Y is possible at any point, as long as all those immediately preceding pictures that have a lower temporal_id are decoded. However, it is possible to have temporal_id_nesting_flag equal to 0 in order to have a higher coding efficiency. In this case, there is no way to know at which pictures the decoding can be switched to higher temporal layers.
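The switching rule for temporal_id_nesting_flag equal to 1 can be sketched in Python. The function name is hypothetical, and the list of temporal_id values in the example assumes a dyadic GOP of eight pictures in one plausible decoding order; neither is taken from the SVC draft.

```python
def pictures_to_decode(temporal_ids, switch_after, x, y):
    """Indices of pictures decoded when switching from layer x up to layer y.

    `temporal_ids` holds the temporal_id of each picture in decoding order.
    The switch happens right after the picture at index `switch_after`, which
    must have temporal_id equal to x. Assumes temporal_id_nesting_flag == 1.
    """
    if y <= x:
        raise ValueError("target layer y must be higher than current layer x")
    if temporal_ids[switch_after] != x:
        raise ValueError("switching picture must have temporal_id equal to x")
    before = [i for i in range(switch_after + 1) if temporal_ids[i] <= x]
    after = [
        i
        for i in range(switch_after + 1, len(temporal_ids))
        if temporal_ids[i] <= y
    ]
    return before + after


# Assumed decoding order of POC 0, 8, 4, 2, 1, 3, 6, 5, 7.
temporal_ids = [0, 0, 1, 2, 3, 3, 2, 3, 3]
print(pictures_to_decode(temporal_ids, switch_after=1, x=0, y=2))
```

With the flag equal to 0 no such general rule is available, so a function of this kind could not be written without additional per-picture signaling.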
A sample grouping in the ISO base media file format and its derivatives, such as the AVC file format and the SVC file format, is an assignment of each sample in a track to be a member of one sample group, based on a grouping criterion. A sample group in a sample grouping is not limited to being contiguous samples and may contain non-adjacent samples. As there may be more than one sample grouping for the samples in a track, each sample grouping has a type field to indicate the type of grouping. Sample groupings are represented by two linked data structures: (1) a SampleToGroup box represents the assignment of samples to sample groups; (2) a SampleGroupDescription box contains a sample group entry for each sample group describing the properties of the group. There may be multiple instances of the SampleToGroup and SampleGroupDescription boxes based on different grouping criteria. These are distinguished by a type field used to indicate the type of grouping.
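The link between the two boxes can be illustrated with a minimal Python sketch. It assumes that the SampleToGroup box has already been parsed into run-length entries of (sample_count, group_description_index), where index 0 means that the sample belongs to no group of this type and other values are 1-based indices into the SampleGroupDescription entries. The function name and the example descriptions are hypothetical.

```python
def group_entry_for_sample(sample_to_group, descriptions, sample_number):
    """Look up the sample group entry for a 1-based sample number.

    `sample_to_group` is a list of (sample_count, group_description_index)
    runs; `descriptions` is the list of sample group entries. Returns None
    when the sample is not a member of any group of this grouping type.
    """
    remaining = sample_number
    for sample_count, description_index in sample_to_group:
        if remaining <= sample_count:
            if description_index == 0:
                return None
            return descriptions[description_index - 1]
        remaining -= sample_count
    return None  # samples past the last run belong to no group


# Hypothetical grouping: samples 1-2 and 6 share a group, 3-5 are ungrouped.
sample_to_group = [(2, 1), (3, 0), (1, 1), (2, 2)]
descriptions = [{"label": "group one"}, {"label": "group two"}]
print(group_entry_for_sample(sample_to_group, descriptions, 6))
```

The example also shows that a sample group need not be contiguous: samples 1, 2 and 6 map to the same entry although samples 3 to 5 lie between them.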
Each SVC Scalable Group Entry of the SVC file format documents a portion of the bit stream. Each group is associated with a tier, where tiers define a set of operating points within a track, providing information about the operating points and instructions on how to access bit stream portions. Tiers represent layers of an SVC bit stream. Each SVC Scalable Group Entry documents and describes the various possible scalable operating points present within an SVC Elementary Stream. These entries are defined using a grouping type of “scif”. Though the Scalable Group entries are contained in the SampleGroupDescription box, the grouping is not a true sample grouping, because each sample may be associated with more than one scalable group: these groups describe sections of the samples, i.e., the NAL units. As a result, there may be no SampleToGroup box of the grouping type “scif”, unless a group does, in fact, describe an entire sample. Even if a SampleToGroup box of the grouping type “scif” is present, the information is not needed for extraction of NAL units of tiers; the map groups must always document the “pattern” of NAL units within the samples.
In the SVC file format, a one-bit field is_tl_switching_point is included in the syntax structure ScalableGroupEntry( ). When is_tl_switching_point is equal to 1, the identified pictures are temporal layer switching points, such that switching from the next lower temporal layer can be operated at any of the identified pictures. These temporal layer switching points are equivalent to the first pictures in decoding order of sub-sequences signaled by sub-sequence information SEI messages. Therefore, the same problem arises as is discussed above with regard to H.264/AVC. In other words, when the decoding of the next lower layer is not started from the beginning of the bit stream, temporal layer switching may not be conducted at the indicated temporal layer switching points.