This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
Video coding standards include ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC). In addition, efforts are currently underway to develop new video coding standards. One such standard under development is the SVC standard, which will become the scalable extension to H.264/AVC. Another standard under development is the multi-view coding standard (MVC), which is also an extension of H.264/AVC. Yet another such effort involves the development of China video coding standards.
The latest draft of the SVC is described in JVT-U201, “Joint Draft 8 of SVC Amendment”, 21st JVT meeting, HangZhou, China, October 2006, available at ftp3.itu.ch/av-arch/jvt-site/2006_10_Hangzhou/JVT-U201.zip. The latest draft of MVC is described in JVT-U209, “Joint Draft 1.0 on Multiview Video Coding”, 21st JVT meeting, HangZhou, China, October 2006, available at ftp3.itu.ch/av-arch/jvt-site/2006_10_Hangzhou/JVT-U209.zip. Both of these documents are incorporated herein by reference in their entireties.
Scalable media is typically ordered into hierarchical layers of data. A base layer contains an individual representation of a coded media stream such as a video sequence. Enhancement layers contain refinement data relative to previous layers in the layer hierarchy. The quality of the decoded media stream progressively improves as enhancement layers are added to the base layer. An enhancement layer enhances the temporal resolution (i.e., the frame rate), the spatial resolution, or simply the quality of the video content represented by another layer or part thereof. Each layer, together with all of its dependent layers, is one representation of the video signal at a certain spatial resolution, temporal resolution and quality level. Therefore, the term “scalable layer representation” is used herein to describe a scalable layer together with all of its dependent layers. The portion of a scalable bitstream corresponding to a scalable layer representation can be extracted and decoded to produce a representation of the original signal at a certain fidelity.
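The notion of a scalable layer representation can be illustrated with a minimal Python sketch. The integer layer identifiers, the DEPENDS_ON table and the (layer_id, payload) tuples are hypothetical and serve only as an illustration; they are not SVC syntax.

```python
def layer_representation(layer, depends_on):
    """Return the given layer plus every layer it depends on, directly or indirectly."""
    needed = set()
    stack = [layer]
    while stack:
        current = stack.pop()
        if current in needed:
            continue
        needed.add(current)
        stack.extend(depends_on.get(current, ()))
    return needed


def extract(nal_units, layer, depends_on):
    """Keep only the units that belong to the requested layer representation."""
    keep = layer_representation(layer, depends_on)
    return [unit for unit in nal_units if unit[0] in keep]


# Hypothetical hierarchy: layer 0 is the base layer, layer 2 builds on 1, layer 3 on 0.
DEPENDS_ON = {0: [], 1: [0], 2: [1], 3: [0]}

# Hypothetical stream of (layer_id, payload) pairs.
STREAM = [(0, "base"), (1, "enh-1"), (2, "enh-2"), (3, "enh-3")]
```

In this sketch, extracting layer 2 keeps the units of layers 0, 1 and 2, whereas extracting layer 3 keeps only the units of layers 0 and 3.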
The concept of a video coding layer (VCL) and a network abstraction layer (NAL) is inherited from advanced video coding (AVC). The VCL contains the signal processing functionality of the codec, i.e., mechanisms such as transform, quantization, motion-compensated prediction, loop filtering and inter-layer prediction. A coded picture of a base or enhancement layer consists of one or more slices. The NAL encapsulates each slice generated by the VCL into one or more NAL units. A NAL unit comprises a NAL unit header and a NAL unit payload. The NAL unit header includes the NAL unit type indicating whether the NAL unit contains a coded slice, a coded slice data partition, a sequence or picture parameter set, etc. A NAL unit stream is a concatenation of a number of NAL units. An encoded bitstream according to H.264/AVC or its extensions, e.g. SVC, is either a NAL unit stream or a byte stream formed by prefixing a start code to each NAL unit in a NAL unit stream.
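As a minimal Python sketch of the NAL unit structure described above, the following assumes the one-byte H.264/AVC NAL unit header (forbidden_zero_bit, nal_ref_idc, nal_unit_type) and the three-byte start code 0x000001. The function names are hypothetical, and the NAL units are assumed to have emulation prevention already applied.

```python
START_CODE = b"\x00\x00\x01"  # three-byte start code prefix


def parse_nal_header(nal_unit: bytes) -> dict:
    """Split the first byte of a NAL unit into its three header fields."""
    if not nal_unit:
        raise ValueError("empty NAL unit")
    first = nal_unit[0]
    return {
        "forbidden_zero_bit": first >> 7,      # 1 bit
        "nal_ref_idc": (first >> 5) & 0x03,    # 2 bits
        "nal_unit_type": first & 0x1F,         # 5 bits
    }


def to_byte_stream(nal_units) -> bytes:
    """Form a byte stream by prefixing a start code to each NAL unit."""
    return b"".join(START_CODE + unit for unit in nal_units)
```

For example, a NAL unit whose first byte is 0x65 has nal_ref_idc equal to 3 and nal_unit_type equal to 5.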
Each SVC layer is formed by NAL units, representing the coded video bits of the layer. A Real Time Transport Protocol (RTP) stream carrying only one layer would carry NAL units belonging to that layer only. An RTP stream carrying a complete scalable video bit stream would carry NAL units of a base layer and one or more enhancement layers. SVC specifies the decoding order of these NAL units.
In some cases, data in an enhancement layer can be truncated after a certain location, or at arbitrary positions, where each truncation position may include additional data representing increasingly enhanced visual quality. In cases where the truncation points are closely spaced, the scalability is said to be “fine-grained,” hence the term “fine-grained (granularity) scalability” (FGS). In contrast to FGS, the scalability provided by those enhancement layers that can only be truncated at certain coarse positions is referred to as “coarse-grained (granularity) scalability” (CGS).
According to the H.264/AVC video coding standard, an access unit comprises one primary coded picture. In some systems, detection of access unit boundaries can be simplified by inserting an access unit delimiter NAL unit into the bitstream. In SVC, an access unit may comprise multiple primary coded pictures, but at most one picture for each unique combination of dependency_id, temporal_level, and quality_level.
A coded video bitstream may include extra information to enhance the use of the video for a wide variety of purposes. For example, supplemental enhancement information (SEI) and video usability information (VUI), as defined in H.264/AVC, provide such functionality. The H.264/AVC standard and its extensions include support for SEI signaling through SEI messages. SEI messages are not required by the decoding process to generate correct sample values in output pictures. Rather, they are helpful for other purposes, e.g., error resilience and display. H.264/AVC contains the syntax and semantics for the specified SEI messages, but no process for handling the messages in the recipient is defined. Consequently, encoders are required to follow the H.264/AVC standard when they create SEI messages, and decoders conforming to the H.264/AVC standard are not required to process SEI messages for output order conformance. One of the reasons to include the syntax and semantics of SEI messages in H.264/AVC is to allow system specifications, such as 3GPP multimedia specifications and DVB specifications, to interpret the supplemental information identically and hence interoperate. It is intended that system specifications can require the use of particular SEI messages both at the encoding end and at the decoding end, and the process for handling SEI messages in the recipient may be specified for the application in a system specification.
SVC uses a mechanism similar to that used in H.264/AVC to provide hierarchical temporal scalability. In SVC, a certain set of reference and non-reference pictures can be dropped from a coded bitstream without affecting the decoding of the remaining bitstream. Hierarchical temporal scalability requires multiple reference pictures for motion compensation, i.e., there is a reference picture buffer containing multiple decoded pictures from which an encoder can select a reference picture for inter prediction. In H.264/AVC, a feature called sub-sequences enables hierarchical temporal scalability, where each enhancement layer contains sub-sequences and each sub-sequence contains a number of reference and/or non-reference pictures. The sub-sequence also comprises a number of inter-dependent pictures that can be disposed of without any disturbance to any other sub-sequence in any lower sub-sequence layer. The sub-sequence layers are hierarchically arranged based on their dependency on each other. Therefore, when a sub-sequence in the highest enhancement layer is disposed of, the remaining bitstream remains valid. In H.264/AVC, signaling of temporal scalability information is effectuated by using sub-sequence-related SEI messages. In SVC, the temporal level hierarchy is indicated in the header of NAL units.
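The dropping of pictures by temporal level can be sketched in Python as follows. The Picture tuple, the picture names and the example hierarchy are hypothetical; the sketch only assumes that every picture carries a temporal_level and that no picture references a picture of a higher temporal level.

```python
from collections import namedtuple

# Hypothetical picture record: a label and the temporal level from the NAL unit header.
Picture = namedtuple("Picture", "name temporal_level")


def prune_temporal(pictures, max_temporal_level):
    """Drop every picture above the requested temporal level, keeping decoding order."""
    return [p for p in pictures if p.temporal_level <= max_temporal_level]


# Hypothetical hierarchical structure: level 0 holds the key pictures.
GOP = [
    Picture("I0", 0),
    Picture("P4", 0),
    Picture("B2", 1),
    Picture("b1", 2),
    Picture("b3", 2),
]
```

In this example, pruning to temporal level 1 removes b1 and b3 and halves the frame rate, and pruning to level 0 leaves only the key pictures I0 and P4.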
In addition, SVC uses an inter-layer prediction mechanism, whereby certain information can be predicted from layers other than a currently reconstructed layer or a next lower layer. Information that could be inter-layer predicted includes intra texture, motion and residual data. Inter-layer motion prediction also includes the prediction of block coding mode, header information, etc., where motion information from a lower layer may be used for predicting a higher layer. It is also possible to use intra coding in SVC, i.e., a prediction from surrounding macroblocks or from co-located macroblocks of lower layers. Such prediction techniques do not employ motion information and hence, are referred to as intra prediction techniques. Furthermore, residual data from lower layers can also be employed for predicting the current layer.
SVC, as described above, involves the encoding of a “base layer” with some minimal quality, as well as the encoding of enhancement information that increases the quality up to a maximum level. The base layer of SVC streams is typically advanced video coding (AVC)-compliant. In other words, AVC decoders can decode the base layer of an SVC stream and ignore SVC-specific data. This feature has been realized by specifying coded slice NAL unit types that are specific to SVC, were reserved for future use in AVC, and must be skipped according to the AVC specification.
An instantaneous decoding refresh (IDR) picture of H.264/AVC contains only intra-coded slices and causes all reference pictures except for the current picture to be marked as “unused for reference.” A coded video sequence is defined as a sequence of consecutive access units in decoding order from an IDR access unit, inclusive, to the next IDR access unit, exclusive, or to the end of the bitstream, whichever appears earlier. A group of pictures (GOP) in H.264/AVC refers to a number of pictures that are contiguous in decoding order, starting with an intra coded picture, ending with the first picture (exclusive) of the next GOP or coded video sequence in decoding order. All of the pictures within the GOP following the intra picture in output order can be correctly decoded, regardless of whether any previous pictures were decoded. An open GOP is such a group of pictures in which pictures preceding the initial intra picture in output order may not be correctly decodable. An H.264/AVC decoder can recognize an intra picture starting an open GOP from the recovery point SEI message in the H.264/AVC bitstream. The picture starting an open GOP is referred to herein as an open decoding refresh (ODR) picture. A closed GOP is such a group of pictures in which all pictures can be correctly decoded. In H.264/AVC, a closed GOP starts from an IDR access unit.
Coded pictures can be represented by an index, tl0_pic_idx. The index, tl0_pic_idx, is indicative of NAL units in an SVC bitstream with the same value of dependency_id and quality_level in one access unit, where temporal_level is equal to zero. For an IDR picture with temporal_level equal to zero, the value of tl0_pic_idx is equal to zero or any value in the range of 0 to N−1, inclusive, where N is a positive integer. For any other picture with temporal_level equal to zero, the value of tl0_pic_idx is equal to (tl0_pic_idx_0+1) % N, where tl0_pic_idx_0 is the value of tl0_pic_idx of a previous picture with temporal_level equal to 0, and % denotes a modulo operation. In the current SVC specification, tl0_pic_idx is included in the NAL unit header as a conditional field. A receiver or a media-aware network element (MANE) can examine the tl0_pic_idx values to determine whether it has received all the key pictures (i.e. pictures with temporal_level equal to 0). If a key picture is lost, feedback may be sent to inform the encoder, which in turn may take some repair action, e.g. retransmitting the lost key picture.
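The modulo rule and the loss check described above can be sketched in Python as follows. The function names are hypothetical, and the value N = 256 in the example is an assumption made only for illustration. The sketch covers non-IDR key pictures only, since an IDR picture may carry any index value.

```python
def next_tl0_pic_idx(previous: int, n: int) -> int:
    """Index expected for the next non-IDR key picture: (previous + 1) % N."""
    return (previous + 1) % n


def missing_key_pictures(previous: int, received: int, n: int) -> int:
    """Count the key pictures lost between two consecutively received indices.

    A result of 0 means the received picture directly follows the previous one.
    Losses of N or more consecutive key pictures cannot be distinguished.
    """
    return (received - previous - 1) % n
```

For example, with N equal to 256, receiving index 1 directly after index 254 indicates that the two key pictures with indices 255 and 0 were lost, which a receiver could report as feedback.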
The RTP payload format for H.264/AVC is specified in Request for Comments (RFC) 3984 (available at www.rfc-editor.org/rfc/rfc3984.txt), and the draft RTP payload format for SVC is specified in the Internet Engineering Task Force (IETF) Internet-Draft draft-ietf-avt-rtp-svc-00 (available at tools.ietf.org/id/draft-ietf-avt-rtp-svc-00.txt).
RFC 3984 specifies several packetization modes, one of which is the interleaved mode. If the interleaved packetization mode is in use, then NAL units from more than one access unit can be packetized into one RTP packet. RFC 3984 also specifies the concept of a decoding order number (DON) that indicates the decoding order of the NAL units conveyed in an RTP stream.
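How a receiver might restore decoding order from DON values can be sketched in Python as follows. The sketch assumes that DON is a 16-bit counter that wraps around modulo 65536; the function names and the (don, payload) tuples are hypothetical, and the normative comparison rule is the one given in RFC 3984, which should be consulted for the exact boundary behavior.

```python
DON_MODULUS = 65536  # assumption: DON is a 16-bit value that wraps around


def don_distance(don_m: int, don_n: int) -> int:
    """Signed wraparound distance from don_m to don_n.

    A positive result means the unit with don_n follows the unit with don_m
    in decoding order; a negative result means it precedes it.
    """
    diff = (don_n - don_m) % DON_MODULUS
    return diff if diff < DON_MODULUS // 2 else diff - DON_MODULUS


def decoding_order(units):
    """Sort (don, payload) pairs by their distance from the first unit's DON."""
    if not units:
        return []
    anchor = units[0][0]
    return sorted(units, key=lambda unit: don_distance(anchor, unit[0]))
```

Sorting relative to an anchor value, rather than on the raw DON, keeps units in the right order when the counter wraps from 65535 to 0 within the set being reordered.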
In the SVC RTP payload format draft, a new NAL unit type, referred to as the payload content scalability information (PACSI) NAL unit, is specified. The PACSI NAL unit, if present, is the first NAL unit in an aggregation packet, and it is not present in other types of packets. The PACSI NAL unit indicates scalability characteristics that are common to all the remaining NAL units in the payload, thus making it easier for MANEs to decide whether to forward, process or discard the aggregation packet. Senders may create PACSI NAL units, and receivers may ignore them or use them as hints to enable efficient aggregation packet processing. When the first aggregation unit of an aggregation packet contains a PACSI NAL unit, there is at least one additional aggregation unit present in the same packet. The RTP header fields are set according to the remaining NAL units in the aggregation packet. When a PACSI NAL unit is included in a multi-time aggregation packet, the decoding order number for the PACSI NAL unit is set to indicate either that the PACSI NAL unit is the first NAL unit in decoding order among the NAL units in the aggregation packet, or that the PACSI NAL unit has a decoding order number identical to that of the first NAL unit in decoding order among the remaining NAL units in the aggregation packet. The structure of the PACSI NAL unit is the same as the four-byte SVC NAL unit header (where E is equal to 0), described below.