This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
Video coding standards include ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC). In addition, efforts are currently underway to develop new video coding standards. One such standard under development is the scalable video coding (SVC) standard, which will become the scalable extension to H.264/AVC. Another standard under development is the multiview video coding (MVC) standard, which is also an extension of H.264/AVC. Yet another such effort involves the development of Chinese video coding standards.
The latest draft of SVC is described in JVT-T201, “Joint Draft 7 of SVC Amendment,” 20th JVT Meeting, Klagenfurt, Austria, July 2006, available from http://ftp3.itu.ch/av-arch/jvt-site/2006_07_Klagenfurt/JVT-T201.zip. The latest draft of MVC is described in JVT-T208, “Joint Multiview Video Model (JMVM) 1.0,” 20th JVT Meeting, Klagenfurt, Austria, July 2006, available from http://ftp3.itu.ch/av-arch/jvt-site/2006_07_Klagenfurt/JVT-T208.zip. Both of these documents are incorporated herein by reference in their entireties.
In scalable video coding (SVC), a video signal can be encoded into a base layer and one or more enhancement layers constructed in a layered fashion. An enhancement layer enhances the temporal resolution (i.e., the frame rate), the spatial resolution, or the quality of the video content represented by another layer or a portion of another layer. Each layer, together with its dependent layers, is one representation of the video signal at a certain spatial resolution, temporal resolution and quality level. A scalable layer together with its dependent layers is referred to as a “scalable layer representation.” The portion of a scalable bitstream corresponding to a scalable layer representation can be extracted and decoded to produce a representation of the original signal at a certain fidelity.
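The notion of a scalable layer representation can be illustrated with a minimal sketch. It reads “dependent layers” as the layers a given layer depends on, directly or indirectly, which is an interpretation rather than a definition taken from the SVC draft. The layer names and the `depends_on` mapping are hypothetical and serve only as an example.

```python
def layer_representation(target, depends_on):
    """Return the target layer plus every layer it depends on,
    directly or indirectly (one 'scalable layer representation')."""
    needed = set()
    stack = [target]
    while stack:
        layer = stack.pop()
        if layer not in needed:
            needed.add(layer)
            # Follow the dependencies of this layer as well.
            stack.extend(depends_on.get(layer, ()))
    return needed


# Hypothetical layer graph: each layer lists the layers it is predicted from.
depends_on = {
    "base": [],
    "spatial_enh": ["base"],
    "quality_enh": ["spatial_enh"],
    "temporal_enh": ["base"],
}

representation = layer_representation("quality_enh", depends_on)
```

In this sketch, extracting the representation for `quality_enh` keeps `base`, `spatial_enh` and `quality_enh`, while `temporal_enh` can be dropped from the bitstream because nothing in the requested representation depends on it.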
In some cases, data in an enhancement layer can be truncated after a certain location, or at arbitrary positions, where each truncation position may include additional data representing increasingly enhanced visual quality. Such scalability is referred to as fine-grained (granularity) scalability (FGS). In contrast to FGS, the scalability provided by those enhancement layers that cannot be truncated is referred to as coarse-grained (granularity) scalability (CGS). CGS collectively includes traditional quality (SNR) scalability and spatial scalability.
The Joint Video Team (JVT) has been in the process of developing an SVC standard as an extension to the H.264/Advanced Video Coding (AVC) standard. SVC uses the same mechanism as H.264/AVC to provide temporal scalability. In AVC, the signaling of temporal scalability information is realized by using sub-sequence-related supplemental enhancement information (SEI) messages.
SVC uses an inter-layer prediction mechanism, wherein certain information can be predicted from layers other than the currently reconstructed layer or the next lower layer. Information that can be inter-layer predicted includes intra texture, motion and residual data. Inter-layer motion prediction includes the prediction of block coding mode, header information, etc., wherein motion information from the lower layer may be used for prediction of the higher layer. In the case of intra coding, a prediction from surrounding macroblocks or from co-located macroblocks of lower layers is possible. These prediction techniques do not employ motion information and hence are referred to as intra prediction techniques. Furthermore, residual data from lower layers can also be employed for prediction of the current layer.
The elementary unit for the output of an SVC encoder and the input of an SVC decoder is a Network Abstraction Layer (NAL) unit. A series of NAL units generated by an encoder is referred to as a NAL unit stream. For transport over packet-oriented networks or storage into structured files, NAL units are typically encapsulated into packets or similar structures. In transmission or storage environments that do not provide framing structures, a bytestream format, which is similar to a start code-based bitstream structure, has been specified in Annex B of the H.264/AVC standard. The bytestream format separates NAL units from each other by attaching a start code in front of each NAL unit.
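The start code framing described above can be sketched as follows. This is a simplified illustration, not a conforming Annex B parser: it assumes the three-byte start code prefix 0x000001 (optionally preceded by a zero byte), treats zero bytes immediately before a start code as padding, and does not remove emulation prevention bytes from the NAL unit payloads. The function name `split_annex_b` and the sample bytes are hypothetical.

```python
def split_annex_b(stream):
    """Split a start-code-delimited byte stream into NAL units."""
    start_code = b"\x00\x00\x01"
    units = []
    pos = stream.find(start_code)
    while pos != -1:
        begin = pos + len(start_code)
        nxt = stream.find(start_code, begin)
        end = len(stream) if nxt == -1 else nxt
        # Zero bytes right before the next start code belong to that
        # start code (four-byte form) or are padding, not to this unit.
        unit = stream[begin:end].rstrip(b"\x00")
        if unit:
            units.append(unit)
        pos = nxt
    return units


# Three made-up NAL units, framed with four- and three-byte start codes.
stream = (
    b"\x00\x00\x00\x01\x67\x42"
    + b"\x00\x00\x01\x68\xce"
    + b"\x00\x00\x00\x01\x65\x88\x84"
)
nal_units = split_annex_b(stream)
```

Because a start code precedes every NAL unit, a receiver can recover unit boundaries from the raw byte sequence alone, without any framing from the transport or file format.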
A problem associated with this layered coding approach is that creation of small discrete layers (in terms of bit rate) leads to very poor coding efficiency, because information present in the base layer tends to be partially duplicated in the enhancement layer and is thus coded twice. On the other hand, since the size of discrete layers controls how accurately a desired bit rate or quality may be achieved, if large enhancement layers are used, the bit rate or quality cannot be controlled with much granularity. This “coarse-grained scalability” (CGS) may provide an insufficient degree of control for some applications.
To balance these two problems, the concept of medium-grained scalability (MGS) has been proposed. MGS involves the same encoder and decoder structure as CGS, but in an intermediate stage, a “quality level” is assigned to each CGS enhancement layer slice according to a rate-distortion measure. When truncating the bitstream to a desired bit rate, CGS slices from the highest quality level are discarded first, then slices from the next highest quality level, and so on until the target bit rate is achieved.
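The truncation rule can be sketched as below. This is a minimal illustration under stated assumptions: each slice is a hypothetical `(name, quality_level, size_bits)` tuple, level 0 stands for base-layer slices that are never discarded, a whole quality level is dropped at a time, and the slice labels, levels and sizes are invented for the example. How the quality levels are derived from a rate-distortion measure is outside the scope of the sketch.

```python
def truncate_to_budget(slices, budget_bits):
    """Drop quality levels, highest first, until the slices fit the budget.

    `slices` is a list of (name, quality_level, size_bits) tuples.
    Level 0 marks base-layer slices, which this sketch always keeps.
    """
    kept = list(slices)
    levels = sorted({lvl for _, lvl, _ in slices if lvl > 0}, reverse=True)
    for level in levels:
        if sum(size for _, _, size in kept) <= budget_bits:
            break  # Target reached; keep the remaining levels.
        kept = [s for s in kept if s[1] != level]
    return kept


# Hypothetical slices: label, assigned quality level, size in bits.
slices = [
    ("1A", 0, 400), ("1B", 1, 200), ("1C", 1, 150),
    ("2A", 0, 400), ("2B", 2, 200), ("2C", 3, 150),
    ("3A", 0, 400), ("3B", 1, 200), ("3C", 2, 150),
]
kept_names = [name for name, _, _ in truncate_to_budget(slices, 1900)]
```

With a budget of 1900 bits, levels 3 and 2 are discarded in turn and the slices of level 1 survive alongside the base layer, so different frames retain different numbers of enhancement slices.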
It is important to note that the number of CGS layers assigned to a given quality level may not be constant throughout the sequence but may vary from one frame to another. For example, {1A, 1B, 1C}, {2A, 2B, 2C}, {3A, 3B, 3C} may represent nine slices. The number indicates the frame number, and the letter indicates the CGS layer. The base quality of the first frame is {1A}, an intermediate quality of the first frame is formed from {1A, 1B}, and the maximum quality of the first frame is formed from {1A, 1B, 1C}. The base-layer representation of the entire three-frame sequence would consist of {1A, 2A, 3A}. Conventionally, the first CGS layer would consist of {1B, 2B, 3B}. With MGS, the first quality layer might contain {1B, 1C, 3B}, representing two CGS enhancements from the first frame, none from the second, and one from the third.
The result is that the average number of CGS layers in a sequence is not restricted to integer values, but may vary depending upon the construction of the “quality layer.” Because the CGS coding structure is used, the coding efficiency penalty is relatively minor.
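The effect on the average can be checked with a short calculation. The sketch below counts enhancement slices only (the base-layer slices 1A, 2A and 3A are excluded) and assumes the labeling of the three-frame example, where the first character of a label is the frame number. The second quality layer, `hypothetical_layer`, does not appear in the example and is included only to show a non-integer average.

```python
from collections import Counter
from fractions import Fraction

FRAMES = ("1", "2", "3")  # frame numbers of the three-frame example


def enhancements_per_frame(quality_layer):
    """Count the CGS enhancement slices kept for each frame."""
    counts = Counter(name[0] for name in quality_layer)
    return {frame: counts.get(frame, 0) for frame in FRAMES}


def average_cgs_layers(quality_layer):
    """Average number of CGS enhancement slices per frame."""
    return Fraction(len(quality_layer), len(FRAMES))


example_layer = {"1B", "1C", "3B"}             # quality layer from the example
hypothetical_layer = {"1B", "1C", "2B", "3B"}  # assumed, for illustration only

per_frame = enhancements_per_frame(example_layer)
average_example = average_cgs_layers(example_layer)
average_hypothetical = average_cgs_layers(hypothetical_layer)
```

The quality layer from the example keeps two, zero and one enhancement slices for the three frames, which averages to one CGS layer per frame. The hypothetical layer averages to 4/3, a value that a conventional per-layer truncation, where every frame keeps the same number of CGS layers, could not produce.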