Multimedia applications include local playback, streaming or on-demand, conversational and broadcast/multicast services. Technologies involved in multimedia applications include, for example, media coding, storage and transmission. Media types include speech, audio, image, video, graphics and timed text. Different standards have been specified for different technologies.
Video coding standards include ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC). In addition, new video coding standards are being developed. For example, the development of a scalable video coding (SVC) standard is currently underway. This standard will become the scalable extension to H.264/AVC. The development of China video coding standards is also currently underway.
Scalable video coding can provide scalable video bit streams. A portion of a scalable video bit stream can be extracted and decoded with a degraded playback visual quality. A scalable video bit stream contains a non-scalable base layer and one or more enhancement layers. An enhancement layer may enhance the temporal resolution (i.e. the frame rate), the spatial resolution, or simply the quality of the video content represented by the lower layer or part thereof. In some cases, data of an enhancement layer can be truncated after a certain location, even at arbitrary positions. Each truncation position may include some additional data representing increasingly enhanced visual quality. Such scalability is referred to as fine-grained (granularity) scalability (FGS). In contrast to FGS, the scalability provided by a quality enhancement layer that does not provide fine-grained scalability is referred to as coarse-grained scalability (CGS).
The scalable layer structure in the current draft SVC standard is characterized by three variables, referred to as temporal_level, dependency_id and quality_level, that are signaled in the bit stream or can be derived according to the specification. temporal_level is used to indicate the temporal scalability or frame rate. A layer comprising pictures of a smaller temporal_level value has a smaller frame rate than a layer comprising pictures of a larger temporal_level. dependency_id is used to indicate the inter-layer coding dependency hierarchy. At any temporal location, a picture of a smaller dependency_id value may be used for inter-layer prediction for coding of a picture with a larger dependency_id value. quality_level is used to indicate FGS layer hierarchy. At any temporal location and with identical dependency_id value, an FGS picture with quality_level value equal to QL uses the FGS picture or base quality picture (i.e., the non-FGS picture when QL-1=0) with quality_level value equal to QL-1 for inter-layer prediction.
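By way of illustration only, the three variables and the quality_level rule described above can be modeled as a small record. The following Python sketch rests on stated assumptions: the names `LayerId`, `fgs_reference` and `has_lower_frame_rate` are hypothetical and do not appear in the draft SVC standard, and only the relationships stated in the preceding paragraph are captured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LayerId:
    """Hypothetical container for the three SVC layer variables."""
    temporal_level: int
    dependency_id: int
    quality_level: int


def fgs_reference(layer: LayerId) -> Optional[LayerId]:
    """Return the layer an FGS picture uses for inter-layer prediction.

    A picture with quality_level QL references the picture at the same
    temporal location and dependency_id with quality_level QL-1.
    A base quality picture (quality_level 0) has no FGS reference.
    """
    if layer.quality_level == 0:
        return None
    return LayerId(layer.temporal_level, layer.dependency_id,
                   layer.quality_level - 1)


def has_lower_frame_rate(a: LayerId, b: LayerId) -> bool:
    """True if layer a has a smaller temporal_level, and thus frame rate, than b."""
    return a.temporal_level < b.temporal_level
```

For instance, under these assumptions `fgs_reference(LayerId(0, 0, 1))` yields `LayerId(0, 0, 0)`, i.e. the base quality picture.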
FIG. 1 shows a temporal segment of an exemplary scalable video stream with the displayed values of the three variables discussed above. It should be noted that the time values are relative, i.e. time=0 does not necessarily mean the time of the first picture in display order in the bit stream. A typical prediction reference relationship of the example is shown in FIG. 2, where solid arrows indicate the inter prediction reference relationship in the horizontal direction, and dashed block arrows indicate the inter-layer prediction reference relationship. In each case, the instance at the head of an arrow (the pointed-to instance) uses the instance at the tail of the arrow as its prediction reference.
As discussed herein, a layer is defined as the set of pictures having identical values of temporal_level, dependency_id and quality_level, respectively. To decode and play back an enhancement layer, typically the lower layers including the base layer should also be available, because the lower layers may be directly or indirectly used for inter-layer prediction in the coding of the enhancement layer. For example, in FIGS. 1 and 2, the pictures with (t, T, D, Q) equal to (0, 0, 0, 0) and (8, 0, 0, 0) belong to the base layer, which can be decoded independently of any enhancement layers. The picture with (t, T, D, Q) equal to (4, 1, 0, 0) belongs to an enhancement layer that doubles the frame rate of the base layer; the decoding of this layer needs the presence of the base layer pictures. The pictures with (t, T, D, Q) equal to (0, 0, 0, 1) and (8, 0, 0, 1) belong to an enhancement layer that enhances the quality and bit rate of the base layer in the FGS manner; the decoding of this layer also needs the presence of the base layer pictures.
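The dependency of an enhancement layer on its lower layers can be sketched as a simple extraction over the five pictures named in the example above. This is a minimal sketch under a simplifying assumption: a sub-stream for a target layer keeps every picture whose three values do not exceed the target values. The names `Picture` and `extract` are hypothetical, and an actual extractor may apply a more detailed rule.

```python
from typing import List, NamedTuple


class Picture(NamedTuple):
    """(t, T, D, Q) tuple as used in the example above."""
    t: int  # relative time
    T: int  # temporal_level
    D: int  # dependency_id
    Q: int  # quality_level


def extract(stream: List[Picture], T: int, D: int, Q: int) -> List[Picture]:
    """Keep the target layer and every lower layer (simplifying assumption)."""
    return [p for p in stream if p.T <= T and p.D <= D and p.Q <= Q]


# The five pictures named in the example above, in time order.
stream = [
    Picture(0, 0, 0, 0), Picture(0, 0, 0, 1),
    Picture(4, 1, 0, 0),
    Picture(8, 0, 0, 0), Picture(8, 0, 0, 1),
]

base = extract(stream, T=0, D=0, Q=0)          # base layer only
doubled_rate = extract(stream, T=1, D=0, Q=0)  # base layer plus temporal enhancement
fgs_enhanced = extract(stream, T=0, D=0, Q=1)  # base layer plus FGS enhancement
```

In each of the two enhanced sub-streams the base layer pictures remain present, consistent with the requirement that they be available for decoding the enhancement layer.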
In the H.264/AVC standard, an instantaneous decoding refresh (IDR) picture is defined as follows: a coded picture in which all slices are I or SI slices that causes the decoding process to mark all reference pictures as “unused for reference” immediately after decoding the IDR picture. After the decoding of an IDR picture, all following coded pictures in decoding order can be decoded without inter prediction from any picture decoded prior to the IDR picture. The first picture of each coded video sequence is an IDR picture.
The concept of an IDR picture is also used in the current draft SVC standard, wherein the definition is applicable to pictures with identical values of dependency_id and quality_level, respectively. In other words, an IDR picture is a coded picture in which the decoding of the IDR picture and all the following coded pictures in decoding order in the same layer (i.e. with the same values of dependency_id and quality_level, respectively, as the IDR picture) can be performed without inter prediction from any picture prior to the IDR picture in decoding order in the same layer. An IDR picture causes the decoding process to mark all reference pictures in the same layer as “unused for reference” immediately after decoding the IDR picture. It should be noted that as used herein for the context of the current draft SVC standard, the term “in the same layer” means that the decoded pictures have the same values of dependency_id and quality_level, respectively, as the IDR picture. Either all pictures with an identical value of picture order count (i.e. at the same temporal location) but different values of dependency_id or quality_level are coded as IDR pictures, or no picture for a specific value of picture order count is coded as an IDR picture. In other words, either all pictures in an access unit (including all of the pictures with an identical value of picture order count) are IDR pictures or no picture in an access unit is an IDR picture.
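The two properties stated above, namely the all-or-none constraint within an access unit and the per-layer marking of reference pictures, can be illustrated with the following Python sketch. It is a simplified model under stated assumptions: the function names are hypothetical, a layer is keyed by (dependency_id, quality_level), and marking pictures as “unused for reference” is modeled by emptying a list.

```python
from typing import Dict, Iterable, List, Tuple

LayerKey = Tuple[int, int]  # (dependency_id, quality_level)


def idr_constraint_holds(is_idr_flags: Iterable[bool]) -> bool:
    """True if all pictures in an access unit are IDR pictures, or none is."""
    flags = list(is_idr_flags)
    return all(flags) or not any(flags)


def drop_references_after_idr(reference_pictures: Dict[LayerKey, List[int]],
                              idr_layer: LayerKey) -> None:
    """Model the marking step for one layer only.

    reference_pictures maps each layer to the picture order counts of its
    current reference pictures; other layers are left untouched.
    """
    reference_pictures[idr_layer] = []
```

An access unit with flags `[True, False]` would violate the constraint in this model, whereas `[True, True]` and `[False, False]` would satisfy it.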
Available media file format standards include ISO file format (ISO/IEC 14496-12), MPEG-4 file format (ISO/IEC 14496-14), AVC file format (ISO/IEC 14496-15) and 3GPP file format (3GPP TS 26.244). The SVC file format is currently under development by ISO/IEC MPEG and can be found at MPEG N7477, “VM Study Text for Scalable Video Coding (SVC) File Format”, 73rd ISO/IEC MPEG meeting, Poznan, Poland, July 2005, incorporated by reference herein in its entirety.
One advantage of scalable coding compared to single-layer coding is that, with scalable coding, a single stream can meet different requirements of quality, bit rate, display size, etc., while with single-layer coding, multiple streams must be used. Using multiple streams costs more storage space and, in simulcast, more transmission bandwidth. In streaming applications, stream adaptation is needed when the capabilities of the transmission network or recipient(s) change compared to their earlier states, e.g. a change of transmission bandwidth. Gateways and other media-aware network elements (MANEs) could also perform stream adaptation. When a scalably coded file is played “locally” (i.e., the file resides in the same device as the decoder or resides in memory connected with a fast link to the decoding device), stream adaptation may be needed if the decoder shares computational resources with some other processing. For example, if decoding is performed on a general-purpose processor running a multi-process operating system, the decoder may be able to use full computational power at one time and decode all the scalable layers. At another time, however, it may have only a subset of the processor's computational power available and may therefore decode only a subset of the available scalable layers. The adapted stream may have a changed bit rate, frame rate, and/or video resolution. With single-layer coding, stream adaptation can be performed through stream switching or transcoding. With a single scalable stream, stream adaptation can be performed through layer switching.
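As a purely illustrative sketch of stream adaptation through layer switching, the following Python code selects, from a set of layer combinations of a single scalable stream, the one with the highest bit rate that fits the currently available bandwidth. The function name, the labels and all bit rate values are invented for illustration and are not taken from any standard or measurement.

```python
from typing import List, Optional, Tuple

# (label, cumulative bit rate in kbit/s); values are invented for illustration.
OperatingPoint = Tuple[str, int]


def select_operating_point(points: List[OperatingPoint],
                           available_kbps: int) -> Optional[OperatingPoint]:
    """Return the highest-rate operating point that fits, or None if none fits."""
    fitting = [p for p in points if p[1] <= available_kbps]
    return max(fitting, key=lambda p: p[1]) if fitting else None


# Hypothetical layer combinations of one scalable stream.
points = [
    ("base", 128),
    ("base+temporal", 192),
    ("base+temporal+FGS", 384),
]
```

When the available bandwidth changes, the selection is simply repeated on the same stream, whereas single-layer coding would require a separate stream or transcoding for each bit rate.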
In scalable coding, high-to-low layer switching can be performed at any location. However, the case is different for low-to-high layer switching, since decoding of the switch-to picture in the high layer typically requires the presence in the same layer of some previous pictures in decoding order.
For the current draft SVC standard, low-to-high layer switching can be performed at an IDR access unit (including IDR pictures). However, relying on IDR access units causes either reduced coding efficiency, due to frequent coding of IDR access units, or non-prompt stream adaptation, due to long intervals between IDR access units. Both of these issues are closely related to the end user experience. It is also theoretically possible to utilize the SP/SI picture coding or gradual decoding refresh technique to enable low-to-high layer switching. However, these techniques were designed for single-layer coding. Therefore, these techniques are not currently workable for scalable coding. Furthermore, even after these techniques are extended for use in scalable coding, their application will result in either additional coding constraints (equivalent to lower coding efficiency) or increased implementation complexity.
There is therefore a need for supporting simple and efficient low-to-high layer switching in scalable video coding. Furthermore, there is also a need for enabling the signaling of simple and efficient low-to-high layer switching at the file format level such that no parsing and analysis of the video bit stream is required to find the places for low-to-high layer switching, as parsing and analysis of the stream could require complex computations.
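The asymmetry between the two switching directions, and the adaptation delay incurred when low-to-high switching must wait for an IDR access unit, can be sketched as follows. This Python example is illustrative only: the function name is hypothetical, and the IDR interval of eight access units is an invented value.

```python
from typing import List, Optional


def earliest_switch_index(is_idr_access_unit: List[bool],
                          request_index: int,
                          low_to_high: bool) -> Optional[int]:
    """Index of the first access unit at which a requested switch can take effect.

    High-to-low switching is allowed at any location, so it takes effect at
    request_index. Low-to-high switching waits for the next IDR access unit
    at or after request_index; None means no such access unit exists.
    """
    if not low_to_high:
        return request_index
    for i in range(request_index, len(is_idr_access_unit)):
        if is_idr_access_unit[i]:
            return i
    return None


# An IDR access unit every 8 access units (an invented interval).
idr_flags = [i % 8 == 0 for i in range(24)]

# Delay, in access units, of a low-to-high switch requested at index 9.
delay = earliest_switch_index(idr_flags, 9, low_to_high=True) - 9
```

Shortening the IDR interval reduces this delay at the cost of coding more IDR access units, which is the trade-off described above.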