This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
Video coding standards include ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC). In addition, there are currently efforts underway with regard to the development of new video coding standards. One such standard under development is the SVC standard, which will become the scalable extension to H.264/AVC. Another standard under development is the multi-view coding standard (MVC), which is also an extension of H.264/AVC. Yet another such effort involves the development of China video coding standards.
A draft of the SVC standard is described in JVT-V201, “Joint Draft 9 of SVC Amendment”, 22nd JVT meeting, Marrakech, Morocco, January 2007. A draft of the MVC standard is described in JVT-V209, “Joint Draft 2.0 on Multiview Video Coding”, 22nd JVT meeting, Marrakech, Morocco, January 2007.
Scalable media is typically ordered into hierarchical layers of data, where a video signal can be encoded into a base layer and one or more enhancement layers. A base layer can contain an individual representation of a coded media stream such as a video sequence. Enhancement layers can contain refinement data relative to previous layers in the layer hierarchy. The quality of the decoded media stream progressively improves as enhancement layers are added to the base layer. An enhancement layer enhances the temporal resolution (i.e., the frame rate), the spatial resolution, and/or simply the quality of the video content represented by another layer or part thereof. Each layer, together with all of its dependent layers, is one representation of the video signal at a certain spatial resolution, temporal resolution and/or quality level. Therefore, the term “scalable layer representation” is used herein to describe a scalable layer together with all of its dependent layers. The portion of a scalable bitstream corresponding to a scalable layer representation can be extracted and decoded to produce a representation of the original signal at a certain fidelity.
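By way of a non-limiting illustration, the notion of a scalable layer representation can be sketched as a closure over layer dependencies. The layer names and the `DEPENDS_ON` table below are hypothetical and serve only to show how a layer is collected together with every layer it depends on, directly or indirectly.

```python
def layer_representation(layer, depends_on):
    """Return the given layer plus every layer it depends on (transitively)."""
    needed = set()
    stack = [layer]
    while stack:
        current = stack.pop()
        if current in needed:
            continue
        needed.add(current)
        # Follow the direct dependencies of the layer just added.
        stack.extend(depends_on.get(current, ()))
    return needed


# Hypothetical layer hierarchy, assumed for illustration only.
DEPENDS_ON = {
    "base": [],
    "spatial_enh": ["base"],
    "quality_enh": ["spatial_enh"],
    "temporal_enh": ["base"],
}

if __name__ == "__main__":
    print(sorted(layer_representation("quality_enh", DEPENDS_ON)))
```

Under these assumptions, extracting the representation for `quality_enh` would require the NAL units of `quality_enh`, `spatial_enh`, and `base`, but not those of `temporal_enh`.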
The earliest type of scalability introduced to video coding standards was temporal scalability with B pictures in MPEG-1 Visual. According to this B picture temporal scalability, a B picture is bi-predicted from two pictures, one preceding the B picture and the other succeeding it, both in display order. In addition, a B picture is a non-reference picture, i.e., it is not used for inter-picture prediction reference by other pictures. Consequently, B pictures can be discarded to achieve a temporal scalability point with a lower frame rate. The same mechanism was retained in MPEG-2 Video, H.263 and MPEG-4 Visual.
In H.264/AVC, the concept of B pictures or B slices has been generalized. A block in a B slice may be predicted from two reference pictures in the same direction in display order, and a picture consisting of B slices may be referred to by other pictures for inter-picture prediction. Both the bi-directional prediction property and the non-reference picture property of conventional B picture temporal scalability are no longer valid.
In H.264/AVC, SVC and MVC, temporal scalability can be achieved by using non-reference pictures and/or hierarchical inter-picture prediction structure described in greater detail below. It should be noted that by using only non-reference pictures, it is possible to achieve similar temporal scalability as that achieved by using conventional B pictures in MPEG-1/2/4. This can be accomplished by discarding non-reference pictures. Alternatively, use of a hierarchical coding structure can achieve a more flexible temporal scalability.
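As a minimal sketch of discarding non-reference pictures, the following assumes the one-byte H.264/AVC NAL unit header (forbidden_zero_bit, nal_ref_idc, nal_unit_type) and treats a coded slice NAL unit with nal_ref_idc equal to 0 as belonging to a non-reference picture. The function names are hypothetical, and the sketch is simplified in that it leaves any non-VCL NAL units associated with a dropped picture in place.

```python
def parse_nal_header(first_byte):
    """Split the first byte of an H.264/AVC NAL unit into its three fields."""
    return {
        "forbidden_zero_bit": first_byte >> 7,
        "nal_ref_idc": (first_byte >> 5) & 0x3,
        "nal_unit_type": first_byte & 0x1F,
    }


# NAL unit types 1..5 carry coded slice data in H.264/AVC.
VCL_TYPES = range(1, 6)


def drop_non_reference_pictures(nal_units):
    """Keep every NAL unit except coded slices with nal_ref_idc == 0."""
    kept = []
    for nal in nal_units:
        header = parse_nal_header(nal[0])
        is_vcl = header["nal_unit_type"] in VCL_TYPES
        if is_vcl and header["nal_ref_idc"] == 0:
            continue  # slice of a non-reference picture: safe to discard
        kept.append(nal)
    return kept


if __name__ == "__main__":
    # Header bytes only: IDR slice, reference slice, non-reference slice, SEI.
    stream = [bytes([0x65]), bytes([0x41]), bytes([0x01]), bytes([0x06])]
    print([hex(n[0]) for n in drop_non_reference_pictures(stream)])
```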
FIG. 1 illustrates a conventional hierarchical coding structure with four levels of temporal scalability. A display order is indicated by the values denoted as picture order count (POC). The I or P pictures, also referred to as key pictures, are coded as a first picture of a group of pictures (GOP) in decoding order. When a key picture is inter coded, the previous key pictures are used as a reference for inter-picture prediction. Therefore, these pictures correspond to the lowest temporal level (denoted as TL in FIG. 1) in the temporal scalable structure and are associated with the lowest frame rate. It should be noted that pictures of a higher temporal level may only use pictures of the same or lower temporal level for inter-picture prediction. With such a hierarchical coding structure, different temporal scalability corresponding to different frame rates can be achieved by discarding pictures of a certain temporal level value and beyond.
For example, referring back to FIG. 1, pictures 0, 108, and 116 are of the lowest temporal level, i.e., TL 0, while pictures 101, 103, 105, 107, 109, 111, 113, and 115 are of the highest temporal level, i.e., TL 3. The remaining pictures 102, 106, 110, and 114 are assigned to another TL in hierarchical fashion and compose a bitstream of a different frame rate. It should be noted that by decoding all of the temporal levels in a GOP, the highest frame rate can be achieved. Lower frame rates can be obtained by discarding pictures of certain temporal levels. It should be noted that a temporal scalable layer with a lower temporal level or a lower frame rate can also be referred to as a lower temporal layer.
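The relationship between display order and temporal level in such a structure can be sketched as follows. The sketch assumes a dyadic hierarchy with a constant GOP size of 8 and POC values starting at 0, which yields four temporal levels; these assumptions, and the function names, are illustrative and are not taken from FIG. 1.

```python
GOP_SIZE = 8  # assumed constant, power-of-two GOP size


def temporal_level(poc, gop_size=GOP_SIZE):
    """Temporal level of a picture in an assumed dyadic hierarchy."""
    max_level = gop_size.bit_length() - 1  # log2(8) = 3
    offset = poc % gop_size
    if offset == 0:
        return 0  # key picture
    # Count trailing zero bits: the more there are, the lower the level.
    trailing_zeros = (offset & -offset).bit_length() - 1
    return max_level - trailing_zeros


def extract(pocs, max_tl, gop_size=GOP_SIZE):
    """Discard every picture whose temporal level exceeds max_tl."""
    return [p for p in pocs if temporal_level(p, gop_size) <= max_tl]


if __name__ == "__main__":
    print([temporal_level(p) for p in range(9)])
    print(extract(range(17), 1))
```

Under these assumptions, each temporal level that is discarded halves the number of retained pictures and therefore halves the frame rate.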
The hierarchical B picture coding structure described above is a typical coding structure for temporal scalability. However, it should be noted that more flexible coding structures are possible. For example, the GOP size does not have to be constant over time. Alternatively still, temporal enhancement layer pictures do not have to be coded as B slices, but rather may be coded as P slices.
The concept of a video coding layer (VCL) and a network abstraction layer (NAL) is inherited from advanced video coding (AVC). The VCL contains the signal processing functionality of the codec, e.g., mechanisms such as transform, quantization, motion-compensated prediction, loop filter, and inter-layer prediction. A coded picture of a base or enhancement layer consists of one or more slices. The NAL encapsulates each slice generated by the VCL into one or more NAL units.
Each SVC layer is formed by NAL units, representing the coded video bits of the layer. A Real Time Transport Protocol (RTP) stream carrying only one layer would carry NAL units belonging to that layer only. An RTP stream carrying a complete scalable video bit stream would carry NAL units of a base layer and one or more enhancement layers. SVC specifies the decoding order of these NAL units.
In some cases, data in an enhancement layer can be truncated after a certain location, or at arbitrary positions, where each truncation position may include additional data representing increasingly enhanced visual quality. In cases where the truncation points are closely spaced, the scalability is said to be “fine-grained”, hence the term “fine grained (granular) scalability” (FGS). In contrast to FGS, the scalability provided by those enhancement layers that can only be truncated at certain coarse positions is referred to as “coarse-grained (granularity) scalability” (CGS). In addition, the draft SVC coding standard noted above can also support what is conventionally referred to as “medium grained (granular) scalability” (MGS). According to MGS, quality enhancement pictures are coded similarly to CGS scalable layer pictures, but can be indicated by high-level syntax elements as is similarly done with FGS layer pictures. It may be noted that enhancement layers can collectively include CGS, MGS, and FGS quality (SNR) scalability and spatial scalability.
According to H.264/AVC, an access unit comprises one primary coded picture. In some systems, detection of access unit boundaries can be simplified by inserting an access unit delimiter NAL unit into the bitstream. In SVC, an access unit may comprise multiple primary coded pictures, but at most one picture per each unique combination of dependency_id, temporal_id, and quality_id. A coded picture as described herein can refer to all of the NAL units within an access unit having particular values of dependency_id and quality_id. It is noted that the terms to be used in SVC can change. Therefore, what may be referred to as a coded picture herein may be subsequently referenced by another term, such as a layer representation.
SVC uses a mechanism similar to that used in H.264/AVC to provide hierarchical temporal scalability. In SVC, a certain set of reference and non-reference pictures can be dropped from a coded bitstream without affecting the decoding of the remaining bitstream. Hierarchical temporal scalability requires multiple reference pictures for motion compensation, i.e., there is a reference picture buffer containing multiple decoded pictures from which an encoder can select a reference picture for inter prediction. In H.264/AVC, a feature called sub-sequences enables hierarchical temporal scalability, where each enhancement layer contains sub-sequences and each sub-sequence contains a number of reference and/or non-reference pictures. Each sub-sequence comprises a number of inter-dependent pictures that can be disposed of without any disturbance to any other sub-sequence in any lower sub-sequence layer. The sub-sequence layers are hierarchically arranged based on their dependency on each other and are equivalent to temporal levels in SVC. Therefore, when a sub-sequence in the highest sub-sequence layer is disposed of, the remaining bitstream remains valid. In H.264/AVC, signaling of temporal scalability information is effectuated by using sub-sequence-related supplemental enhancement information (SEI) messages. In SVC, the temporal level hierarchy is indicated in the header of NAL units.
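For illustration, reading the temporal level from a NAL unit header might look like the sketch below. The bit layout is an assumption based on the three-byte header extension of the published H.264 scalable extension (NAL unit types 14 and 20); the draft cited above may arrange these fields differently. The function name is hypothetical.

```python
SVC_NAL_TYPES = (14, 20)  # prefix NAL unit, coded slice in scalable extension


def parse_svc_header(nal):
    """Return the SVC layer identifiers, or None for other NAL units."""
    nal_unit_type = nal[0] & 0x1F
    if nal_unit_type not in SVC_NAL_TYPES or len(nal) < 4:
        return None
    b1, b2, b3 = nal[1], nal[2], nal[3]
    return {
        "priority_id": b1 & 0x3F,         # low 6 bits of extension byte 1
        "dependency_id": (b2 >> 4) & 0x7, # 3 bits after the first flag
        "quality_id": b2 & 0xF,           # low 4 bits of extension byte 2
        "temporal_id": (b3 >> 5) & 0x7,   # top 3 bits of extension byte 3
    }


if __name__ == "__main__":
    # Assumed example: type 20, dependency_id 1, quality_id 2, temporal_id 2.
    print(parse_svc_header(bytes([0x74, 0x80, 0x12, 0x47])))
```

Because the identifiers sit at fixed positions in the header, a reader of this kind does not need to parse slice data or SEI messages in order to decide which NAL units belong to a given temporal level.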
The file format is an important element in the chain of multimedia content production, manipulation, transmission and consumption. There is a difference between the coding format and the file format. The coding format relates to the action of a specific coding algorithm that codes the content information into a bitstream. In contrast, the file format comprises a system/structure(s) for organizing a generated bitstream in such a way that it can be accessed for local decoding and playback, transferred as a file, or streamed, all utilizing a variety of storage and transport architectures. Further, the file format can facilitate the interchange and editing of the media. For example, many streaming applications require a pre-encoded bitstream on a server to be accompanied by metadata, stored in the “hint-tracks”, that assists the server to stream the video to the client. Examples of information that can be included in hint-track metadata include timing information, indications of synchronization points, and packetization hints. This information is used to reduce the operational load of the server and to maximize the end user experience.
One available media file format standard includes the object-oriented, ISO base media file format file structure, where a file can be decomposed into its constituent objects and the structure of the constituent objects can be inferred directly from their type and position. In addition, the ISO base media file format is designed to contain timed media information for a presentation in a flexible, extensible format, which facilitates interchange, management, editing, and presentation of the media. The actual files have a logical structure, a time structure, and a physical structure, although these structures need not be coupled.
The logical structure of the file can be likened to that of a “movie”, which contains a set of time-parallel tracks. The time structure of the file is represented by the tracks containing sequences of samples in time, and those sequences are mapped into a timeline of the overall movie by optional edit lists. The physical structure of the file separates the data needed for logical, time, and structural de-composition, from the media data samples themselves. This structural information is represented by the tracks documenting the logical and timing relationships of the samples and also contains pointers to where they are located. The pointers can reference the media data within the same file or within another one, referenced, for example, by a uniform resource locator.
Each media stream is contained in a track specialized for that media type (audio, video, etc.), and is further parameterized by a sample entry. The sample entry contains the “name” of the exact media type (i.e., the type of decoder needed to decode the stream) and any parameterization of that decoder that is needed. In addition, tracks are synchronized by the time stamps of the media samples. Furthermore, tracks can be linked together by track references, where the tracks can form alternatives to each other, e.g., two audio tracks containing different languages.
Some samples within a track have special characteristics or need to be individually identified, e.g., synchronization points (often a video I-frame). These synchronization points are identified by a special table in each track. More generally, the nature of dependencies between track samples can also be documented. Furthermore, a concept of named, parameterized sample groups can be utilized. These named, parameterized sample groups permit the documentation of arbitrary characteristics, which are shared by some of the samples in a track. In the SVC file format, sample groups are used to describe samples with a certain NAL unit structure.
All files start with a file-type box that defines the best use of the file and the specifications to which the file complies, which are documented as “brands.” The presence of a brand in a file-type box indicates both a claim and a permission: a claim by the file writer that the file complies with the specification; and a permission for a reader, possibly implementing only that specification, to read and interpret the file.
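A minimal sketch of reading the brands from a file-type box is given below. It assumes the common ISO base media file format layout of a 32-bit big-endian size, a four-character type `ftyp`, a major brand, a minor version, and a list of compatible brands, and it does not handle the 64-bit size variant. The function name and the sample brands are illustrative.

```python
import struct


def parse_ftyp(data):
    """Return (major_brand, minor_version, compatible_brands) from an ftyp box."""
    size, box_type = struct.unpack(">I4s", data[:8])
    if box_type != b"ftyp":
        raise ValueError("data does not start with a file-type box")
    major_brand, minor_version = struct.unpack(">4sI", data[8:16])
    # The rest of the box is a list of four-character compatible brands.
    compatible = [data[i:i + 4].decode("ascii") for i in range(16, size, 4)]
    return major_brand.decode("ascii"), minor_version, compatible


if __name__ == "__main__":
    # Hand-built 24-byte example box, assumed for illustration.
    sample = struct.pack(">I4s4sI4s4s", 24, b"ftyp", b"isom", 0, b"isom", b"avc1")
    print(parse_ftyp(sample))
```

A reader implementing only one specification could check whether its brand appears in the returned list before attempting to interpret the remainder of the file.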
In the case of the movie structure described above, the “movie” box can contain a set of “track” boxes, e.g., a track box for a video track, a track box for an audio track, and a track box for a hint track. In turn, each track can contain, for one stream, information including, but not limited to, timing, nature of the material (e.g., video, audio, etc.), visual information, initialization information (e.g., sample entry tables), and information on where coding data can be found, its size, etc. In other words, a track box can contain metadata related to the actual media content data. For example, each track can contain, among other elements, a sample table box with a sample description box, where the sample description box holds certain information, e.g., the information contained in the decoder configuration record for MPEG-4 AVC video, which is needed by the decoder in order to initialize. Furthermore, the sample table box holds a number of tables, which contain timing information and pointers to the media data. The video and audio data themselves can be stored interleaved in chunks within a media data container/box. Lastly, the hint track can contain precomputed instructions on how to process the file for streaming.
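The nesting of boxes described above can be sketched with a small recursive walker. The sketch assumes the usual box header of a 32-bit size followed by a four-character type, with a size of 1 indicating that a 64-bit size follows and a size of 0 indicating that the box extends to the end of the enclosing data. The set of container types and the helper `make_box` are assumptions made for illustration and are not exhaustive.

```python
import struct

# Box types assumed to contain child boxes directly (not exhaustive).
CONTAINERS = {b"moov", b"trak", b"mdia", b"minf", b"stbl"}


def make_box(box_type, payload=b""):
    """Hypothetical helper: build a box with a 32-bit size field."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


def walk_boxes(data, start=0, end=None, depth=0):
    """Return (depth, type) pairs for every box found, in file order."""
    end = len(data) if end is None else end
    pos, found = start, []
    while pos + 8 <= end:
        size, box_type = struct.unpack(">I4s", data[pos:pos + 8])
        header = 8
        if size == 1:    # a 64-bit size follows the type
            size = struct.unpack(">Q", data[pos + 8:pos + 16])[0]
            header = 16
        elif size == 0:  # box runs to the end of the enclosing data
            size = end - pos
        if size < header:
            break        # malformed box: stop rather than loop forever
        found.append((depth, box_type.decode("latin-1")))
        if box_type in CONTAINERS:
            found.extend(walk_boxes(data, pos + header, pos + size, depth + 1))
        pos += size
    return found


if __name__ == "__main__":
    movie = make_box(b"moov", make_box(b"trak", make_box(b"mdia")))
    print(walk_boxes(movie + make_box(b"mdat", b"\x00" * 4)))
```

In this arrangement the metadata boxes under the movie box are walked recursively, while the media data box is treated as an opaque payload that the sample tables point into.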
In addition, with SVC, it is possible to signal information regarding multiple decoding times using SEI messages. However, extracting the required decoding times from an SEI message requires a file reader to be equipped with entropy decoders. In addition, parsing of media data samples to find SEI messages that contain information regarding decoding times can also be a burden. Such requirements, therefore, can result in adding implementation and computational complexities to those servers that offer subsets of stored bitstreams. The ISO base media file format and its derivatives (e.g., the SVC file format) allow for signaling a decoding time for each sample containing one access unit. However, for scalable media, when only a subset of samples or sample subsets are required to be decoded, the decoding time of each sample or sample subset would be different than when the entire stream is to be decoded.
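The difference in decoding times can be illustrated with a small worked example. The sketch below assumes two GOPs of size 8 in a dyadic hierarchy, one time tick per sample when the entire stream is decoded, and evenly spaced decoding of whichever samples are retained. The even-spacing rule, the data, and the function name are illustrative assumptions only and are not taken from any file format specification.

```python
# (POC, temporal level) in an assumed decoding order for two GOPs of size 8.
DECODING_ORDER = [
    (8, 0), (4, 1), (2, 2), (1, 3), (3, 3), (6, 2), (5, 3), (7, 3),
    (16, 0), (12, 1), (10, 2), (9, 3), (11, 3), (14, 2), (13, 3), (15, 3),
]


def decoding_times(samples, max_tl, full_tick=1):
    """Assign evenly spaced decoding times to the samples that are retained."""
    kept = [s for s in samples if s[1] <= max_tl]
    # Spread the retained samples over the duration of the full stream.
    tick = full_tick * len(samples) / len(kept)
    return [(poc, index * tick) for index, (poc, _) in enumerate(kept)]


if __name__ == "__main__":
    print(decoding_times(DECODING_ORDER, 3)[:4])  # entire stream
    print(decoding_times(DECODING_ORDER, 1))      # temporal levels 0 and 1 only
```

Under these assumptions, the picture with POC 4 is decoded at tick 1 when the entire stream is decoded but at tick 4 when only temporal levels 0 and 1 are decoded, so a single decoding time stored per sample cannot serve both cases.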