This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
Multimedia applications include services such as local playback, streaming or on-demand, conversational and broadcast/multicast services. Technologies involved in multimedia applications include, among others, media coding, storage and transmission. Different standards have been specified for different technologies.
Scalable coding produces scalable media streams, where a stream can be coded in multiple layers. In scalable coding, each layer, together with the required lower layers, is one representation of the media sequence at a certain spatial resolution, temporal resolution, certain quality level or some combination of the three. A portion of a scalable stream can be extracted and decoded at a desired spatial resolution, temporal resolution, certain quality level or some combination thereof. A scalable stream contains a non-scalable base layer and one or more enhancement layers. An enhancement layer may enhance the temporal resolution (i.e., the frame rate), the spatial resolution, or simply the quality of the video content represented by a lower layer or part thereof. In some cases, data in an enhancement layer can be truncated after a certain location, or even at arbitrary positions, where each truncation position may include additional data representing increasingly enhanced visual quality.
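The extraction of a decodable portion of a scalable stream can be illustrated with a minimal Python sketch. The `Layer` record, its `depends_on` field, and the example resolutions and frame rates are hypothetical and are not taken from any coding standard; the sketch only shows that decoding a layer requires that layer together with every lower layer it depends on.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    layer_id: int
    description: str
    depends_on: list = field(default_factory=list)  # ids of directly required lower layers


def layers_to_decode(layers, target_id):
    """Return the ids of the target layer and every lower layer it requires."""
    by_id = {layer.layer_id: layer for layer in layers}
    needed, stack = set(), [target_id]
    while stack:
        current = stack.pop()
        if current not in needed:
            needed.add(current)
            stack.extend(by_id[current].depends_on)
    return sorted(needed)


# Hypothetical four-layer stream: one base layer and three enhancement layers.
stream = [
    Layer(0, "base layer: QCIF, 15 fps"),
    Layer(1, "temporal enhancement: QCIF, 30 fps", depends_on=[0]),
    Layer(2, "spatial enhancement: CIF, 15 fps", depends_on=[0]),
    Layer(3, "quality enhancement of layer 2", depends_on=[2]),
]
```

In this hypothetical stream, a request for layer 3 yields layers 0, 2 and 3, while the temporal enhancement in layer 1 can be discarded.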
Scalable Video Coding (SVC) is an example of scalable coding of video. The latest draft of the SVC standard is described in JVT-S202, “Joint Scalable Video Model JSVM-6: Joint Draft 6 with proposed changes,” 19th Meeting, Geneva, Switzerland, April 2006, incorporated herein by reference in its entirety.
In multi-view video coding (MVC), video sequences output from different cameras, each corresponding to a view, are encoded into one bitstream. After decoding, to display a certain view, the decoded pictures belonging to that view are displayed. The latest draft of the MVC standard is described in JVT-T208, “Joint multiview video model (JMVM 1.0),” 20th JVT meeting, Klagenfurt, Austria, July 2006, incorporated herein by reference in its entirety.
In multiple description coding (MDC), an input media sequence is encoded into more than one sub-stream, each of which is referred to as a description. Each description is independently decodable and represents a certain media quality. However, based on the decoding of one or more descriptions, the additional decoding of another description can result in an improved media quality. MDC is discussed in detail in Y. Wang, A. Reibman, and S. Lin, “Multiple description coding for video delivery,” Proceedings of the IEEE, vol. 93, no. 1, January 2005, incorporated herein by reference in its entirety.
The file format is an important element in the chain of multimedia content production, manipulation, transmission and consumption. There is a difference between the coding format and the file format. The coding format relates to the action of a specific coding algorithm that codes the content information into a bitstream. In contrast, the file format comprises a system of organizing the generated bitstream in such a way that it can be accessed for local decoding and playback, transferred as a file, or streamed, all utilizing a variety of storage and transport architectures. Further, the file format can facilitate the interchange and editing of the media. For example, many streaming applications require a pre-encoded bitstream on a server to be accompanied by metadata, stored in the “hint-tracks,” that assists the server in streaming the video to the client. Examples of information that can be included in hint-track metadata are timing information, indications of synchronization points, and packetization hints. This information is used to reduce the operational load of the server and to maximize the end-user experience.
Available media file format standards include the ISO file format (ISO/IEC 14496-12), the MPEG-4 file format (ISO/IEC 14496-14), the AVC file format (ISO/IEC 14496-15) and the 3GPP file format (3GPP TS 26.244). There is also a project in MPEG for development of the SVC file format, which will become an amendment to the AVC file format. In a parallel effort, MPEG is defining a hint track format for FLUTE (file delivery over unidirectional transport) sessions.
The ISO file format is the base for the derivation of all the other above-referenced file formats. All of these file formats, including the ISO file format, are referred to as the ISO family of file formats. According to the ISO family of file formats, each file, hierarchically structured, contains exactly one movie box which may contain one or more tracks, and each track resides in one track box. It is possible for more than one track to store information of a certain media type. A subset of these tracks may form an alternate track group, wherein each track is independently decodable and can be selected for playback or transmission, and wherein only one track in an alternate track group should be selected for playback or transmission.
All tracks in an alternate group are candidates for media selection. However, it may not make sense to switch between some of those tracks during a session. For example, one may allow switching between video tracks at different bit rates and keep the frame size, but not allow switching between tracks of different frame sizes. In the same manner, one may want to enable selection (but not switching) between tracks of different video codecs or different audio languages. The distinction between tracks for selection and switching is addressed by introducing a sub-group structure referred to as switch groups. All tracks in an alternate group are candidates for media selection, whereas tracks in a switch (sub)group are also available for switching during a session. Different switch groups represent different operation points, such as different frame size, high/low quality, etc.
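The distinction between selection and switching can be sketched in Python as follows. The `Track` record, the function names and the example tracks are hypothetical; the only assumption carried over from the file format is that a group identifier of 0 means that the track belongs to no group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    track_id: int
    alternate_group: int  # 0 = not part of any alternate group
    switch_group: int     # 0 = not part of any switch group
    label: str


def selectable(tracks, alternate_group):
    """Tracks that are candidates for selection; only one should be played."""
    return [t for t in tracks if t.alternate_group == alternate_group]


def switchable(tracks, current):
    """Tracks that a running session may switch to from `current`."""
    if current.switch_group == 0:
        return []
    return [
        t for t in tracks
        if t.alternate_group == current.alternate_group
        and t.switch_group == current.switch_group
        and t.track_id != current.track_id
    ]


# Hypothetical alternate group 1 with two switch groups, one per frame size.
tracks = [
    Track(1, 1, 1, "QCIF, 64 kbit/s"),
    Track(2, 1, 1, "QCIF, 128 kbit/s"),
    Track(3, 1, 2, "CIF, 256 kbit/s"),
    Track(4, 1, 2, "CIF, 512 kbit/s"),
]
```

Here all four tracks are candidates for selection, but a session that starts on track 1 may only switch to track 2, which has the same frame size.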
The ISO file format supports hint tracks that provide cookbook instructions for encapsulating data into transmission packets and for transmitting the formed packets according to certain timestamps. The hint track mechanism can be used by servers, such as streaming servers, for real-time audio-visual data. The cookbook instructions may contain guidance for packet header construction and include packet payload construction. In the packet payload construction, data residing in other tracks or items may be referenced, i.e., a reference indicates which piece of data in a particular track or item is to be copied into a packet during the packet construction process. The hint track mechanism is extensible to any transport protocol. Currently, the hint track format for the Real-Time Transport Protocol (RTP, IETF RFC 3550 (www.ietf.org/rfc/rfc3550.txt) (incorporated herein by reference in its entirety)) is specified, and the hint track format for file delivery protocols over uni-directional channels, such as FLUTE (IETF RFC 3926 (www.ietf.org/rfc/rfc3926.txt) (incorporated herein by reference in its entirety)) and ALC (IETF RFC 3450 (www.ietf.org/rfc/rfc3450.txt) (incorporated herein by reference in its entirety)), is undergoing the standardization process.
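The packet construction process described above can be sketched in simplified form. The classes `Immediate`, `SampleRef` and `PacketHint` are hypothetical stand-ins for the instructions stored in a hint sample and do not reproduce the syntax of the RTP hint track format; the header bytes in the example are placeholders rather than a valid protocol header.

```python
from dataclasses import dataclass


@dataclass
class Immediate:
    data: bytes        # bytes stored directly in the hint sample


@dataclass
class SampleRef:
    track_id: int      # track holding the referenced media data
    sample_index: int  # which sample of that track
    offset: int        # first byte to copy
    length: int        # number of bytes to copy


@dataclass
class PacketHint:
    send_time: float   # transmission time relative to session start, in seconds
    header: bytes      # pre-built packet header bytes (placeholder here)
    payload: list      # ordered Immediate / SampleRef instructions


def build_packet(hint, media):
    """Assemble one packet; `media` maps track_id to a list of sample bytes."""
    out = bytearray(hint.header)
    for step in hint.payload:
        if isinstance(step, Immediate):
            out += step.data
        else:
            sample = media[step.track_id][step.sample_index]
            out += sample[step.offset:step.offset + step.length]
    return hint.send_time, bytes(out)


# Hypothetical media track 1 with two samples, and one hint that copies
# four bytes out of the first sample.
media = {1: [b"AAAABBBBCCCC", b"DDDDEEEE"]}
hint = PacketHint(0.04, b"\x80\x60", [Immediate(b"\x01"), SampleRef(1, 0, 4, 4)])
```

Because the hint only references the media bytes, the media data itself is stored once, and the server does not need to understand the coding format in order to form the packets.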
As discussed above, the ISO family of file formats supports hint tracks. The draft SVC file format supports a data structure referred to as an extractor. An extractor is similar to a hint sample but is not specific to any transport protocol. An extractor references a subset of the data of a media sample, where the referenced data corresponds to the data needed in that sample for the decoding and playback of a certain scalable layer.
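A minimal sketch of how an extractor might reference the data of one scalable layer is given below. The `Extractor` fields and both helper functions are hypothetical and do not follow the draft SVC file format syntax. The sketch further assumes that a media sample stores its layers contiguously, lowest layer first, which need not hold in practice.

```python
from dataclasses import dataclass


@dataclass
class Extractor:
    sample_index: int  # media sample the extractor points into
    offset: int        # first referenced byte within that sample
    length: int        # number of referenced bytes


def extractor_for_layer(sample_index, layer_sizes, target_layer):
    """Reference the bytes of layers 0..target_layer of one media sample.

    Assumption: layer_sizes[i] is the byte count of layer i, and the layers
    are stored back to back starting with the base layer.
    """
    return Extractor(sample_index, 0, sum(layer_sizes[:target_layer + 1]))


def resolve(extractor, media_samples):
    """Return the bytes of the media sample that the extractor references."""
    sample = media_samples[extractor.sample_index]
    return sample[extractor.offset:extractor.offset + extractor.length]


# Hypothetical sample made of a base layer and two enhancement layers.
media_samples = [b"BASE" + b"ENH1" + b"ENHANCE2"]
layer_sizes = [4, 4, 8]
```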
For multicast applications with scalable media streams, information of a scalable stream may be stored in different tracks, with each track corresponding to a scalable layer or a number of contiguous layers. These tracks can be hint tracks as well as extractor tracks. As discussed herein, an “extractor track” refers to a track that contains extractors and possibly also non-extractor samples, e.g., media data units. This way, if layered multicast is applied, the sub-stream in each track can be sent in its own Real-time Transport Protocol (RTP) session, and a receiver subscribes to a number of the RTP sessions containing the desired scalable layer and the lower required layers. These tracks are referred to as a layered track group. Tracks in a layered track group together form an independently decodable scalable stream, while decoding of the sub-stream corresponding to each individual track in a layered track group may depend on sub-streams corresponding to other tracks. The above also applies to multi-view video streams, where each view is considered as a scalable layer. Similarly, for an MDC stream, information of each sub-stream or description may also be stored in its own track. These tracks corresponding to all of the descriptions of an MDC stream are referred to as an MDC track group.
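The difference between a layered track group and an MDC track group can be illustrated by asking which of the received tracks a receiver can actually decode. The function name, the `kind` strings and the dependency map below are hypothetical and serve only as a sketch of the two rules stated above.

```python
def usable_tracks(kind, received, depends_on=None):
    """Return the subset of received tracks that can be decoded.

    kind:       "layered" or "mdc" (hypothetical labels).
    received:   set of track ids whose sub-streams were received.
    depends_on: for layered groups, maps a track id to the set of track ids
                it directly requires.
    """
    if kind == "mdc":
        # Every description is independently decodable.
        return set(received)
    depends_on = depends_on or {}
    usable = set()
    changed = True
    while changed:
        changed = False
        for track in received:
            if track not in usable and depends_on.get(track, set()) <= usable:
                usable.add(track)
                changed = True
    return usable


# Hypothetical layered track group: track 2 requires 1, track 3 requires 2.
layer_deps = {1: set(), 2: {1}, 3: {2}}
```

A receiver that subscribes to the sessions of tracks 1 and 3 only can decode track 1 alone, because the data of track 2 is missing; in an MDC track group the same subscription would leave both received descriptions decodable.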
FLUTE, which is discussed in IETF Request for Comments (RFC) No. 3926 (www.ietf.org/rfc/rfc3926.txt) and is incorporated herein by reference in its entirety, has been widely adopted as the file delivery protocol for multicast and broadcast applications. FLUTE is based on the asynchronous layered coding (ALC) protocol, which is discussed in the IETF RFC 3450 (www.ietf.org/rfc/rfc3450.txt), and the layered coding transport (LCT) protocol, which is discussed in the IETF RFC 3451 (www.ietf.org/rfc/rfc3451.txt). FLUTE inherits all of the functionalities defined in the ALC and LCT protocols, both of which are incorporated herein by reference in their entireties. LCT defines the notion of LCT channels to allow for massive scalability. The LCT scalability has been designed based on the Receiver-driven Layered Multicast (RLM) principle, where receivers are responsible for implementing an appropriate congestion control algorithm based on the adding and removing of layers of the delivered data. The sender sends the data into different layers, with each being addressed to a different multicast group.
In LCT, one or multiple channels may be used for the delivery of the files of a FLUTE session. Great flexibility is given to the FLUTE sender with regard to how the data is partitioned among the LCT channels. A common use case is to send the same data on all different LCT channels but at different bitrates. Additionally, the FLUTE sender may act intelligently to enable receivers to acquire all files of the FLUTE session by joining all channels for a shorter time than is normally required with one channel. In such a case, the data sent over each channel complements the data of other channels.
Information about the LCT channels of a FLUTE session, as well as how data is split between the different channels, is currently being defined in the FLUTE hint track specification. Such information will help the FLUTE server to select the right channels and merge them into the appropriate FLUTE session according to the target application. If a set of FLUTE hint tracks are independent from each other and only one of those hint tracks is intended to be processed during a FLUTE session, then they belong to an alternate track group, and at least a subset of them belong to one switching track group. If a set of FLUTE hint tracks are complementary to each other, they belong to a layered track group.
Current file format designs do not support the signaling of layered or MDC track groups. In addition, the current signaling of alternate or switching track groups is to include an alternate or switching group ID in a track-level data structure (a track header box for alternate track groups, and a track selection box in a track-level user data box for switching track groups). This entails the parsing of all of the tracks in a movie box in order to obtain the information of alternate or switching track groups. If the number of tracks is large, the parsing complexity is non-trivial.
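The parsing burden can be made concrete with a small Python sketch that collects alternate track groups from a file. It assumes the track header box layout of ISO/IEC 14496-12 (the 16-bit `alternate_group` field follows the `layer` field), handles only 32-bit box sizes, and ignores switching groups. The helpers `box` and `tkhd_v0` are hypothetical and exist only to build a synthetic file for experimentation.

```python
import struct
from collections import defaultdict


def iter_boxes(buf, start, end):
    """Yield (type, payload_start, payload_end) for each box in buf[start:end].

    Assumption: 32-bit box sizes only (no 64-bit 'largesize', no size 0).
    """
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", buf, pos)
        if size < 8 or pos + size > end:
            break  # malformed or unsupported box; stop rather than guess
        yield box_type, pos + 8, pos + size
        pos += size


def read_tkhd(buf, start):
    """Return (track_ID, alternate_group) from a track header box payload."""
    version = buf[start]
    body = start + 4  # skip version (1 byte) and flags (3 bytes)
    if version == 1:  # 64-bit times and duration
        track_id_off, alt_off = body + 16, body + 42
    else:             # 32-bit times and duration
        track_id_off, alt_off = body + 8, body + 30
    (track_id,) = struct.unpack_from(">I", buf, track_id_off)
    (alt_group,) = struct.unpack_from(">h", buf, alt_off)
    return track_id, alt_group


def alternate_groups(buf):
    """Map each non-zero alternate group ID to the track IDs it contains."""
    groups = defaultdict(list)
    for box_type, s, e in iter_boxes(buf, 0, len(buf)):
        if box_type != b"moov":
            continue
        for trak_type, ts, te in iter_boxes(buf, s, e):  # every track is visited
            if trak_type != b"trak":
                continue
            for child_type, cs, _ in iter_boxes(buf, ts, te):
                if child_type == b"tkhd":
                    track_id, alt = read_tkhd(buf, cs)
                    if alt != 0:  # 0 means the track is in no alternate group
                        groups[alt].append(track_id)
    return dict(groups)


def box(box_type, payload):
    """Hypothetical helper: wrap a payload in a box with a 32-bit size."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


def tkhd_v0(track_id, alt_group):
    """Hypothetical helper: version 0 track header box, unused fields zeroed."""
    payload = struct.pack(">6I8x3h2x36x2I",
                          0, 0, 0, track_id, 0, 0,  # version/flags, times, ID, reserved, duration
                          0, alt_group, 0,          # layer, alternate_group, volume
                          0, 0)                     # width, height
    return box(b"tkhd", payload)
```

Even in this reduced form, the group membership becomes known only after every track box in the movie box has been located and its track header decoded, so the cost grows with the number of tracks.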