This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
The multimedia container file format is an important element in the chain of multimedia content production, manipulation, transmission and consumption. In this context, the coding format (i.e., the elementary stream format) relates to the action of a specific coding algorithm that codes the content information into a bitstream. The container file format comprises mechanisms for organizing the generated bitstream in such a way that it can be accessed for local decoding and playback, transferring as a file, or streaming, all utilizing a variety of storage and transport architectures. The container file format can also facilitate the interchanging and editing of the media, as well as the recording of received real-time streams to a file. As such, there are substantial differences between the coding format and the container file format.
The hierarchy of multimedia file formats is depicted generally at 1000 in FIG. 1. The elementary stream format 1100 represents an independent, single stream. Audio files such as .amr and .aac files are constructed according to the elementary stream format. The container file format 1200 is a format which may contain both audio and video streams in a single file. An example of a family of container file formats 1200 is based on the ISO base media file format. Just below the container file format 1200 in the hierarchy 1000 is the multiplexing format 1300. The multiplexing format 1300 is typically less flexible and more tightly packed than an audio/video (AV) file constructed according to the container file format 1200. Files constructed according to the multiplexing format 1300 are typically used for playback purposes only. A Moving Picture Experts Group (MPEG)-2 program stream is an example of a stream constructed according to the multiplexing format 1300. The presentation language format 1400 is used for purposes such as layout, interactivity, the synchronization of AV and discrete media, etc. Synchronized Multimedia Integration Language (SMIL) and Scalable Vector Graphics (SVG), both specified by the World Wide Web Consortium (W3C), are examples of a presentation language format 1400. The presentation file format 1500 is characterized by having all parts of a presentation in the same file. Examples of objects constructed according to a presentation file format are PowerPoint files and files conforming to the extended presentation profile of the 3GP file format.
Available media and container file format standards include the ISO base media file format (ISO/IEC 14496-12), the MPEG-4 file format (ISO/IEC 14496-14, also known as the MP4 format), Advanced Video Coding (AVC) file format (ISO/IEC 14496-15) and the 3GPP file format (3GPP TS 26.244, also known as the 3GP format). There is also a project in MPEG for development of the scalable video coding (SVC) file format, which will become an amendment to advanced video coding (AVC) file format. In a parallel effort, MPEG is defining a hint track format for file delivery over unidirectional transport (FLUTE) and asynchronous layered coding (ALC) sessions, which will become an amendment to the ISO base media file format.
The Digital Video Broadcasting (DVB) organization is currently in the process of specifying the DVB file format. The primary purpose of defining the DVB file format is to ease content interoperability between implementations of DVB technologies, such as set-top boxes according to current (DVB-T, DVB-C, DVB-S) and future DVB standards, Internet Protocol (IP) television receivers, and mobile television receivers according to DVB-Handheld (DVB-H) and its future evolutions. The DVB file format will allow the exchange of recorded (read-only) media between devices from different manufacturers, the exchange of content using USB mass memories or similar read/write devices, and shared access to common disk storage on a home network, as well as other functionalities. The ISO base media file format is currently the strongest candidate as the basis for the development of the DVB file format. The ISO file format is the basis for the derivation of all the above-referenced container file formats (excluding the ISO file format itself). These file formats (including the ISO file format itself) are referred to as the ISO family of file formats.
The basic building block in the ISO base media file format is called a box. Each box includes a header and a payload. The box header indicates the type of the box and the size of the box in terms of bytes. A box may enclose other boxes, and the ISO file format specifies which box types are allowed within a box of a certain type. Furthermore, some boxes are mandatorily present in each file, while other boxes are simply optional. Moreover, for some box types, there can be more than one box present in a file. Therefore, the ISO base media file format essentially specifies a hierarchical structure of boxes.
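The box hierarchy described above can be illustrated with a minimal parsing sketch. It assumes the common header layout of a 32-bit big-endian size followed by a four-character type; the handling of size values 1 (a 64-bit size follows) and 0 (the box runs to the end of its container) reflects the ISO base media file format as generally understood and is not stated in the text above. The function names `iter_boxes` and `make_box` are hypothetical helpers introduced only for illustration.

```python
import struct


def iter_boxes(data, offset=0, end=None):
    """Yield (box_type, payload_start, payload_end) for each box in data[offset:end]."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        # Assumed header layout: 32-bit big-endian size, then 4-byte type.
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # Assumption: a 64-bit size follows the type field.
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # Assumption: the box extends to the end of its container.
            size = end - offset
        if size < header or offset + size > end:
            raise ValueError("malformed box at offset %d" % offset)
        yield box_type.decode("ascii"), offset + header, offset + size
        offset += size


def make_box(box_type, payload=b""):
    """Build a box with a compact 8-byte header (illustrative helper)."""
    return struct.pack(">I4s", 8 + len(payload), box_type.encode("ascii")) + payload


# A toy file: a movie box enclosing two track boxes, then a media data box.
sample = make_box("moov", make_box("trak") + make_box("trak")) + make_box("mdat", b"\x00" * 4)

for box_type, start, stop in iter_boxes(sample):
    children = [t for t, _, _ in iter_boxes(sample, start, stop)] if box_type == "moov" else []
    print(box_type, stop - start, children)
```

Because a box payload may itself be a sequence of boxes, the same function is applied recursively to the payload range of a container box such as the movie box. Which box types may nest inside which others is fixed by the file format specification and is not checked here.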
FIG. 2 shows a simplified file structure according to the ISO base media file format. According to the ISO family of file formats, a file 200 includes media data and metadata that are enclosed in separate boxes, the media data (mdat) box 210 and the movie (moov) box 220, respectively. For a file to be operable, both of these boxes must be present. The media data box 210 contains video and audio frames, which may be interleaved and time-ordered. The movie box 220 may contain one or more tracks, and each track resides in one track box 240. For the presentation of one media type, typically one track is selected.
It should be noted that the ISO base media file format does not limit a presentation to be contained in only one file. In fact, a presentation may be contained in several files. In this scenario, one file contains the metadata for the whole presentation. This file may also contain all of the media data, in which case the presentation is self-contained. The other files, if used, are not required to be formatted according to the ISO base media file format. The other files are used to contain media data, and they may also contain unused media data or other information. The ISO base media file format is concerned with only the structure of the file containing the metadata. The format of the media-data files is constrained by the ISO base media file format or its derivative formats only in that the media-data in the media files must be formatted as specified in the ISO base media file format or its derivative formats.
In addition to timed tracks, ISO files can contain any non-timed binary objects in a meta box. The meta box can reside at the top level of the file, within a movie box 220, and within a track box 240, but at most one meta box may occur at each of the file level, the movie level, or the track level. The meta box is required to contain a ‘hdlr’ box, indicating the structure or format of the ‘meta’ box contents. The meta box may contain any number of binary items that can be referred to, and each one of the binary items can be associated with a file name.
A file may be compatible with more than one format in the ISO family of file formats, and it is therefore not always possible to speak in terms of a single “type” or “brand” for the file. All ISO files contain a file type box indicating which file format specifies the “best use” of the file and also a set of other specifications with which the file complies. The format that is the “best use” of the file is referred to as the major brand of the file, while the other compatible formats are referred to as compatible brands.
The presence of a brand in the list of the compatible brands of the file type box constitutes both a claim and a permission. The presence is a claim in that the file conforms to all the requirements of that brand, and the presence also represents a permission to a reader implementing potentially only that brand to read the file. In general, readers are required to implement all features documented for a brand unless one of the following applies:
1. The media the readers are using do not use or require a feature. For example, I-frame video does not require a sync sample table, and if composition re-ordering is not used, then no composition time offset table is needed. Similarly, if content protection is not needed, then support for the structures of content protection is not required.
2. Another specification with which the file is conformant forbids the use of a feature. For example, some derived specifications explicitly forbid the use of movie fragments.
3. The context in which the product operates means that some structures are not relevant. For example, hint track structures are only relevant to products preparing content for, or performing, file delivery (such as streaming) for the protocol in the hint track.
File readers implementing a certain brand should attempt to read files that are marked as compatible with that brand.
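The brand mechanism can be sketched as follows. The sketch assumes that the payload of the file type box consists of a four-character major brand, a 32-bit minor version, and then a list of four-character compatible brands; the minor version field is not mentioned in the text above and is included as an assumption about the layout. The function names and the example brand values are illustrative only.

```python
import struct


def parse_ftyp_payload(payload):
    """Split a file type box payload into (major_brand, minor_version, compatible_brands)."""
    if len(payload) < 8 or len(payload) % 4 != 0:
        raise ValueError("unexpected file type box payload length")
    major = payload[0:4].decode("ascii")
    # Assumed layout: a 32-bit big-endian minor version follows the major brand.
    (minor_version,) = struct.unpack_from(">I", payload, 4)
    compatible = [payload[i:i + 4].decode("ascii") for i in range(8, len(payload), 4)]
    return major, minor_version, compatible


def reader_should_attempt(payload, implemented_brands):
    """True if any brand the reader implements is listed as a compatible brand."""
    _, _, compatible = parse_ftyp_payload(payload)
    return any(brand in implemented_brands for brand in compatible)


# Example payload: major brand '3gp6', minor version 0, compatible with '3gp6' and 'isom'.
payload = b"3gp6" + struct.pack(">I", 0) + b"3gp6" + b"isom"
print(parse_ftyp_payload(payload))
print(reader_should_attempt(payload, {"isom"}))
print(reader_should_attempt(payload, {"mp42"}))
```

The check looks only at the compatible-brands list, since that list is what the text describes as constituting the permission to read. Whether a reader should also accept a file on the strength of its major brand alone is a design choice left open here.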
A hint track is a special track which usually does not contain media data. Instead, a hint track contains instructions for packaging one or more tracks for delivery over a certain communication protocol. The process of sending packets is time-based, substantially identical to the display of time-based data, and is therefore suitably described by a track. Due to the presence of hint tracks, the operational load of a sender can be reduced, and the implementation of a sender can be simplified compared to a sender constructing protocol data units from media samples without any hints.
The ISO base media file format contains the hint track definition for the Real-time Transport Protocol (RTP) and the Secure Real-time Transport Protocol (SRTP), and an upcoming Amendment 2 of the ISO base media file format will contain the hint track definition for the FLUTE and ALC protocols. A hint track format for MPEG-2 transport stream (TS) may also be specified, e.g., as part of the DVB File Format.
The mdat box depicted in FIG. 2 contains samples for the tracks. In non-hint tracks, a sample is an individual frame of video, a time-contiguous series of video frames, or a time-contiguous compressed section of audio. In hint tracks, a sample defines the formation of one or more packets formatted according to the communication protocol identified in the header of the hint track.
Hint tracks inherit all of the features of regular media tracks, such as timing of the samples and indication of synchronization samples. Hint samples contain instructions to assist a sender to compose packets for transmission. These instructions may contain immediate data to send (e.g., header information) or reference segments of the media data. In other words, the media samples in media tracks do not need to be copied into the samples of the hint tracks, but rather the hint samples point to the samples of the media tracks. Therefore, the media data itself does not need to be reformatted in any way. This approach is more space-efficient than an approach that requires media information to be partitioned into the actual data units that will be transmitted for a given transport and media format. Under such an approach, local playback requires either re-assembling the media from the packets or having two copies of the media—one for local playback and one for transport. Similarly, the transmission of such media over multiple protocols using this approach requires multiple copies of the media data for each delivery protocol. This is inefficient with space unless the media data has been heavily transformed for transport (e.g., by the application of error-correcting coding techniques or by encryption).
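The idea that hint samples point into media samples, rather than copying them, can be illustrated with a simplified model. The classes `Immediate` and `MediaRef` and the function `build_packet` are hypothetical and do not reproduce the actual hint sample syntax of any hint track format; they only show how a sender might assemble a packet from immediate data plus references to byte ranges of media samples.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Immediate:
    """Bytes carried directly in the hint sample (e.g., header information)."""
    data: bytes


@dataclass
class MediaRef:
    """Reference to a byte range of one sample in a media track (hypothetical layout)."""
    track_id: int
    sample_index: int
    offset: int
    length: int


Constructor = Union[Immediate, MediaRef]


def build_packet(constructors: List[Constructor], media_tracks: Dict[int, List[bytes]]) -> bytes:
    """Assemble one packet by following the instructions of a hint sample."""
    out = bytearray()
    for c in constructors:
        if isinstance(c, Immediate):
            out += c.data
        else:
            sample = media_tracks[c.track_id][c.sample_index]
            chunk = sample[c.offset:c.offset + c.length]
            if len(chunk) != c.length:
                raise ValueError("reference runs past the end of the media sample")
            out += chunk
    return bytes(out)


# The media track stores each sample once; the hint sample only points into it.
media_tracks = {1: [b"ABCDEFGH"]}
hint_sample = [Immediate(b"\x80\x60"), MediaRef(track_id=1, sample_index=0, offset=2, length=4)]
print(build_packet(hint_sample, media_tracks))
```

In this model a second hint track for another protocol would add only a further list of small constructors, while the media bytes remain stored once in the media track.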
If an ISO file contains hint tracks, the media tracks that reference the media data from which the hints were built remain in the file, even if the data within them is not directly referenced by the hint tracks. After deleting all hint tracks, the entire un-hinted presentation remains.
FIG. 3 is a representation of a general video communications system. Because uncompressed video requires a huge bandwidth, input video 300 is compressed by a source coder 305 to a desired bit rate. The source coder 305 can be divided into two components: a waveform coder 310 and an entropy coder 315. The waveform coder 310 performs lossy video signal compression, while the entropy coder 315 converts the output of the waveform coder 310 into a binary sequence losslessly. A transport coder 320 encapsulates the compressed video according to the transport protocols in use by interleaving and modulating the data, for example. The data is transmitted to the receiver side via a transmission channel 325. The receiver performs inverse operations to obtain a reconstructed video signal for display. The inverse operations include the use of a transport decoder 330 and a source decoder 335 which can be divided into an entropy decoder 340 and a waveform decoder 345, ultimately resulting in output video 350.
Most real-world channels are susceptible to transmission errors. Transmission errors can be roughly classified into two categories: bit errors and erasure errors. Bit errors are caused by physical events occurring in the transmission channel, such as noise and interference. Protocol stacks for real-time media transport typically provide mechanisms such as cyclic redundancy check (CRC) codes for detecting bit errors. It is a common practice to discard erroneous protocol payloads in the transport decoder. The challenges in decoding erroneous video data lie in the likelihood of bursty bit errors, the exact detection of the position of the error, and the variable length coding (VLC) used by the entropy coder. Due to the burstiness of bit errors, it is likely that a large portion of a protocol payload would be non-decodable anyway, and therefore discarding the entire protocol payload discards little data that could otherwise have been used. The error detection mechanisms provided by the communication protocols are typically able to yield only a binary conclusion: either the packet is corrupted or it is correct. It is therefore up to source coding layer mechanisms to determine the exact location of errors. Even though there are methods based on syntactic and semantic violations and unnatural texture disruptions for detecting the location of errors, the false detection of bit errors may lead to subjectively annoying video. Due to variable length coding, a single bit error is likely to change the interpretation of the codeword in which it occurs and cause a loss of synchronization of subsequent codewords. Even if codeword synchronization were re-established, it might not be possible to determine the spatial or temporal location of decoded data.
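The binary nature of CRC-based error detection can be shown with a short sketch. It uses the CRC-32 function from the Python standard library as a stand-in for whatever check a given protocol stack actually uses, and the framing (payload followed by a 4-byte checksum) and the function names are assumptions made only for illustration.

```python
import struct
import zlib


def protect(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload (illustrative framing, not a real protocol)."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def receive(frame: bytes):
    """Return the payload if the CRC matches, or None if the frame is discarded."""
    if len(frame) < 4:
        return None
    payload = frame[:-4]
    (crc,) = struct.unpack(">I", frame[-4:])
    # The check says only whether the frame is intact, not where an error lies.
    return payload if zlib.crc32(payload) == crc else None


frame = protect(b"coded video slice")
corrupted = bytearray(frame)
corrupted[3] ^= 0x01  # flip a single bit in the payload

print(receive(frame))
print(receive(bytes(corrupted)))
```

The receiver learns only that the second frame is corrupted, so the whole payload is dropped. The bit error has in effect been converted into an erasure.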
In terms of erasure errors, there are two primary sources of such errors. First, queue overflows in congested network elements, such as routers, cause packet losses. Second, the transport decoder typically processes bit errors by removing the entire packets in which the bit errors occurred.
In general, introduced transmission errors should first be detected and then corrected or concealed by the receiver. As explained above, bit errors are typically detected using CRC or similar codes and corrupted packets are discarded. Communication protocols for real-time media transport typically attach a sequence number that is incremented by one for each transmitted packet, and therefore packet losses can be detected from a gap in the sequence number values of consecutive packets. Error correction refers to the capability to recover the erroneous data perfectly, as if no errors had been introduced in the first place. Error concealment refers to the capability to conceal the impacts of transmission errors so that they are hardly visible in the reconstructed video. Typically, some amount of redundancy is added to source or transport coding in order to help in error detection, correction and concealment.
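Loss detection from sequence-number gaps can be sketched as below. The sketch assumes a 16-bit sequence number that wraps around to zero, as in RTP, and further assumes that packets arrive in order without duplication; reordered packets would need additional handling that is not shown. The function name is hypothetical.

```python
def missing_sequence_numbers(received, modulus=1 << 16):
    """List the sequence numbers skipped between consecutive received packets.

    Assumes in-order arrival and a counter that wraps at `modulus`.
    """
    missing = []
    for prev, cur in zip(received, received[1:]):
        gap = (cur - prev) % modulus  # a gap of 1 means nothing was lost
        for k in range(1, gap):
            missing.append((prev + k) % modulus)
    return missing


print(missing_sequence_numbers([10, 11, 14]))
print(missing_sequence_numbers([65534, 65535, 1, 2]))
```

The modular subtraction keeps the wrap from 65535 to 0 from being mistaken for a large loss. In the second call only sequence number 0 is reported missing.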
Error correction and concealment techniques can be roughly classified into three categories: forward error concealment, error concealment by postprocessing, and interactive error concealment. Forward error concealment refers to those techniques in which the transmitter side adds redundancy to the transmitted data so that the receiver can easily recover the transmitted data even if there were transmission errors. Error concealment by postprocessing is entirely receiver-oriented; these methods attempt to estimate the correct representation of erroneously received data. In interactive error concealment, the transmitter and the receiver co-operate in order to minimize the effect of transmission errors, and these methods rely heavily on the feedback information given by the receiver. Error concealment by postprocessing is also referred to as passive error concealment, while the other two categories represent forms of active error concealment.
An orthogonal classification of error correction and concealment algorithms, compared to the categorization introduced above, is based on the protocol stack layer in which the algorithm in question operates. Methods in the physical layer may, for example, use modulation intelligently or interleave data bits to be transmitted. In the link layer, erroneously received data blocks may be selectively retransmitted, for instance. In general, the methods involving the source coder or the source decoder are referred to as media-aware error correction and concealment algorithms, while methods that operate solely in the transport coder and decoder are media-independent. Methods requiring the interoperation of several protocol stack layers fall into the category of cross-layer optimization algorithms. The term “joint source-channel coding” is used when source and transport coding operate seamlessly to tackle transmission errors as a joint effort.
For many real-time multimedia communication applications, it is desirable to not have a multimedia file transmitted as a file, but instead have the media data encapsulated into packets of a communication protocol. Furthermore, it is desirable for existing media players to be capable of parsing, decoding, and playing any multimedia file that is generated from received media streams. If any recorded multimedia file can be played by existing media players, the media players do not have to be updated or changed.
Most, if not all, container file formats are targeted for the playing of error-free files that are reliably transferred to the playing device and/or for providing media content for transmission in streaming servers or other sending devices. Consequently, the container file formats do not provide mechanisms for indicating transmission errors, and it is not guaranteed that existing players would be able to cope with erroneous media streams gracefully. Instead, such players may crash or otherwise behave in unexpected ways. It would therefore be desirable for files generated from received media streams to be playable with existing media players and to be compatible with existing file formats. Furthermore, it would be desirable for sophisticated players and decoders to include mechanisms for efficiently concealing transmission errors from received streams that are recorded to a file.
There have been a number of conventional approaches for addressing at least some of the issues identified above. In a first approach, the received transport stream is included as such in the file, or the transport stream is stored in a separate file, and the separate file is referred to from the presentation file (i.e., the file containing the metadata). In this arrangement, the transport stream refers to the lowest protocol stack layer that is considered relevant in the application. For RTP-based media transmission, the transport stream typically refers to a stream of RTP packets. When elementary media streams are encapsulated to an MPEG-2 transport stream (as in DVB-T, DVB-C, and DVB-S), the transport stream refers to the MPEG-2 transport stream. In the ISO base media file format structure, the transport stream can be included as a single sample into the media track. This is how MPEG-2 transport streams are included in QuickTime files. Metadata specific to the transport stream may be stored in a new structure of the file format; in the ISO base media file format, the structure may reside in the meta box.
In a second approach, the received transport stream is converted to elementary data tracks. Metadata specific to the transport stream is stored in a new structure of the file format; in the ISO base media file format, the structure resides in the meta box.
In a third approach, received transport packets of a stream are written as such to a hint track of the file that is recorded. However, the use of a hint track is not a valid solution logically, as hint tracks provide packetization instructions for a server or, more generally, for a sender. Moreover, a recorded hint track may not provide a valid stream to be re-sent. For example, RTP sequence numbers are required to be continuous in a transmitted stream, but in a recorded stream a missing packet causes a discontinuity in RTP sequence numbers.
The fact that the moov box can be completed only after all of the media data is received makes continuous recording to a single file impossible in the second and third approaches discussed above. This problem can be avoided when the movie fragment feature is used to segment the recorded file as described in U.S. patent application Ser. No. 11/292,786, filed Dec. 1, 2005. Alternatively, the media data of the received streams can be recorded to files separate from the file containing the metadata. However, if simultaneous time-shifted playback of the file being recorded is desired, then movie fragments as described in U.S. patent application Ser. No. 11/292,786 should be used.