There are various file structures used today to store time-based media: audio formats such as AIFF, video formats such as AVI, and streaming formats such as RealMedia. One reason these file structures differ is their differing focus and applicability. Some of these formats are sufficiently widely accepted, broad in their application, and simple enough to implement that they may be used not only for content delivery but also as interchange formats. Foremost among these general formats is the QuickTime file format. It is used today in the majority of web sites serving time-based data; in the majority of authoring environments, including professional ones; and on the majority of multimedia CD-ROM titles.
The QuickTime media layer supports the efficient display and management of general multimedia data, with an emphasis on time-based material (video, audio, etc.). The media layer uses the QuickTime file format as the storage and interchange format for media information. The architectural capabilities of the layer are generally broader than the existing implementations, and the file format is capable of representing more information than is currently demanded by the existing QuickTime implementations.
In contrast to formats such as AVI, which were generally designed to support local random access of synchronized media, QuickTime allows systems to manage the data, relationships and timing of a general multimedia presentation. In particular, the QuickTime file format has structures to represent the temporal behavior of general time-based streams, a concept which covers the time-based emission of network packets, as well as the time-based local presentation of multimedia data.
The existing QuickTime file format is publicly described by Apple Computer in the May 1996 file format specification, which may be found at the QuickTime site, <http://www.apple.com/quicktime>.
One aspect of the QuickTime file format is the concept that the physical structure of media data (the layout in disk records) is independent of, and described by, a logical structure for the file. The file is fully described by a set of “movie” meta-data. This meta-data provides declarative, structural and temporal information about the actual media data.
The media data may be in the same file as the description data (the “movie” meta-data), or in other files. A movie structured into one file is commonly called “flat”, and is self-contained. Non-flat movies can be structured to reference some, or all, of the media data in other files.
As such, the format is generally suited for optimization in different applications. For example, when editing (compositing), data need not be rewritten as edits are applied and media is re-ordered; the meta-data file may be extended and temporal mapping information adjusted. When edits are complete, the relevant media data and meta-data may be rewritten into a single, interleaved, and optimized file for local or network access. Both the structured and the optimized files are valid QuickTime files, and both may be inspected, played, and reworked.
The use of structured (“non-flat”) files enables the same basic media data to be used and re-used in any number of presentations. This same advantage applies when serving, as will be seen below.
In both editing and serving, this also permits a number of other files to be treated as part of a movie without copying the media data. Thus editing and serving may be done directly from files such as Sun Microsystems' “au” audio format or the AVI video format, greatly extending the utility of these formats.
The QuickTime file is divided into a set of objects, called atoms. Each object starts with an atom header, which declares its size and type:
class Atom {
    int(32) size;
    char type[4];
    byte contents[ ];
}
The size is in bytes, including the size and type header fields. The type field is four characters (usually printable), to permit easy documentation and identification. The data in an object after the type field may be fields, a sequence of contained objects, or both.
A file therefore is simply a sequence of objects:
class File {
    Atom[ ];
}
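The atom layout above can be sketched with a small parser. The following Python is a minimal illustration of the described structure only: it assumes the basic 32-bit size form, and omits the 64-bit extended-size variant and other refinements of the actual specification. In keeping with the rule that readers should skip unknown objects, the parser simply reports each atom's type and payload without interpreting it.

```python
import struct

def iter_atoms(data, offset=0, end=None):
    """Yield (type, payload_offset, payload_size) for each atom in
    data[offset:end]. Each atom starts with a 32-bit big-endian size
    (which includes the 8-byte header) and a four-character type."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size, atype = struct.unpack_from(">I4s", data, offset)
        if size < 8 or offset + size > end:
            break  # malformed or truncated atom; stop parsing
        yield atype.decode("latin-1"), offset + 8, size - 8
        offset += size
```

Because container atoms simply hold a sequence of child atoms, the same routine can be re-applied to the payload of a ‘moov’ or ‘trak’ atom to walk the hierarchy.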
The two important top-level objects are the media-data (mdat) and the meta-data (moov).
The media-data object(s) contain the actual media (for example, sequences of sound samples). Their format is not constrained by the file format; they are not usually objects. Their format is described in the meta-data, not by any declarations physically contiguous with them. So, for example, in a movie consisting solely of motion-JPEG, JPEG frames are stored contiguously in the media data with no intervening extra headers. The media data within the media data objects is logically divided into chunks; however, there are no explicit chunk markers within the media data.
When the QuickTime file references media data in other files, it is not required that these ‘secondary’ files be formatted according to the QuickTime specification, since such media data files may be formatted as if they were the contents of a media object. Since the QuickTime format does not necessarily require any headers or other information physically contiguous with the media data, it is possible for the media data to be files which contain ‘foreign’ headers (e.g. UNIX “.au” files, or AVI files) and for the QuickTime meta-data to contain the appropriate declarative information and reference the media data in the ‘foreign’ file. In this way the QuickTime file format can be used to update, without copying, existing bodies of material in disparate formats. The QuickTime file format is both an established format and is able to work with, include, and thereby bring forward, other established formats.
Free space (e.g. deleted by an editing operation) can also be described by an object. Software reading a file that includes free space objects should ignore such free space objects, as well as objects at any level which it does not understand. This permits extension of the file at virtually any level by introducing new objects.
The primary meta-data is the movie object. A QuickTime file has exactly one movie object which is typically at the beginning or end of the file, to permit its easy location:
class Movie {
    int(32) size;
    char type[4] = ‘moov’;
    MovieHeader mh;
    Atom contents[ ];
}
The movie header provides basic information about the overall presentation (its creation date, overall timescale, and so on). In the sequence of contained objects there is typically at least one track, which describes temporally presented data.
class Track {
    int(32) size;
    char type[4] = ‘trak’;
    TrackHeader th;
    Atom contents[ ];
}
The track header provides relatively basic information about the track (its ID, timescale, and so on). Objects contained in the track might be references to other tracks (e.g. for complex compositing), or edit lists. In this sequence of contained objects there may be a media object, which describes the media which is presented when the track is played.
The media object contains declarations relating to the presentation required by the track (e.g. that it is sampled audio, or MIDI, or orientation information for a 3D scene). The type of track is declared by its handler:
class handler {
    int(32) size;
    char type[4] = ‘hdlr’;
    int(8) version;
    bit(24) flags;
    char handlertype[4];     -- mhlr for media handlers
    char handlersubtype[4];  -- vide for video, soun for audio
    char manufacturer[4];
    bit(32) handlerflags;
    bit(32) handlerflagsmask;
    string componentname;
}
Within the media information there is likewise a handler declaration for the data handler (which fetches media data), and a data information declaration, which defines which files contain the media data for the associated track. By using this declaration, movies may be built which span several files.
At the lowest level, a sample table is used which relates the temporal aspect of the track to the data stored in the file:
class sampletable {
    int(32) size;
    char type[4] = ‘stbl’;
    sampledescription sd;
    timetosample tts;
    syncsampletable syncs;
    sampletochunk stoc;
    samplesize ssize;
    chunkoffset coffset;
    shadowsync ssync;
}
The sample description contains information about the media (e.g. the compression formats used in video). The time-to-sample table relates time in the track to the sample (by index) which should be displayed at that time. The sync sample table declares which of these are sync (key) samples, not dependent on other samples.
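The time-to-sample lookup can be sketched as follows. The run-length entry layout of (sample count, sample delta) pairs follows the ‘stts’ atom described above, but the function and variable names here are illustrative, not taken from the specification.

```python
def time_to_sample(entries, media_time):
    """Map a media time to a 1-based sample number using a
    time-to-sample table given as (sample_count, sample_delta) pairs."""
    sample = 1
    elapsed = 0
    for count, delta in entries:
        span = count * delta  # total duration covered by this run
        if media_time < elapsed + span:
            # The target time falls inside this run; divide by the
            # per-sample delta to find the sample within it.
            return sample + (media_time - elapsed) // delta
        elapsed += span
        sample += count
    return None  # time is beyond the track's duration
```

For example, with entries of three samples of duration 10 followed by two of duration 20, media time 35 falls in the fourth sample.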
The sample-to-chunk object declares how to find the media data for a given sample, and its description given its index:
class sampletochunk {
    int(32) size;
    char type[4] = ‘stsc’;
    int(8) version;
    bits(24) flags;
    int(32) entrycount;
    for (int i=0; i<entrycount; i++) {
        int(32) firstchunk;
        int(32) samplesperchunk;
        int(32) sampledescriptionindex;
    }
}
The sample size table indicates the size of each sample. The chunk offset table indicates the offset into the containing file of the start of each chunk.
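Taken together, the sample-to-chunk, sample size, and chunk offset tables locate a sample's bytes in the file. A minimal sketch follows; the names are illustrative, and the sample-description index is ignored for simplicity.

```python
def sample_file_offset(sample, stsc, sizes, chunk_offsets):
    """Return the file offset of a 1-based sample, given sample-to-chunk
    entries of (first_chunk, samples_per_chunk), per-sample sizes, and
    1-based chunk offsets."""
    # Walk the run-length sample-to-chunk entries to find the chunk
    # holding this sample.
    first_sample = 1
    for i, (first_chunk, per_chunk) in enumerate(stsc):
        if i + 1 < len(stsc):
            run_chunks = stsc[i + 1][0] - first_chunk
            if sample >= first_sample + run_chunks * per_chunk:
                first_sample += run_chunks * per_chunk
                continue  # sample lies in a later run of chunks
        k = (sample - first_sample) // per_chunk
        chunk = first_chunk + k
        first_in_chunk = first_sample + k * per_chunk
        break
    # Offset of the chunk, plus the sizes of earlier samples within it.
    offset = chunk_offsets[chunk - 1]
    for s in range(first_in_chunk, sample):
        offset += sizes[s - 1]
    return offset
```

Note that the chunk itself is only a logical grouping: the function returns a plain byte offset, and there is no chunk header to skip.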
Walking the above-described structure to find the appropriate data to display for a given time is fairly straightforward, generally involving indexing and adding. Using the sync table, it is also possible to back up to the preceding sync sample and roll forward, ‘silently’ accumulating deltas, to a desired starting point.
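The back-up step can be sketched as a binary search over the (sorted) sample numbers in the sync sample table. This is a sketch only, not an algorithm from the specification.

```python
import bisect

def decode_start(sync_samples, target_sample):
    """Given the sorted 1-based sample numbers from the sync sample
    table, return the sync (key) sample at or before target_sample.
    Decoding can start there and roll forward to the target."""
    i = bisect.bisect_right(sync_samples, target_sample) - 1
    return sync_samples[i] if i >= 0 else None
```

For example, with sync samples at 1, 30, and 60, a seek to sample 45 would begin decoding at sample 30.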
FIG. 1 shows the structure of a simple movie with one track. A similar diagram may be found in the QuickTime file format documentation, along with a detailed description of the fields of the various objects. QuickTime atoms (objects) are shown here with their type in a grey box, and a descriptive name above. This movie contains a single video track. The frames of video are in the same file, in a single chunk of data. It should be noted that the ‘chunk’ is a logical construct only; it is not an object. Inside the chunk are frames of video, typically stored in their native form. There are no required headers or fields in the video frames themselves.
FIG. 2 is a diagram of a self-contained file with both an audio and a video track. Fewer of the atoms are shown here, for brevity; the pointers from the tracks into the media data are, of course, the usual sample table declarations, which include timing information.
The QuickTime file format has a number of advantages, including:
    1) Scalability for size and bit-rates. The meta-data is flexible, yet compact. This makes it suitable for small downloaded movies (e.g. on the Internet) as well as providing the basis for a number of high-end editing systems.
    2) Physical structure is independent of the logical and temporal structure. This makes it possible to optimize the physical structure differently depending on the use the file will have. In particular, it means that a single file format is suitable for authoring and editing; downloading or placing on CD-ROMs; and for streaming.
    3) The file format has proven capable of handling a very broad variety of codec types and track types, including many not known at the time the format was designed. This proven ability to evolve in an upwards-compatible fashion is fundamental to the success of a storage format.
Scalable, or layered, codecs can be handled in a number of ways in the QuickTime file format. For a streaming protocol which supports scalability, the samples may be tagged with the layer or bandwidth threshold to be met for transmitting the samples.
Tracks which form a set of alternatives (e.g. different natural language sound tracks) can be tagged so that only one is selected for playback. The same structure can be used to select alternatives for streaming (e.g. for language selection). This capability is described in further detail in the QuickTime file format.
When QuickTime displays a movie or track, the appropriate media handler accesses the media data for a particular time. The media handler must correctly interpret the data stream to retrieve the requested data. For example, with respect to video media, the media handler typically traverses several atoms to find the location and size of a sample for a given media time. The media handler may perform the following:
    1. Determine the time in the media time coordinate system.
    2. Examine the time-to-sample atom to determine the sample number that contains the data for the specified time.
    3. Scan the sample-to-chunk atom to discover which chunk contains the sample in question.
    4. Extract the offset to the chunk from the chunk offset atom.
    5. Find the offset within the chunk and the sample's size by using the sample size atom.
It is often desirable to transmit a QuickTime file or other types of time related sequences of media data over a data communication medium, which may be associated with a computer network (e.g. the Internet). In many computer networks, the data which is transmitted into the network should generally be in a packet form. Normally, time related sequences of media data are not in the proper packetized format for transmission over a network. For example, media data files in the QuickTime format are not in a packetized format. Thus, there exists a need to collect the data, sometimes referred to as streaming data, into packets for transmission over a network.
One prior approach to address the problem of transmitting time related sequences of media data over a network is to send the media file over the network using a network or transmission protocol, such as the Hypertext Transfer Protocol (HTTP). Thus, the media file itself is sent from one computer system over the network to another computer system. However, there may be no desire to retain the media file at the receiving computer system. That is, when the media file is received and viewed or listened to at the receiving computer system, the user of that system may have no desire to store a copy of the file, for example, if the receiving system is a network computer or a computer with low storage capacity.
Another approach to solving the problem of how to collect data for transmission by packets over a network is to prepare a file which contains the network protocol data units for a particular transmission protocol. In a sense, such a file may be considered a packetized file, stored in essentially the same format in which it will be transmitted according to the particular transmission protocol. Performing this operation generally involves storing the file in a packetized form for a particular network protocol, at a particular data transmission rate, and for a particular media file format. Thus, for each different transmission protocol at a particular data transmission rate, the file will essentially be replicated in its packetized form. The fixed form of such files may restrict their applicability and compatibility and make it difficult to view such files locally. Thus, such an approach may greatly increase storage requirements in attempting to provide the file in various transmission protocols at various different data transmission rates. Moreover, each packetized file generated according to this alternative prior approach is generally limited to a particular media file format, and thus, other media file formats for the same media object (e.g. a digital movie) are typically packetized and stored separately on the sending computer system.
Yet another approach to solving the problem of how to stream time related sequences of media data is to perform the packetization of the media data on the transmitting system when required, according to the particular transmission protocol which is desired. This processing requires, in many cases, a considerable amount of time, and thus may slow the performance of the transmitting system.
Thus, it is desirable to provide an improved method and apparatus for transmitting time related sequences of media data.