1. Field of the Invention
This invention relates generally to the field of data processing systems. More particularly, the invention relates to an improved system and method for translating media files between formats using a universal representation.
2. Description of the Related Art
A variety of different file formats exist for digital audio, such as the ISO MPEG-4 file format (.mp4 files), the iTunes file format (.m4a files), and the QuickTime file format (.mov files). While these file formats maintain indexing information and other metadata somewhat differently, all three file types may use the same codec to encode the underlying audio content (e.g., Advanced Audio Coding (AAC)).
The process of encoding a source audio stream into a sequence of AAC audio packets (the compressed domain) introduces some amount of “encoder delay” (sometimes called “priming” and measured in audio samples). When these audio packets are subsequently decoded back to the Pulse Code Modulation (PCM) domain, the source waveform will be offset in its entirety by this encoder delay amount. Additionally, each encoded audio packet will typically carry a fixed number of audio samples (e.g., 1024), so additional trailing or “remainder” samples may be required following the last source sample to pad the final audio packet.
Technically, AAC encoding uses a transform over consecutive sets of 2048 samples, applied every 1024 samples (hence overlapped). For correct audio to be decoded, both transforms covering any period of 1024 samples are needed. For this reason, encoders add at least 1024 samples of silence before the first “true” audio sample, and often add more (commonly 2112 samples, for various reasons). The term “encoder delay” used to refer to these samples is perhaps confusing, as it refers to an offset and extra data in the encoded stream, and not (for example) to a real-time delay, or a delay between feeding data into an encoder or decoder and getting data out. However, the term “encoder delay” is commonly used by those of ordinary skill in the art and will be used in the present application.
By way of example, FIG. 1a illustrates an uncompressed source audio that may be encoded to a sequence of AAC audio packets (aka “access units” or “AUs”) as shown in FIG. 1b. As illustrated, the audio is quantized into packets and offset by the “priming” duration. Additionally, there may be “remainder” samples following the end of the source to account for filling to the packet sample count size.
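By way of illustration only, the packet arithmetic described above may be sketched as follows. The function name packet_layout and its parameters are hypothetical and are not drawn from any particular encoder; the sketch assumes only that every audio packet carries the same fixed number of samples.

```python
def packet_layout(source_samples, priming, samples_per_packet=1024):
    """Return (packet_count, remainder) for an encoded stream.

    Hypothetical helper: assumes every packet carries exactly
    samples_per_packet samples and that the priming samples precede
    the first source sample.
    """
    if source_samples < 0 or priming < 0 or samples_per_packet <= 0:
        raise ValueError("sample counts must be non-negative, packet size positive")
    needed = priming + source_samples
    # Integer ceiling division: number of packets required to hold
    # the priming samples plus the source samples.
    packet_count = -(-needed // samples_per_packet)
    # Padding samples that follow the last source sample.
    remainder = packet_count * samples_per_packet - needed
    return packet_count, remainder


print(packet_layout(240000, 2112))  # (237, 576)
```

With a source of 240000 samples and an encoder delay of 2112 samples, the sketch yields 237 packets and a remainder of 576 samples.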
Consequently, to recover and present the original waveform from the compressed audio packets, it is necessary to trim the decoded audio samples within this encoder delay period and to trim any remainder audio samples as shown in FIG. 1c. Additionally, this overhead should not be counted in the duration of the track, as these samples are an artifact of the encoding process and do not represent useful signal. Because the amount of encoder delay may vary depending upon the encoder (software or hardware) and the encoder configuration used, the media container storing the audio content must indicate the placement of the source signal in the compressed stream.
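The trimming step may be sketched as follows, assuming the decoded PCM samples are available as a simple sequence and that the priming and remainder counts have already been obtained from the container. The name trim_decoded is hypothetical, and the zero-valued padding in the example is a toy stand-in for whatever a real decoder outputs in those regions.

```python
def trim_decoded(decoded, priming, remainder):
    """Drop the encoder-delay samples at the start and the padding at the end.

    Hypothetical helper: decoded is any sliceable sequence of PCM samples.
    """
    if priming < 0 or remainder < 0:
        raise ValueError("priming and remainder must be non-negative")
    if priming + remainder > len(decoded):
        raise ValueError("trim amounts exceed the decoded length")
    # Compute the end index explicitly so that remainder == 0 keeps the
    # tail intact (a slice ending at -0 would return an empty sequence).
    end = len(decoded) - remainder
    return decoded[priming:end]


# Toy example: 3 priming samples, 10 source samples, 3 remainder samples.
source = list(range(1, 11))
decoded = [0] * 3 + source + [0] * 3
print(trim_decoded(decoded, 3, 3) == source)  # True
```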
The three file formats mentioned above, .mp4, .m4a, and .mov, each use a different media container format to indicate the placement of the original source signal. An .mp4 file, for example, typically uses an “edit list” data structure to indicate which range of time from the access units to present. An .m4a file does not use an edit list but instead uses metadata associated with the file containing “priming,” “duration,” and “remainder” values to indicate the location of audio content within the file. Finally, a .mov file uses an edit list but includes an implicit offset to identify the start of the audio content within the file.
By way of example, and not limitation, FIGS. 2a-c illustrate how this bookkeeping data is stored using these three file types assuming 1024 samples per access unit (aka “audio packet”), an encoder delay of 2112 samples, an audio sample duration of 240000 samples, and a remainder of 576 samples. As shown in FIG. 2a, the Edit List (aka “EditListBox”) data structure 201 specifies an edit media start time of 2112 samples with a duration of 240000 samples. The remainder of 576 samples is implied rather than stored: each access unit is 1024 samples, and the above example requires 237 access units given the encoder delay and audio sample duration (i.e., 237 AUs*1024 samples/AU=242688 samples total and 242688−(2112+240000)=576). As shown in FIG. 2b, instead of an Edit List data structure, the .m4a file includes metadata 202 with a priming value of 2112 samples, a duration of 240000 samples, and a remainder of 576 samples. Finally, as illustrated in FIG. 2c, the .mov file uses an Edit List data structure, but the value of 2112 samples is implied for the encoder delay. That is, the Edit List data structure 203 specifies an edit media start time of zero even though the audio content contained in the file does not start until after the first 2112 audio samples. Thus, any application reading the .mov file must know to add 2112 audio samples to the edit media start time specified in the edit list (i.e., the entirety of an edit to a .mov file is shifted by an implicit 2112 audio samples).
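To make the differences among the three bookkeeping conventions concrete, the following sketch reduces each one to the same (priming, duration, remainder) triple. It is illustrative only: it does not parse real files, the class and function names are hypothetical, and the access unit count is assumed to be known from elsewhere in the container.

```python
from dataclasses import dataclass

SAMPLES_PER_AU = 1024      # samples per access unit, as in the example
MOV_IMPLICIT_DELAY = 2112  # implicit .mov offset described in the text


@dataclass(frozen=True)
class Placement:
    """Location of the source signal within the decoded stream."""
    priming: int
    duration: int
    remainder: int


def from_edit_list(media_start, duration, au_count, implicit_offset=0):
    """Derive the triple from an edit list entry and the access unit count."""
    priming = media_start + implicit_offset
    remainder = au_count * SAMPLES_PER_AU - (priming + duration)
    if remainder < 0:
        raise ValueError("edit extends past the end of the access units")
    return Placement(priming, duration, remainder)


def from_mp4(media_start, duration, au_count):
    # .mp4: the edit list states the start time explicitly.
    return from_edit_list(media_start, duration, au_count)


def from_m4a(priming, duration, remainder):
    # .m4a: all three values are stored directly as metadata.
    return Placement(priming, duration, remainder)


def from_mov(media_start, duration, au_count):
    # .mov: the edit list start is shifted by the implicit delay.
    return from_edit_list(media_start, duration, au_count, MOV_IMPLICIT_DELAY)


mp4 = from_mp4(2112, 240000, 237)
m4a = from_m4a(2112, 240000, 576)
mov = from_mov(0, 240000, 237)
print(mp4 == m4a == mov)  # True
```

Using the values of FIGS. 2a-c, all three conventions reduce to a priming of 2112 samples, a duration of 240000 samples, and a remainder of 576 samples.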