The ability to precisely synchronize audio and video data is crucial to the electronics, entertainment and communications industries. However, substantial design challenges remain inherent to the digital signal processing (DSP) techniques used to achieve synchronicity. For example, audio signals must be separated and independently processed from their corresponding video signals. Further, the processing times of the audio and video data vary as functions of both their respective sampling rates and of the hardware used in processing applications. Still, industry standards demand that the playback of the audio and video be synchronized, providing for a coordinated and coherent reproduction of the source material.
A program source often formats the audio and video data in respective data packets according to Moving Picture Experts Group (MPEG) principles. This format allows each of the audio and video data packets to be received from the source in a continuous data stream for ease of storage and transmission. Packets of video data separated from the data stream include header blocks that are followed by data blocks. The data blocks may include a full field of video data or a coded group of pictures that includes its own header block identifying the picture type and display order. The header block for a video data packet includes control information, such as format identification and compression information, picture size, display order, and other global parameters.
Similarly, audio data packets have header blocks that identify the format of the audio data along with instructions relating to the encoding parameters of the audio samples. Such parameters include bit rate, compression information, as well as sampling frequency identification. Additional processing instructions may be provided for desired enhancements, if applicable. Following the header block, the audio data packet contains any number of audio frames corresponding to the video data.
Selected header blocks include presentation time stamp (PTS) values that indicate the time at which a decoded frame of video data or a batch of audio samples is to be presented. The time stamp value is a time reference to a system time clock that was running during the creation or recording of the audio and video data. A similar system time clock is also running during the playback of the audio and video data.
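The relationship between a time stamp and the system time clock can be illustrated with a minimal sketch. It assumes that time stamps are counted in ticks of a 90 kHz clock, as is conventional for MPEG program streams; the function names are hypothetical, and counter wraparound is ignored for brevity.

```python
from fractions import Fraction

# Assumption: time stamps count ticks of a 90 kHz clock (MPEG convention).
PTS_CLOCK_HZ = 90_000


def pts_to_seconds(pts: int) -> Fraction:
    """Convert a time stamp in clock ticks to seconds, exactly."""
    return Fraction(pts, PTS_CLOCK_HZ)


def av_offset_seconds(audio_pts: int, video_pts: int) -> Fraction:
    """Positive result: the audio batch is stamped later than the video frame."""
    return pts_to_seconds(audio_pts) - pts_to_seconds(video_pts)


if __name__ == "__main__":
    # 3003 ticks is one frame period at 30000/1001 (about 29.97) frames per second.
    print(float(av_offset_seconds(93_003, 90_000)))
```

Exact fractions are used so that repeated comparisons against the clock do not accumulate floating-point error.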
During the decoding of the audio data, audio samples must normally be decompressed, reconstructed and enhanced in a manner consistent with the source of program material and the capabilities of the sound reproduction system. In some applications, audio data packets may contain up to six channels of raw audio data. Depending on the number of channels the sound reproduction system can reproduce, the system selectively uses the channels of raw audio data to provide a number of channels of audio that are then stored in an audio first-in, first-out (FIFO) memory. The decoding of the video data likewise requires decompression, as well as the conversion of partial frames into full frames prior to storage in a video FIFO.
The FIFOs have write and read pointers that are controlled by a memory controller. The controller, in turn, is under the general control of a CPU. The write pointers are driven according to the requirements of the demultiplexing process, which sequentially delivers data to each of the FIFOs. The read pointers are driven as a function of independent and parallel decoding processes, which sequentially read data from the FIFOs. While the data are being loaded into the FIFO memories by the demultiplexing process, audio and video data are simultaneously read, in parallel, from the respective FIFOs during the decoding and playback processes.
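The pointer arrangement described above can be sketched as a simple ring buffer. This is an illustrative model only: the class name, the fixed capacity and the single-threaded access are assumptions, and a real memory controller would arbitrate concurrent reads and writes in hardware.

```python
class Fifo:
    """Hypothetical fixed-size FIFO with independent write and read pointers."""

    def __init__(self, capacity: int) -> None:
        self._buf = [None] * capacity
        self._capacity = capacity
        self._write = 0  # advanced by the demultiplexing side
        self._read = 0   # advanced by the decoding side
        self._count = 0  # current fullness

    def push(self, item) -> bool:
        """Store one item; return False if the FIFO is full (overflow)."""
        if self._count == self._capacity:
            return False
        self._buf[self._write] = item
        self._write = (self._write + 1) % self._capacity
        self._count += 1
        return True

    def pop(self):
        """Return the oldest item, or None if the FIFO is empty (underflow)."""
        if self._count == 0:
            return None
        item = self._buf[self._read]
        self._read = (self._read + 1) % self._capacity
        self._count -= 1
        return item

    @property
    def fullness(self) -> int:
        return self._count
```

Tracking fullness separately from the two pointers makes the full and empty conditions unambiguous when the pointers coincide.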
A host, or suitable microprocessor, monitors the audio and video decoding processes and coordinates the rates at which the video and audio data are output to the multiplexor for eventual combination. The output frequency of audio blocks is calculated by dividing the audio sampling rate by the number of samples in the audio block. The output frequency of the video signal is slaved to the video synchronization signal. Ideally, the sampling intervals at which the video data and the audio samples are decoded would coincide. Further, if the audio and video data could be processed and played back at the times represented by their time stamps, the data would be presented to the user in the desired, synchronized manner.
However, the differences in the processing of the audio and video data in separate, parallel bit streams do not facilitate such precise timing control. The loss of synchronicity is in part attributable to a sampling discrepancy between the video synchronization signal and the audio sampling rate. Namely, the frame rate of the video signal is 29.97 Hz, while audio samples clock at 32 kHz, 44.1 kHz or 48 kHz. Consequently, each video frame spans a fractional number of 32 kHz, 44.1 kHz or 48 kHz audio samples. The inherent sampling size differential translates into a loss of synchronization on the order of one part per thousand, which corresponds to the fractional rate offset of 525/60 video relative to its nominal 60 Hz field rate, i.e., 60.0 Hz * 1000/1001 = 59.94 Hz. This sampling disparity causes the analog/digital converter to incrementally read the audio and video out of synchronicity. Over time, accumulated losses of synchronization can compound to the point where the loss of synchronization is perceptible to the user.
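The fractional relationship between the audio sample rates and the video frame rate can be checked with a short calculation. The sketch below assumes that the 29.97 Hz frame rate is exactly 30000/1001 Hz, consistent with the 1000/1001 factor given above; the function names are illustrative only.

```python
from fractions import Fraction

# Assumption: the 29.97 Hz frame rate is exactly 30 Hz * 1000/1001.
VIDEO_FRAME_RATE = Fraction(30_000, 1001)


def samples_per_frame(sample_rate_hz: int) -> Fraction:
    """Exact number of audio samples spanned by one video frame."""
    return Fraction(sample_rate_hz) / VIDEO_FRAME_RATE


def frames_until_whole(sample_rate_hz: int) -> int:
    """Smallest number of frames that spans a whole number of audio samples."""
    return samples_per_frame(sample_rate_hz).denominator


if __name__ == "__main__":
    for rate in (32_000, 44_100, 48_000):
        print(rate, float(samples_per_frame(rate)), frames_until_whole(rate))
```

Under this assumption, no single frame contains a whole number of samples at any of the three rates; the audio and video boundaries realign only after several frames.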
DSP techniques are used to compensate for differences between the audio and video sampling rates. One method of mitigating processing error involves manipulating the buffer rate, or the rate at which data is transferred and accepted into the decoder buffer. A similar rate adjustment may be effected when the data is transferred out of the buffer. In the case of video, this can be done by adjusting the frame rate. In the case of audio, this is accomplished by adjusting the sampling rate. However, such rate adjustments involve extensive programming and processing delays. Further, adjustments of the decoder and transfer bit rate are restricted by characteristics of the peripheral hardware. Therefore, if the buffer error (i.e., the deviation from the ideal buffer fullness) is too large, the appropriate control can become difficult or impossible.
Other DSP techniques skip or repeat frames of video data or batches of audio samples in order to control the buffer output data rate. Still another method adjusts the system time clock prior to repeating frames.
However, such applications, while achieving synchronization, sacrifice precision by materially altering a portion of the original source data.
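A skip-or-repeat decision of the kind described above might be sketched as follows. The threshold, the units and the function name are hypothetical and are not taken from any particular decoder; the sketch also shows why such techniques alter the source material, since any outcome other than "play" discards or duplicates data.

```python
def sync_action(frame_pts: int, clock: int, threshold: int) -> str:
    """Choose an action for the next frame (all values in the same clock ticks).

    Assumed policy: a frame stamped well ahead of the system time clock is
    held back by repeating the previous frame; a frame stamped well behind
    the clock is dropped; otherwise the frame is played as-is.
    """
    error = frame_pts - clock
    if error > threshold:
        return "repeat"  # frame is early: show the previous frame again
    if error < -threshold:
        return "skip"    # frame is late: discard it to catch up
    return "play"
```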
Other techniques for achieving synchronization involve reducing the audio sample rate by one part per thousand, rounding up the published rate, i.e., by not publishing enough significant digits to show the error, and calling that rate “synchronized to video.” Thus 44.056 kHz becomes “44.1 kHz synchronized to video” and 47.952 kHz becomes “48 kHz synchronized to video.” However, this approach can be misleading to the consumer and is incompatible with standard sample rates.
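The reduced rates quoted above follow from applying the 1000/1001 factor to the standard sample rates, as the following minimal sketch shows. The function name is illustrative only.

```python
from fractions import Fraction


def video_locked_rate(nominal_hz: int) -> Fraction:
    """Nominal sample rate reduced by the 1000/1001 factor, in Hz."""
    return Fraction(nominal_hz) * 1000 / 1001


if __name__ == "__main__":
    for nominal in (44_100, 48_000):
        khz = float(video_locked_rate(nominal)) / 1000
        print(f"{nominal / 1000} kHz nominal -> {khz:.3f} kHz actual")
```

Printing the result to three decimal places exposes the difference that the rounded, published figure conceals.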
Still another technique blocks the audio data into unequal frames of audio. For instance, digital video tape recorders format data into a five frame, i.e., ten field, sequence using multiple, unequal audio frames of 160 and 161 samples. This unequal block format also requires a separate linear control track containing the frame sequence, and is suboptimal for field-based digital disk recording.
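One way such an unequal block sequence could arise is sketched below. The figures are assumptions made for illustration and are not stated above: a 48 kHz sample rate, under which five frames at 30000/1001 Hz span exactly 8008 samples, divided into 50 audio blocks. The function name is hypothetical.

```python
def block_sizes(total_samples: int, n_blocks: int) -> list[int]:
    """Split total_samples into n_blocks sizes that differ by at most one."""
    return [
        (i + 1) * total_samples // n_blocks - i * total_samples // n_blocks
        for i in range(n_blocks)
    ]


if __name__ == "__main__":
    # Assumed figures: 8008 samples per five-frame sequence, 50 blocks.
    sizes = block_sizes(8008, 50)
    print(sorted(set(sizes)), sizes.count(161), sum(sizes))
```

Because the position of the longer blocks within the sequence must be known at playback, a scheme of this kind needs some record of the sequence phase, which is the role of the separate control track noted above.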
The same buffers that play an integral role in the above DSP techniques are themselves susceptible to storage and transfer errors that contribute to synchronization loss. A common example of such an error results from the varying processing requirements of individual audio DSP microchips. Namely, every chip requires a unique amount of start-up time prior to encoding in order to prepare for the encoding parameters of incoming data. Encoding parameters identify such critical encoding characteristics as the sampling frequency and bit rate of a frame, which together determine the compression ratio. Thus, inconsistent start-up delays between audio and video DSP microchips conducting parallel applications further disrupt synchronization efforts.
The graph of FIG. 1 illustrates the relative timing activities and inconsistencies of an audio and video encoder in accordance with the prior art.
For purposes of the graph, an elevated value of an encoder signal indicates that the encoder is actively processing a data packet. For example, the video encoder signal 102 of FIG. 1 indicates that the video encoder begins encoding a video data packet coincident with the rising edge of the signal at t=5. A corresponding video synchronization clock signal 103 is also depicted for comparison purposes.
The graph shows the disparity between the activities of the audio and video encoders that results in the audio packet being encoded at a point some n samples after the video encoding process was initiated. In temporal terms of the graph, the video encoding signal 102 becomes active at t=5, while the audio encoding signal 101 does not become active until t=5+n. As further evidenced by the encoding signals 101, 102, the video encoding signal 102 for a data packet ends at t=7, while the audio encoding signal 101 continues until t=7+m. As discussed above, these encoding differentials cause a loss of synchronization between the audio and video signals.
Consequently, in a system such as that described above, there is a need to improve the synchronization of digital audio with digital video in a manner that does not require repeating or losing data, restricting the sample rate, or relying upon unequal block formatting.