Several international standards have been developed which define various aspects of embedding digital audio information into frames of video information. For example, standard SMPTE 259M published by the Society of Motion Picture and Television Engineers (SMPTE) defines a Serial Digital Interface (SDI) in which up to four channels of digital audio information may be embedded into component and composite serial digital video signals. Standard SMPTE 272M provides a full definition of how digital audio information is to be embedded in ancillary data spaces within frames of the video information.
The serial transmission of digital audio information itself is the subject of various international standards. For example, standard AES3 (ANSI S4.40), published by the Audio Engineering Society (AES), defines serial transmission of two-channel digital audio represented in a linear pulse code modulation (PCM) form. According to this standard, PCM samples for two channels are interleaved and conveyed in pairs.
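The pairwise interleaving described above can be illustrated with a minimal sketch. It models only the ordering of samples, not the actual AES3 subframe format; the function names `interleave_pairs` and `deinterleave_pairs` are hypothetical and chosen for illustration.

```python
def interleave_pairs(left, right):
    """Interleave two equal-length channels into [L0, R0, L1, R1, ...].

    Each (L, R) pair plays the role of one two-sample frame.
    Simplified illustration only: real AES3 subframes carry more
    than the bare sample value.
    """
    if len(left) != len(right):
        raise ValueError("both channels must have the same number of samples")
    stream = []
    for l_sample, r_sample in zip(left, right):
        stream.extend((l_sample, r_sample))
    return stream


def deinterleave_pairs(stream):
    """Split an interleaved stream back into its two channels."""
    if len(stream) % 2:
        raise ValueError("stream must contain whole two-sample frames")
    return stream[0::2], stream[1::2]


frames = interleave_pairs([1, 2, 3], [10, 20, 30])
print(frames)  # [1, 10, 2, 20, 3, 30]
```

Because every frame holds exactly one sample per channel, a cut between any two frames leaves both channels intact.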
A common activity in nearly all recording and broadcasting applications is editing or cutting embedded video/audio information streams and splicing the cut information streams to form a new single stream. Similar activities generate an information stream by merging multiple information streams or by switching between multiple streams. The video information is normally the primary synchronizing reference so that an edit or cut point is normally aligned with a video frame.
Standards such as AES11 define recommended practices for synchronizing digital audio equipment in studio operations. AES11 is directed toward controlling timing uncertainties caused by jitter or processing delays and provides for aligning video frame information with the two-sample frames of AES3 digital audio information streams. Equipment and methods that adhere to this standard can ensure that synchronized signals have the same number of frames over a given period of time and contain samples that have a common timing. Unfortunately, no standards or practices currently exist that define an alignment between video information and larger intervals of audio information. As a result, equipment from different manufacturers, and even from the same manufacturer, exhibits variations in timing and in processing delays that introduce a significant amount of uncertainty in the relative alignment of audio and video information.
This uncertainty in alignment is of little consequence in applications that use linear representations of audio information such as the one defined in the AES3 standard. Because edit points are constrained to occur between the two-sample frames of audio information, any uncertainty in video/audio alignment will not result in the loss of audio information. It will affect only the relative timing of sound and picture as presented to a person, which is unlikely to be discernible.
There is, however, a growing number of applications that use bit-rate-reduction encoding techniques to embed greater numbers of audio channels into a video/audio data stream. These encoding techniques are often applied to sample blocks of 128 or more audio samples to generate blocks of encoded information. These sample blocks typically represent audio information that spans an interval of 3 to 12 ms. Each block of encoded information generated by these encoding processes represents the smallest unit of information from which a reasonably accurate replica of a segment of the original audio information can be recovered. Split-band coding techniques reduce bit rates by applying psychoacoustic-based coding to frequency-subband representations of an audio signal. The frequency-subband representations may be generated by application of a plurality of bandpass filters or one or more transforms. For ease of discussion, these split-band coding techniques are described here in terms of applying a filterbank to generate subband signals.
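The relationship between block length and the time interval a block spans can be checked with a short calculation. The sketch below is illustrative only: the 48 kHz sample rate and the block lengths of 256 and 512 samples are assumptions not stated in the text, and `block_duration_ms` is a hypothetical helper name.

```python
def block_duration_ms(block_size, sample_rate_hz):
    """Time spanned by one block of audio samples, in milliseconds."""
    if block_size <= 0 or sample_rate_hz <= 0:
        raise ValueError("block size and sample rate must be positive")
    return 1000.0 * block_size / sample_rate_hz


# Assumed values for illustration: 48 kHz sampling, power-of-two block lengths.
for size in (128, 256, 512):
    print(size, round(block_duration_ms(size, 48000), 2))
# 128 2.67
# 256 5.33
# 512 10.67
```

Under these assumptions the block durations fall roughly within the 3 to 12 ms interval mentioned above.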
The uncertainty in alignment mentioned above is significant in these block-coding applications because an edit point falling within the boundaries of an encoded block will result in part of that block being cut from the remaining signal. The partial loss of an encoded block will be manifested by a loss in the recovered signal for a duration typically of 3 ms or more. It is likely that such a loss would be discernible to the human auditory system.
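The effect of an edit point that lands inside an encoded block can be sketched as follows. This is a simplified model under stated assumptions: blocks are contiguous and non-overlapping, the edit point is given as a sample index counted from the start of the first block, and a partially cut block cannot be decoded at all. The names `locate_edit` and `audio_lost_ms`, the 256-sample block length, and the 48 kHz sample rate are hypothetical.

```python
def locate_edit(edit_sample, block_size):
    """Return (block_index, offset_within_block) for an edit point.

    An offset of 0 means the edit falls exactly on a block boundary.
    """
    if edit_sample < 0 or block_size <= 0:
        raise ValueError("edit_sample must be >= 0 and block_size > 0")
    return divmod(edit_sample, block_size)


def audio_lost_ms(edit_sample, block_size, sample_rate_hz):
    """Duration of audio that cannot be recovered after the cut.

    Assumption: a block that is cut anywhere inside its boundaries
    is lost in its entirety; a cut on a boundary loses nothing.
    """
    _, offset = locate_edit(edit_sample, block_size)
    if offset == 0:
        return 0.0
    return 1000.0 * block_size / sample_rate_hz


# Assumed values: 256-sample blocks at 48 kHz.
print(locate_edit(1000, 256))                      # (3, 232)
print(round(audio_lost_ms(1000, 256, 48000), 2))   # 5.33
print(audio_lost_ms(1024, 256, 48000))             # 0.0
```

In this model, only an edit point aligned to a block boundary avoids the loss, which is why the relative alignment of video frames and encoded audio blocks matters.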
This problem may be avoided by post-processing, in which a PCM representation of the original audio signals is recovered by applying a decoding process to the encoded audio, the recovered PCM representation is edited as required, and a new encoded representation is generated by applying an encoding process to the edited PCM audio information. This solution is unattractive because of the additional costs and degradation in audio quality resulting from the decoding/re-encoding processes. In addition, for reasons that will be better understood after reading the discussion set forth below, post-processing is unattractive because the decoding/re-encoding processes introduce additional delays in the audio information stream.