The relative temporal alignment of video and audio signals is an important factor in the perceived quality of audio-video content. One common example, referred to as “lip sync,” is the relative temporal alignment between the moving image of a person's lips and the sound of speech uttered by that person. Various studies have shown that, if sound is related to a moving image, human observers are generally either unaware of or tolerant of differences in the relative temporal alignment of the image and sound, provided the difference is within some range. According to ITU-R Recommendation BT.1359-1, “Relative Timing of Sound and Vision for Broadcasting,” if a sound precedes an associated visual event by no more than about 20 msec. or follows an associated visual event by no more than about 95 msec., the difference in temporal alignment is generally imperceptible. If a sound precedes an associated visual event by more than about 90 msec. or follows the associated visual event by more than about 185 msec., the difference in temporal alignment is perceived and generally found to be unacceptable. For purposes of this disclosure, video and audio signals are regarded as having the proper temporal alignment, or as being in synchronization with one another, if any difference in the relative alignment is either imperceptible or at least acceptable to a broad class of human observers.
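The thresholds quoted above can be expressed as a simple classification. The following is a minimal sketch, not part of the Recommendation or of any method described in this disclosure; the function name, the sign convention (positive values mean the sound precedes the visual event), and the label "intermediate" for offsets that fall between the two sets of thresholds are assumptions made only for illustration.

```python
# Thresholds in milliseconds, as quoted in the paragraph above.
IMPERCEPTIBLE_LEAD_MS = 20   # sound precedes the visual event
IMPERCEPTIBLE_LAG_MS = 95    # sound follows the visual event
UNACCEPTABLE_LEAD_MS = 90
UNACCEPTABLE_LAG_MS = 185


def classify_av_offset(offset_ms: float) -> str:
    """Classify an audio/video offset (hypothetical helper).

    Assumed sign convention: offset_ms > 0 means the sound precedes the
    associated visual event; offset_ms < 0 means the sound follows it.
    """
    lead = max(offset_ms, 0.0)    # amount by which the sound is early
    lag = max(-offset_ms, 0.0)    # amount by which the sound is late
    if lead <= IMPERCEPTIBLE_LEAD_MS and lag <= IMPERCEPTIBLE_LAG_MS:
        return "imperceptible"
    if lead > UNACCEPTABLE_LEAD_MS or lag > UNACCEPTABLE_LAG_MS:
        return "unacceptable"
    # Assumed label: the quoted text does not name this middle range.
    return "intermediate"
```

Under these assumptions, a sound that follows its visual event by 90 msec. would be classified as imperceptible, whereas a sound that precedes it by the same amount would fall in the intermediate range, which reflects the asymmetry of the quoted thresholds.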
Unfortunately, many methods and systems that process, distribute and present audio-video content include mechanisms that cause proper synchronization to be lost. In broadcasting, for example, video and audio signals are usually synchronized at the point of signal capture, such as in a studio, but these signals are often processed prior to broadcast transmission, and this processing can cause loss of synchronization. For example, analog video and audio signals may be converted into digital forms and processed by perceptual encoding methods to reduce the bit rate or bandwidth needed to transmit the content. Processes such as chroma-keying may be used to merge images from multiple video signals. Ancillary audio signals may be mixed with or replace the original audio signal. Many of these and other processes introduce delays in the signal processing paths. If processing delays are not precisely equal in the video signal processing path and the audio signal processing path, loss of synchronization is inevitable. In addition, synchronization is often lost if video and audio signals are distributed independently through different channels.
To address these problems, a variety of techniques have been proposed and used that search for matches between received video/audio content and reference video/audio content known to be synchronized, calculate a change in temporal alignment between the received video and audio content relative to the alignment of the reference content, and delay the received video content or the received audio content to reestablish synchronization. One limitation of these known techniques is that they do not account for the reliability of the match or the reliability of the calculated change in alignment.
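The calculate-and-delay step described above can be illustrated as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular known technique: it assumes that the search for matches has already produced, for one piece of matched content, a position in milliseconds within each of the four streams, and the function and parameter names are hypothetical. Consistent with the limitation noted above, the sketch takes the match as given and applies no measure of its reliability.

```python
from typing import Tuple


def compute_correction(
    ref_video_ms: float,
    ref_audio_ms: float,
    recv_video_ms: float,
    recv_audio_ms: float,
) -> Tuple[str, float]:
    """Return which received stream to delay, and by how much (hypothetical).

    Each argument is the assumed position, in milliseconds, at which the
    matched content occurs in the corresponding stream.
    """
    # Alignment is expressed here as video position minus audio position.
    reference_alignment = ref_video_ms - ref_audio_ms
    received_alignment = recv_video_ms - recv_audio_ms

    # Change in alignment of the received content relative to the reference.
    change = received_alignment - reference_alignment

    if change > 0:
        # Video now arrives later relative to audio: hold the audio back.
        return ("audio", change)
    if change < 0:
        # Audio now arrives later relative to video: hold the video back.
        return ("video", -change)
    return ("none", 0.0)
```

For example, if the reference streams are aligned and the received video content occurs 40 msec. later relative to the received audio content, the sketch indicates that the audio should be delayed by 40 msec. Only a delay is ever applied because a stream that has fallen behind cannot be advanced.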