When related audio and video content are rendered together (to be observed at the same time), the audio and video signals must be time aligned or the observer will perceive a lip sync error. This error is named lip sync because observers are keenly aware of it when the sight of a person's lips does not match the timing of the accompanying sound of the person's voice. Lip sync error arises when the audio and video signals are presented with different amounts of delay. The error is corrected by delaying the earlier signal (almost always the audio signal).
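The correction described above amounts to padding the earlier stream so both streams arrive together. A minimal sketch in Python, assuming PCM audio at 48 kHz (the function name and sample rate are illustrative assumptions, not from the source):

```python
# Sketch: correct lip sync error by delaying the earlier signal,
# here the audio, by prepending silence. All names are hypothetical.

SAMPLE_RATE_HZ = 48_000  # assumed professional audio sample rate

def delay_audio(samples, delay_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Delay an audio signal by delay_ms milliseconds.

    samples: sequence of PCM sample values
    delay_ms: how far the audio currently leads the video
    """
    pad = round(delay_ms * sample_rate_hz / 1000)
    return [0] * pad + list(samples)

# Audio leading video by 1 ms gains 48 samples of leading silence.
aligned = delay_audio([0.1, 0.2, 0.3], delay_ms=1)
```

In practice the delay would be applied with a circular buffer rather than by copying, but the arithmetic (milliseconds to samples) is the same.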
Lip sync error is a concern because video processing generally induces delays that are significantly longer than audio processing delays. Some studies indicate observers will notice lip sync errors where the audio leads (is more advanced than) the video by more than 45 ms (milliseconds) and where the audio lags (trails) the video by more than 125 ms. The recommendation of the ATSC (Advanced Television Systems Committee) Implementation Subcommittee IS-191 is to align related audio and video signals to within the range of −15 ms (audio leads) to +45 ms (audio lags).
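The IS-191 window above can be expressed as a simple range check. A sketch, assuming the sign convention that negative offsets mean the audio leads the video (the constant and function names are illustrative):

```python
# Sketch: test a measured A/V offset against the ATSC IS-191
# recommendation quoted above. Convention (an assumption here):
# negative offset = audio leads video, positive = audio lags.

ATSC_LEAD_LIMIT_MS = -15  # audio may lead by at most 15 ms
ATSC_LAG_LIMIT_MS = 45    # audio may lag by at most 45 ms

def within_atsc_window(av_offset_ms):
    """Return True if the offset meets the IS-191 recommendation."""
    return ATSC_LEAD_LIMIT_MS <= av_offset_ms <= ATSC_LAG_LIMIT_MS
```

Note that the recommended window is deliberately tighter than the detection thresholds cited in the studies (45 ms lead / 125 ms lag), leaving headroom for additional delays accumulated downstream.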
Further, audio signals that are rendered together may produce a noticeable delay or echo if not sufficiently time aligned. A human observer will hear two sounds separated by a sufficiently short delay as a single, fused auditory image (the Haas effect). The maximum delay (called the echo threshold) varies according to the type of sound and circumstances, and may range from about 5 ms to 40 ms. Hence, audio signals that are rendered together may need a relative delay of less than about 40 ms to achieve auditory fusion. Even below the threshold for auditory fusion, comb filtering effects may be heard if the same audio signal is rendered by separate transducers that produce a relative delay of a few milliseconds or less.
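The comb filtering mentioned above can be demonstrated numerically: summing a signal with a copy of itself delayed by d samples cancels frequencies at odd multiples of fs/(2d). A sketch, where the sample rate and delay are illustrative assumptions:

```python
import math

# Sketch of comb filtering: a signal plus a copy delayed by D samples
# has nulls at odd multiples of FS / (2 * D). With FS = 48 kHz and
# D = 24 samples (a 0.5 ms relative delay, within the "few
# milliseconds" range cited above), the first null falls at 1 kHz.

FS = 48_000  # assumed sample rate, Hz
D = 24       # relative delay in samples (0.5 ms)

def comb_sum(freq_hz, n):
    """Sample n of a sine at freq_hz plus the same sine delayed D samples."""
    direct = math.sin(2 * math.pi * freq_hz * n / FS)
    delayed = math.sin(2 * math.pi * freq_hz * (n - D) / FS)
    return direct + delayed

# At the 1 kHz null, the delayed copy arrives exactly half a cycle
# late (out of phase), so the two copies cancel almost completely.
null_energy = sum(comb_sum(1000, n) ** 2 for n in range(D, 1000))
```

Frequencies away from the nulls pass through largely unattenuated, which is why the effect is heard as a distinctive "comb" coloration rather than as an echo.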