Human beings can detect very small differences in the synchronization between video and its accompanying audio soundtrack. People are especially good at recognizing a lack of synchronization between the lips of speakers in video media, such as in a video conference session, and the reproduced speech they hear. Lip-synchronization (“lipsync”) between audio and video is therefore essential to achieving a high-quality conferencing experience.
Past approaches for ensuring good lipsync include reliance upon the Real-Time Transport Control Protocol (RTCP) mapping between timestamps generated by the separate audio and video encoding devices and a Network Time Protocol (NTP) “wall clock” time. Although useful in achieving good lipsync for point-to-point audio/video sessions, this approach breaks down in situations such as those in which audio and video mixers are inserted for multi-party conferences. Other approaches to the problem of lipsync include co-locating the audio and video mixers on the same device, as is typically done in video multipoint control units (MCUs). However, when the audio mixer and the video mixer/switch are distributed, one or the other of the audio or video mixing devices typically must delay its output stream to ensure that the arrival times of the audio and video packets provide adequate lipsync, which can be difficult to achieve.
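The RTCP-based mapping described above can be illustrated with a minimal sketch. It assumes that each stream's most recent RTCP sender report supplies one pairing of an NTP wall clock time with a media timestamp, and that the clock rate of each stream is known. The names SenderReport, rtp_to_wallclock, and av_skew_seconds are hypothetical and chosen only for illustration; the NTP time is simplified to floating-point seconds rather than the 64-bit wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SenderReport:
    """One NTP/media timestamp pairing (hypothetical, simplified structure)."""
    ntp_seconds: float   # wall clock time carried in the report, in seconds
    rtp_timestamp: int   # 32-bit media timestamp sampled at that same instant
    clock_rate: int      # media clock rate in Hz (assumed known per stream)


def rtp_to_wallclock(sr: SenderReport, rtp_timestamp: int) -> float:
    """Map a media timestamp to wall clock seconds using one sender report."""
    # Signed 32-bit difference, so timestamps that wrapped around or that
    # precede the report are both handled.
    diff = (rtp_timestamp - sr.rtp_timestamp) & 0xFFFFFFFF
    if diff >= 0x80000000:
        diff -= 0x100000000
    return sr.ntp_seconds + diff / sr.clock_rate


def av_skew_seconds(audio_sr: SenderReport, audio_rtp: int,
                    video_sr: SenderReport, video_rtp: int) -> float:
    """Capture-time offset between a video frame and an audio sample.

    A positive result means the video frame was captured later than the
    audio sample, on the shared wall clock.
    """
    return (rtp_to_wallclock(video_sr, video_rtp)
            - rtp_to_wallclock(audio_sr, audio_rtp))


# Illustrative values only: 8 kHz audio and a 90 kHz video clock.
audio_sr = SenderReport(ntp_seconds=1000.0, rtp_timestamp=160000, clock_rate=8000)
video_sr = SenderReport(ntp_seconds=1000.5, rtp_timestamp=900000, clock_rate=90000)
skew = av_skew_seconds(audio_sr, 168000, video_sr, 954000)
```

A receiver in a point-to-point session could use such an offset to decide which stream to hold back before playout. The sketch also shows why the approach depends on both reports referring to the original capture clocks: once a mixer regenerates timestamps, the pairings no longer describe when the media was captured.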