In video telephony, teleconferencing and multimedia applications, limited bandwidth or storage space typically prevents a video coder from encoding all incoming video frames, because each video frame requires a very substantial number of bits. Instead, the video coder typically drops some frames by subsampling the video at a fraction of the normal rate and encodes only at the resulting low frame rate, which can be as low as one to two frames per second for some applications. This subsampling technique, known as frame skipping, results in jerky motion of the images in the video signal and a loss of synchronization between the video and audio signals. Additionally, because a typical speaking person can enunciate more than ten sounds per second, the positions of the lips, jaws, teeth and tongue change at high rates. Consequently, when frame skipping is employed at sampling rates of only one to two frames per second, most mouth movement during human speech is lost in the video signal. Thus, during teleconferencing, for example, the lip movements of a speaking person (a talking head) typically do not match the words actually spoken.
Studies of human speech perception have demonstrated that human perception of acoustic speech can be affected by the visual cues of lip movements. For example, if a video shows a speaker's mouth saying "ga" but the audio is dubbed with the sound "ba", a viewer/listener frequently understands "da", a completely different message. Similarly, a visual "ga" combined with an audio "pa" is often perceived as "ta", and a visual "da" combined with an audio "ma" is often perceived as "na". This confusion is known as the "McGurk Effect". Thus, it is clear that lip reading is used, to some extent, by most people, even those who are not hearing impaired, to clarify their audio perception, especially when background noise levels are high. Lip reading obviously cannot be used by a listener if the audio speech does not match the video picture of the speaking person. It is thus clear that synchronization of bimodal speech, that is, of the video and audio signals, is an important goal in human perception of speech.
Although various techniques, such as linear interpolation and motion-adaptive interpolation, have been used to smooth out the jerkiness of the images produced by the frame skipping technique typically used in video conferencing, these techniques are unable to reproduce mouth movements. Consequently, these methods do not aid human perception of a teleconferencing signal.
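The limitation described above can be illustrated with a minimal sketch. Everything in it is an assumption chosen for illustration and is not taken from the cited techniques: the 30 frames per second source rate, the scalar "mouth opening" value standing in for a full video frame, the ten-openings-per-second pattern, and the helper names `skip_frames` and `linear_interpolate`. The sketch subsamples the sequence to two frames per second and then rebuilds the missing frames by linear interpolation between the retained frames.

```python
def skip_frames(frames, keep_every):
    """Frame skipping: retain every `keep_every`-th frame and drop the rest."""
    return frames[::keep_every]


def linear_interpolate(kept, keep_every, total):
    """Rebuild `total` frames by linearly blending neighbouring retained frames."""
    rebuilt = []
    last = len(kept) - 1
    for i in range(total):
        left = min(i // keep_every, last)   # retained frame at or before i
        right = min(left + 1, last)         # next retained frame, if any
        # Blend weight in [0, 1); zero once the last retained frame is reached.
        t = (i - left * keep_every) / keep_every if right != left else 0.0
        rebuilt.append((1 - t) * kept[left] + t * kept[right])
    return rebuilt


# Hypothetical one-second clip at 30 frames per second.
# Each value is a mouth opening: 0.0 = closed, 1.0 = fully open.
# The mouth opens ten times in the second (assumed pattern).
mouth = [0.0, 1.0, 0.5] * 10

# Keep one frame in fifteen, i.e. two frames per second.
kept = skip_frames(mouth, 15)

# Smooth the result back to 30 frames by linear interpolation.
rebuilt = linear_interpolate(kept, 15, len(mouth))

print("largest opening in source: ", max(mouth))
print("largest opening after skip:", max(rebuilt))
```

In this assumed clip both retained frames happen to show a closed mouth, so the interpolated sequence shows a closed mouth throughout: interpolation smooths the transition between retained frames but cannot recover movement that occurred entirely between them.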
Techniques have been developed to create an animated video representation of human lips on a face based upon an audio speech signal. See, for example, Lavagetto, "Converting Speech into Lip Movements: A Multimedia Telephone for Hard of Hearing People", IEEE Transactions On Rehabilitation Engineering, Vol. 3, No. 1, March 1995, pp. 90-102; Morishima et al., "An Intelligent Facial Image Coding Driven by Speech and Phoneme", I.C.A.S.S.P. '89, pp. 1795-8; Chen et al., "Speech-Assisted Video Processing: Interpolation and Low-Bitrate Coding", 28th Asilomar Conference, Pacific Grove, October 1994; and AT&T U.S. patent application Ser. No. 08/210,529. Methods have also been developed to isolate a human face from a background video picture and then to locate the lips on the face. See, for example, Rao et al., "On Merging Hidden Markov Models With Deformable Templates", I.C.I.P. '95, October 1995. These references, the disclosures of which are incorporated herein by reference, do not disclose or suggest a method or apparatus for transmitting a video signal in real-time synchronization with an audio signal to display a human speaker's lips without any frame skipping in the video signal.