Speech that is synthetically produced by a machine, such as a computer, is useful for transforming written text into audible speech, for example for automated reading to a person who is unable to read or for automated phone answering systems.
For some applications, it may be desirable to add a dynamic visual image, such as a facial image on a display device, to the audible speech, such that the facial image appears to be speaking. Such an image would be particularly useful in a speech-reading and learning device, especially where mouth posture is important, such as when teaching hearing-impaired individuals to lip-read. Other uses could be in computer interfaces, animation, audio-visual presentations, or interactive kiosks (such as automated bank tellers, ticket booths, automated flight planners, information booths in hotels or office buildings, and the like).
An important problem encountered when adding a facial image to audible speech is making the audio-visual speech realistic. Creating a realistic appearance requires synchronizing the video portion of the speech, particularly the mouth and lips of the facial image, with the sound. One conventional method for synchronizing the video and audio portions of the synthetic speech is the "frame-by-frame" technique described below.
Unfortunately, the frame-by-frame method has limited usefulness for real-time applications. Even with a very powerful computer, the frame-by-frame method can produce only simple, crude face configurations that lack detail in the facial expression. Additionally, in a real-time system, events may cause the audio and video portions of the synthetic speech to become unsynchronized. This lack of synchronization may increase with time and, if not corrected, may become very noticeable and annoying. Conventional audio-visual speech systems lack the ability to resynchronize in real time, sometimes called "adaptive synchronization."