Many types of lip synchronization software are currently available. One type inputs an image of a person and a sequence of phonemes and outputs a sequence of images of the person with their lip movement synchronized to the phonemes. When the audio of the phonemes (e.g., via an enunciator) is output simultaneously with the sequence of images, the person appears to be speaking the audio; such an animated image is sometimes referred to as a “talking head.” Another type of lip synchronization software additionally inputs expressions and adjusts the image of the person to reflect those expressions. For example, the expressions may reflect sadness, happiness, worry, surprise, fright, and so on. Lip synchronization software may use morphing techniques to transition smoothly between phonemes and between different expressions. For example, a change in expression from sad to happy may occur gradually over a two-second interval, rather than abruptly from one update of the image to the next.
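The morphing described above can be illustrated with a minimal sketch. The function below linearly interpolates between two sets of expression weights over a fixed interval, as in the two-second sad-to-happy example; the weight vectors, frame rate, and function name are illustrative assumptions, not part of any particular lip synchronization product.

```python
def morph_expression(start, end, duration_s=2.0, fps=30):
    """Yield per-frame expression weights interpolating from start to end.

    start, end -- lists of expression weights (hypothetical, e.g.
    [sad_weight, happy_weight]); duration_s and fps control how many
    intermediate frames are produced, so the change appears gradual
    rather than occurring in a single image update.
    """
    frames = int(duration_s * fps)
    for i in range(frames + 1):
        t = i / frames  # interpolation parameter, 0.0 -> 1.0
        yield [s + t * (e - s) for s, e in zip(start, end)]


sad = [1.0, 0.0]    # fully sad, not happy
happy = [0.0, 1.0]  # fully happy, not sad
sequence = list(morph_expression(sad, happy))
```

At 30 frames per second, the two-second transition yields 61 weight vectors, beginning at the sad expression and ending at the happy one; a renderer would apply each vector to the image in turn.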
Lip synchronization software has been used in many applications, including games and Internet communications. Game applications may provide images of the game's characters along with their voices. The voice of a character may be augmented with lip movement instructions that indicate how the lips are to move to correspond to the voice. When a character of the game is to speak, the game provides the lip synchronization software with the lip movement instructions (which may be represented by phonemes) along with an image of the character. The lip synchronization software then controls the display of the character with lips synchronized to the voice. Internet communication applications have used lip synchronization software to display a talking head representing a person who is speaking remotely. As the person speaks, corresponding lip movement instructions may be transmitted along with the voice to the computer systems of the listeners. The lip movement instructions can be created in various ways: they can be derived from analysis of the person's actual lip movement, or they can be a sequence of phonemes derived from the voice. A listener's computer system can display an image of the person (or a caricature of the person) with the lips synchronized to the voice based on the lip movement instructions. Sending lip movement instructions requires significantly less bandwidth than sending a video of the person. Thus, lip synchronization software can be used in situations where sending video is not practical.
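The bandwidth advantage can be made concrete with a rough sketch. Below, one second of speech is represented as a timed phoneme stream and serialized; the phoneme labels, timestamps, and the per-second video size are all illustrative assumptions chosen only to show the order-of-magnitude difference, not measurements of any real system.

```python
import json

# Hypothetical timed lip movement instructions for one second of speech:
# each entry pairs a timestamp (seconds) with a phoneme label.
instructions = [
    {"t": 0.00, "ph": "HH"},
    {"t": 0.12, "ph": "EH"},
    {"t": 0.25, "ph": "L"},
    {"t": 0.40, "ph": "OW"},
]
payload = json.dumps(instructions).encode("utf-8")

# Illustrative assumption: even low-quality video consumes on the order
# of 100 KB per second, versus well under 200 bytes for the instructions.
video_bytes_per_second = 100_000
ratio = video_bytes_per_second / len(payload)
```

Even with this generous textual encoding, the instruction stream is hundreds of times smaller than the video it replaces, which is why transmitting instructions (and rendering the talking head locally) is attractive when sending video is not practical.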
Typical applications that use lip synchronization software identify lip movement instructions either automatically as a person speaks or manually as specified by a developer of the application. Some applications may automatically generate lip movement instructions and then allow for manual modification of the instructions to achieve a desired effect.
It would be desirable to have a system that automatically generates a talking head based on text, rather than voice, received in real time. Text is generated in real time in many environments, such as closed-captioned text of television broadcasts, text entered via a keyboard during an Internet chat or instant messaging session, text generated by a stenographer, and so on.