Computer animation has come into widespread use for a variety of applications. One such application is character animation. For example, a game program may present an animated character for entertainment, or an educational program may include an animated teacher character. In addition, animated characters are a useful part of social interfaces that present an interactive interface with human qualities. For instance, an animated character may appear on a computer display to help a user having difficulty completing a function or to answer questions. The character's creators may give it certain human traits reflected in gestures and other behavior, and the character may be programmed to react to actions by the user.
A challenge facing computer animators is presenting a convincing animation. One element of this challenge involves presenting a speaking character. Sound output for the character can be sent to a sound device such as a computer speaker. At the same time, the animation performs some accompanying activity, such as moving the character's mouth or displaying the text of the spoken words in a word balloon like those in a newspaper comic strip. The appearance of words in the balloon can be paced to provide a closed-captioning effect. In this way, the user is presented with the illusion that the character on the display is actually speaking the words sounded from the computer speaker.
However, to create a compelling simulation of a speaking character, the character's mouth should be synchronized with the audio output. Part of the human communication experience includes receiving visual cues from whoever is speaking. If a character's mouth movement does not match the spoken words, the user will not experience a realistic presentation of the character. Instead, the animation is much like a foreign film in which the spoken translation is dubbed over the original sound track. In addition, if the appearance of the words in the character's word balloon is not properly paced with the character's speech, the resulting presentation can be confusing. Poor quality animation reduces the effectiveness of the character presentation. This can be especially troublesome if the character is being used as part of a social interface that is based on presenting a convincing simulation of an interactive speaking character. A social interface can be a useful tool for placing the computer user at ease and for assisting the user with unfamiliar tasks. However, a confusing character presentation defeats the purpose of a social interface.
When animation is done without a computer, synchronization is accomplished by an animator who draws each frame of the animated character to reflect an appropriate mouth shape. Inappropriate frames in an animation are usually perceptible to the viewer and result in an inferior animation. Therefore, the animator is typically a highly skilled, well-compensated professional. In addition, the process can be time-consuming, as the animator often reviews the animation a small portion at a time to craft appropriate mouth shapes in each animation frame.
With the advent of computer animation systems, various tools have become available to professional animators to assist in the animation process. However, even with the aid of a computer, the professional animator still reviews and edits the animation a small portion at a time to ensure an appropriate mouth shape reflects what is being spoken in the recorded speech. Although the computer can provide some useful features, a great deal of work is still required by the animator, adding considerably to development costs. Further, computer software typically undergoes multiple revisions during its life cycle. Repeatedly involving the professional animator in each revision can become prohibitively expensive.
To avoid the expenses related to the labor-intensive task of the animator, some software developers have addressed the problem of mouth synchronization by using the amplitude of the accompanying recorded speech to control mouth movement. Throughout the animation, the size of the character's mouth opening is adjusted to match the amplitude of the speech sounded from the computer's speaker. However, this approach has the drawback of inaccurately depicting the character's mouth in many instances. For example, the amplitude of an aspirated sound such as the "h" in "hello" is typically very low. Accordingly, based on amplitude, a closed mouth might be displayed when the "h" sound is spoken, even though the human mouth must be open in order to pronounce the "h" sound. Similar problems exist for other sounds. As a result, this approach has not led to high-quality presentations of animated characters.
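The drawback of amplitude-driven mouth selection can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the four discrete mouth openings, and the synthetic sample frames do not come from any particular product.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of samples normalized to [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mouth_openness(samples, levels=4):
    """Map amplitude to a mouth opening: 0 = closed, levels - 1 = widest."""
    ratio = min(rms(samples), 1.0)
    return min(int(ratio * levels), levels - 1)

# Hypothetical frames: a quiet aspirated "h" followed by a loud vowel.
h_frame = [0.02 * math.sin(i) for i in range(200)]
vowel_frame = [0.9 * math.sin(i * 0.1) for i in range(200)]

h_mouth = mouth_openness(h_frame)          # quiet frame selects the closed mouth
vowel_mouth = mouth_openness(vowel_frame)  # loud frame selects an open mouth
```

Because the selection sees only loudness, the quiet "h" frame maps to the closed mouth even though a speaker's mouth is open at that moment.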
Another approach to solving the synchronization problem is to use a synthetic voice generated by a text-to-speech ("TTS") software engine as the speech sound for the character animation. A TTS engine can output a synthetic voice based on a text string. For instance, if supplied with the text "hello," the TTS engine will produce a voice speaking the word "hello." As the TTS engine generates output, a system can select appropriate mouth shapes for use in the animation. The result is animation in which the character's mouth movement is synchronized with the synthetic voice. However, due to various limitations associated with synthetic voices, the output falls short of the quality available from professional human vocal talent. Thus, the TTS approach does not result in high-quality animated speaking characters. In addition, one of the features of a social interface is to put the user at ease by presenting human characteristics in the animated character. Typically, the user perceives that a synthetic voice is that of a machine lacking familiar human characteristics. As a result, the TTS approach fails to offer the convincing presentation needed for a social interface.
The invention provides a method and system for synchronizing computer output or processing with recorded speech. The invention is particularly suited to synchronizing the animation of a character with recorded speech while avoiding the problems described above. Although the synchronization can be performed without a professional animator, the resulting animation is of the high quality necessary for a compelling presentation of a speaking character. The invention can also be used to synchronize other computer output with recorded speech. For example, a background color or background scene can be changed based on an event in the recorded speech.
In one implementation, a system synchronizes the animation of a character with recorded speech in the form of speech sound data. The system includes a sound file tool, a speech recognition engine, and a file player. The sound file tool acquires the speech sound data and a text of the speech sound data. The speech recognition engine analyzes the speech sound data and the text to determine linguistic event values and time values. A linguistic event value indicates a linguistic event in the speech sound data, such as a spoken phoneme, a spoken word, or some other event. A time value indicates when the linguistic event occurs within the speech sound data. The sound file tool annotates the speech sound data with these values to create a linguistically enhanced sound file.
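One way to picture the linguistically enhanced sound file is as the sound data paired with a time-ordered list of events. The Python sketch below is only an illustration: the class names, field names, and the `annotate` helper are hypothetical, and the `alignment` argument stands in for whatever a speech recognition engine would actually return.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

@dataclass(frozen=True)
class LinguisticEvent:
    kind: str     # "phoneme" or "word" (other event kinds are possible)
    value: str    # e.g. "HH" or "hello"
    time_ms: int  # offset of the event from the start of the sound data

@dataclass
class EnhancedSound:
    sound_data: bytes
    text: str
    events: List[LinguisticEvent] = field(default_factory=list)

def annotate(sound_data: bytes, text: str,
             alignment: Iterable[Tuple[str, str, int]]) -> EnhancedSound:
    """Attach (kind, value, time_ms) tuples to the sound data, ordered by time.

    The alignment is assumed to come from a recognition engine; it is
    supplied by hand here.
    """
    events = sorted((LinguisticEvent(*a) for a in alignment),
                    key=lambda e: e.time_ms)
    return EnhancedSound(sound_data, text, events)

# Hand-written alignment for the word "hello" (illustrative timings).
enhanced = annotate(b"\x00\x00" * 100, "hello", [
    ("phoneme", "EH", 80),
    ("word", "hello", 0),
    ("phoneme", "HH", 0),
])
```

Sorting by time value at annotation time keeps playback simple, since a player can then walk the list once from start to finish.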
When the character is animated, the file player plays the linguistically enhanced sound file to produce sound output and send information to the animation. The information includes events specifying that the animation perform some action to indicate the linguistic event at a time indicated by the time value. For example, a particular mouth shape associated with a spoken phoneme could be presented in a frame of the character animation or the text of a spoken word could be presented in the character's word balloon. The result is a synchronized animation of a quality superior to that produced by amplitude-based mouth shape selection.
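The playback step can be sketched as a loop that advances a clock and hands each event to the animation once its time value has been reached. This is a simplified Python illustration under stated assumptions: the clock is simulated rather than tied to a sound device, and the `play` function, the `Animation` class, and the mouth-shape table are hypothetical names.

```python
import bisect

# Hypothetical phoneme-to-mouth-shape table; the labels are illustrative only.
MOUTH_SHAPES = {"HH": "open_slight", "EH": "open_mid",
                "L": "tongue_up", "OW": "rounded"}

def play(events, duration_ms, tick_ms, handler):
    """Fire each (time_ms, kind, value) event on the first tick at or after it.

    events must be sorted by time_ms.
    """
    times = [t for t, _, _ in events]
    next_idx = 0
    for now in range(0, duration_ms + 1, tick_ms):
        end = bisect.bisect_right(times, now)
        for _, kind, value in events[next_idx:end]:
            handler(now, kind, value)
        next_idx = end

class Animation:
    """Records the mouth shapes shown and the words added to the balloon."""
    def __init__(self):
        self.frames = []
        self.balloon = []

    def on_event(self, now_ms, kind, value):
        if kind == "phoneme":
            self.frames.append((now_ms, MOUTH_SHAPES.get(value, "neutral")))
        elif kind == "word":
            self.balloon.append(value)

events = [(0, "word", "hello"), (0, "phoneme", "HH"), (80, "phoneme", "EH"),
          (150, "phoneme", "L"), (230, "phoneme", "OW")]
anim = Animation()
play(events, duration_ms=300, tick_ms=50, handler=anim.on_event)
```

With a 50 ms tick, an event timed at 80 ms is shown on the 100 ms frame, so the tick length bounds how far the displayed mouth shape can lag the sound.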
In addition, since a human voice is used, the quality of the sound output is superior to that produced by a TTS-based synthetic voice, and the invention provides a compelling illusion of a speaking character. Since the process of acquiring linguistic information such as phoneme and word break data is automated, the process can be performed by a user who is unfamiliar with the art of animation.
Another aspect of the invention is a system for editing the linguistic event values and time values. This system is implemented in a sound editing tool that provides a user interface displaying a graphical representation of a sound wave representing recorded speech. The tool enables the user to edit the timing information to improve performance. Thus, the invention might also be useful to a professional animator. In a further aspect of the invention, the linguistic information and sound data can be combined into a single enhanced sound file, providing ease of distribution and use. In addition, the file can be constructed so that it can be played with a player capable of playing the original sound data, providing compatibility.
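One possible way to obtain a single file that remains playable by an ordinary player is to store the linguistic information in an extra chunk of a RIFF/WAVE file, since chunk-based readers commonly skip chunks they do not recognize. The text above does not name a file format, so the Python sketch below is an assumption throughout: the `ling` chunk identifier, the JSON payload, and the helper names are all hypothetical.

```python
import io
import json
import struct
import wave

CHUNK_ID = b"ling"  # hypothetical chunk identifier, not a registered one

def make_wav(frames: bytes, rate: int = 8000) -> bytes:
    """Build a mono 16-bit WAV file in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(frames)
    return buf.getvalue()

def embed_events(wav_bytes: bytes, events) -> bytes:
    """Append the events as an extra chunk and fix up the RIFF size field."""
    payload = json.dumps(events).encode("utf-8")
    pad = b"\x00" if len(payload) % 2 else b""  # chunks are word-aligned
    chunk = CHUNK_ID + struct.pack("<I", len(payload)) + payload + pad
    body = wav_bytes[8:] + chunk
    return b"RIFF" + struct.pack("<I", len(body)) + body

def extract_events(enhanced: bytes):
    """Walk the chunk list and return the embedded events, or None."""
    pos = 12  # skip "RIFF", the size field, and "WAVE"
    while pos + 8 <= len(enhanced):
        chunk_id = enhanced[pos:pos + 4]
        (size,) = struct.unpack("<I", enhanced[pos + 4:pos + 8])
        if chunk_id == CHUNK_ID:
            return json.loads(enhanced[pos + 8:pos + 8 + size].decode("utf-8"))
        pos += 8 + size + (size % 2)
    return None

plain = make_wav(b"\x00\x00" * 100)
enhanced = embed_events(plain, [[0, "word", "hello"], [0, "phoneme", "HH"]])
```

In this sketch, Python's standard `wave` reader opens the enhanced bytes just as it opens the plain file, while `extract_events` recovers the annotations for a player that understands them.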
In another aspect of the invention, programming interfaces in the system are arranged to reduce the costs of prototyping. The enhanced sound file player is arranged so that it has an interface to the animation controller that is compatible with the interface of a TTS-based animation system. In this way, the character's actions and speech can be prototyped using the inexpensive TTS option, supplying plain text instead of a recorded human voice. The TTS engine generates a synthetic voice and provides data for synchronizing the character's mouth. The synthetic voice is often acceptable for prototyping purposes. When the development is in the final phases, an enhanced sound file can be generated with professional vocal talent. The enhanced sound file can be easily integrated into the character because the TTS engine and the enhanced sound file player use compatible interfaces. In this way, professional vocal talent need not be employed throughout the entire development process, reducing development costs.
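The benefit of compatible interfaces can be sketched as two interchangeable event sources behind one shared method. This is a hypothetical Python illustration, not the actual programming interfaces: `SpeechSource`, `StubTtsSource`, `EnhancedFileSource`, and `animate` are invented names, and the TTS stand-in merely spaces words evenly instead of synthesizing a voice.

```python
from typing import Callable, List, Protocol, Tuple

Event = Tuple[int, str, str]  # (time_ms, kind, value)

class SpeechSource(Protocol):
    """Shared interface: push timed events to a sink supplied by the animation."""
    def speak(self, sink: Callable[[Event], None]) -> None: ...

class StubTtsSource:
    """Stand-in for a TTS engine, used while prototyping from plain text."""
    def __init__(self, text: str, ms_per_word: int = 300):
        self.text = text
        self.ms_per_word = ms_per_word

    def speak(self, sink: Callable[[Event], None]) -> None:
        for i, word in enumerate(self.text.split()):
            sink((i * self.ms_per_word, "word", word))

class EnhancedFileSource:
    """Stand-in for the enhanced sound file player, used for final recordings."""
    def __init__(self, events: List[Event]):
        self.events = sorted(events)

    def speak(self, sink: Callable[[Event], None]) -> None:
        for event in self.events:
            sink(event)

def animate(source: SpeechSource) -> List[str]:
    """Animation-side code: fills the word balloon from whichever source is given."""
    balloon: List[str] = []

    def sink(event: Event) -> None:
        _, kind, value = event
        if kind == "word":
            balloon.append(value)

    source.speak(sink)
    return balloon

prototype = animate(StubTtsSource("hello there"))
final = animate(EnhancedFileSource([(420, "word", "there"), (0, "word", "hello")]))
```

Because `animate` depends only on the shared `speak` method, swapping the prototype source for the recorded one requires no change to the animation code.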
Further features and advantages of the invention will become apparent with reference to the following detailed description of illustrated embodiments that proceeds with reference to the accompanying drawings.