The present invention relates to animation synthesis technology, and more specifically, to text-to-articulatory movement conversion.
In animation production, mouth animation synthesis for a character is important. It is desirable that the mouth movement of the character, when speaking a text, be consistent with that text. In current animation production, the mouth movement usually covers the opening and closing of the mouth and the activity of the muscles around the mouth, all of which are two-dimensional movement information of the mouth. The mouth movement is usually represented by a number of mouth features.
In mouth animation synthesis, a pre-trained text-to-mouth-feature model may be used to predict the mouth features from a text. The predicted mouth features may then be used to reconstruct a mouth image. Because the mouth features describe only two-dimensional mouth movement, the reconstructed mouth image may not match the text accurately.
In addition, the text-to-mouth-feature model is usually monolingual. When the text is in a language different from that of the model, the text may first be translated into the language of the model, and the translated text is then used to predict the mouth features. Because the mouth features are predicted from the translated text rather than from the original text, the synthesized mouth movement may not be consistent with the original text.