1. Field of the Invention
The present invention generally relates to speech driven lip synthesis, and more particularly to the use of Hidden Markov Models (HMMs) to animate lip movements from acoustic speech.
2. Background Description
Visual speech synthesis has become an area of interest due to its possible application in such areas as a) solving bandwidth problems where audio is available but little or no video is available, e.g. videoless uplinking, b) movie dubbing, c) synthetic video conferencing, d) language learning using three dimensional animation of articulators, and e) distance learning.
Researchers in the prior art have applied an approach to solve this problem which assumes that speech is a linguistic entity made of small units of acoustic speech or "phonemes". Speech is first segmented into a sequence of phonemes, and then each phoneme is mapped to a corresponding unit of visual speech (generally a distinct lip shape) or "viseme". This approach has been applied using a variety of methods, in particular vector quantization, direct estimation, and the Hidden Markov Model (HMM).
Vector Quantization Methods
In this approach, the acoustic parameters (for example, cepstral coefficient vectors) are divided into a number of classes using vector quantization. For each acoustic class, the corresponding visual code words are averaged to produce a visual centroid. Each acoustic vector is classified using the optimal acoustic vector quantizer and then mapped to the corresponding visual centroid. The drawback of this approach is that, because of the distinct output levels, it produces a staircase-like output rather than a smooth output.
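The mapping described above can be sketched in a few lines. This is a minimal illustration, not the method of any cited work: it assumes the acoustic codebook has already been trained, uses Euclidean distance for classification, and all names (`nearest`, `visual_centroids`, `synthesize`) and numbers are invented for the example.

```python
import math

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda k: math.dist(codebook[k], vec))

def visual_centroids(codebook, acoustic, visual):
    """Average the visual vectors of the training frames in each acoustic class."""
    dim = len(visual[0])
    sums = [[0.0] * dim for _ in codebook]
    counts = [0] * len(codebook)
    for a, v in zip(acoustic, visual):
        k = nearest(codebook, a)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += v[d]
    # Classes with no training frames get no centroid.
    return [[s / c for s in row] if c else None for row, c in zip(sums, counts)]

def synthesize(codebook, centroids, acoustic):
    """Map each acoustic vector to the visual centroid of its class."""
    return [centroids[nearest(codebook, a)] for a in acoustic]

# Toy data: 2-D acoustic vectors, 1-D visual parameter (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0]]
train_acoustic = [[0.1, 0.0], [0.0, 0.2], [0.9, 1.0], [1.1, 0.8]]
train_visual = [[2.0], [4.0], [10.0], [12.0]]

centroids = visual_centroids(codebook, train_acoustic, train_visual)
output = synthesize(
    codebook, centroids, [[0.0, 0.1], [0.4, 0.4], [0.6, 0.6], [1.0, 0.9]]
)
print(output)
```

Although the input in this toy run moves gradually from one codebook entry to the other, the output takes only two distinct values, which is the staircase effect noted above.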
Direct Estimation
In this approach, the best estimate of the visual parameters is derived directly from the joint statistics of the audio and visual parameters. Let $f_{\alpha\nu}(\alpha,\nu)$ denote the joint distribution of the feature vector $[\alpha^T,\nu^T]^T$ comprising the acoustic features and the visual parameters. If we know the joint probability density function (pdf) $f_{\alpha\nu}(\alpha,\nu)$, then the optimal estimate of $\nu$ for a given $\alpha$ is calculated as follows:

$$E[\nu \mid \alpha] = \int \nu \, \frac{f_{\alpha\nu}(\alpha,\nu)}{f_{\alpha}(\alpha)} \, d\nu$$
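As a worked illustration of this estimate, the sketch below evaluates the conditional expectation of the visual parameter given the acoustic feature for a discrete analogue of the joint distribution, where the integral becomes a sum. The table `joint` and the function name `conditional_mean` are assumptions made for the example; a real system would estimate a continuous joint pdf from training data.

```python
def conditional_mean(joint, a):
    """E[v | a] for a discrete joint table joint[(a, v)] = probability."""
    # Marginal probability of the acoustic value a (the denominator).
    marginal = sum(p for (ai, _), p in joint.items() if ai == a)
    if marginal == 0:
        raise ValueError("acoustic value has zero probability")
    # Sum of v * joint(a, v), divided by the marginal.
    return sum(v * p for (ai, v), p in joint.items() if ai == a) / marginal

# Hypothetical joint probabilities over (acoustic class, visual value).
joint = {
    (0, 1.0): 0.25, (0, 3.0): 0.25,
    (1, 2.0): 0.125, (1, 6.0): 0.375,
}

print(conditional_mean(joint, 0))  # 2.0
print(conditional_mean(joint, 1))  # 5.0
```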
HMM Based Approach
HMMs have been used in speech recognition for a long time, as described by Frederick Jelinek in Statistical Methods for Speech Recognition (MIT Press, 1998). Their use in lip synthesis has also been proposed by several researchers. The main idea behind using the HMM for lip synthesis is to define a mapping between the HMM states and image parameters. See T. Chen and Ram R. Rao, "Audio-Visual Integration in Multimodal Communication" in Proceedings of the IEEE, May 1998. HMMs represent phonemes, but during the training phase an association is developed between the HMM states and the visual features. The training speech database is aligned into an HMM state sequence using Viterbi alignment. For a given HMM state, the visual features of the corresponding frames are averaged and assigned to the state. During the synthesis phase, the input speech is aligned to the HMM state sequence by the Viterbi alignment. The image parameters associated with HMM states are retrieved during the Viterbi alignment and then this sequence of image parameters is animated. See Eli Yamamoto, Satoshi Nakamura and Kiyohiro Shikano, "Lip Movement Synthesis from Speech Based on Hidden Markov Models," AVSP, 1997.
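The training and synthesis steps described above can be sketched as follows. The sketch assumes the Viterbi alignment has already produced one HMM state label per frame; the state names (`m_1`, `a_1`, `a_2`), the feature values, and the function names are hypothetical and are not taken from the cited papers.

```python
from collections import defaultdict

def train_state_visuals(state_alignment, visual_frames):
    """Average the visual feature vectors of the frames aligned to each state."""
    buckets = defaultdict(list)
    for state, features in zip(state_alignment, visual_frames):
        buckets[state].append(features)
    # Component-wise mean of the vectors collected for each state.
    return {
        state: [sum(col) / len(col) for col in zip(*frames)]
        for state, frames in buckets.items()
    }

def synthesize(state_sequence, state_visuals):
    """Retrieve the image parameters for an aligned state sequence."""
    return [state_visuals[state] for state in state_sequence]

# Hypothetical training alignment: one state label per video frame.
alignment = ["m_1", "m_1", "a_1", "a_1", "a_2"]
visuals = [[0.0, 1.0], [0.0, 3.0], [4.0, 2.0], [6.0, 2.0], [8.0, 5.0]]

state_visuals = train_state_visuals(alignment, visuals)
animation = synthesize(["m_1", "a_1", "a_2", "a_2"], state_visuals)
print(animation)
```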
There are several problems with this approach. Because the number of distinguishable lip shapes is smaller than the number of phonemes, there is some redundancy involved when using phoneme based HMMs for the synthesis. For example, during the alignment phase a maximum likelihood computation is performed for each possible succeeding phoneme or phoneme sequence, which may not be necessary. Also, one cannot make use of context-dependency in the visual domain during alignment of phoneme based HMMs. Further, if phonemes which are visually different but acoustically similar are confused in the synthesis phase, they will produce out of sequence visemes during the animation.
The advantages of first segmenting speech into a sequence of phonemes and then mapping each phoneme to a corresponding viseme, using any of the above implementation methods, are as follows:
1) the acoustic speech signal is explored fully so that all the context information is utilized and co-articulations (i.e. the change in the utterance of a sound caused by the preceding and/or succeeding sounds in a given sound sequence) are completely incorporated in the speech recognition model for recognizing phonemes, which are then mapped to corresponding visemes for visual speech synthesis, and
2) it provides the most precise speech analysis.
However, this approach has a number of disadvantages. First, one needs to recognize the spoken words or sentences and a phoneme to viseme mapping is required. This involves an unnecessary additional computational overhead because it is not really necessary to recognize the spoken words or sentences in order to synthesize lip movements. Second, alignment errors that occur during acoustic alignment of the speech signal can cause discontinuity in the synthesized visual sequence. For example, the acoustically similar nasal tones "m" and "n" can cause discontinuity based on phoneme confusion even though they are visually distinct. Third, more training data is required to train a speech recognizer based on phones since the number of phones (greater than fifty) is greater than the number of visemes.
It is therefore an object of the present invention to provide a model for lip synthesis based directly on the visemes.
A further object of the invention is to reduce the time required to find the most likely viseme sequence in lip synthesis systems.
Another object of the invention is to enable use of visual domain context dependency in determining the most likely viseme sequence.
It is also an object of the invention to reduce the size of training data required to train a system for lip synthesis.
The approach implemented by the present invention considers speech as a physical phenomenon and assumes that there is a functional relationship between the speech parameters and the visual parameters. So in this case the goal is to find the best functional approximation given sets of training data. This can be achieved using the implementation methods of vector quantization, neural networks, and the Hidden Markov Model (HMM) with Gaussian Mixtures. This approach considers only the relationship between the audio signal and the lip shapes rather than what was actually spoken. This method needs only the relationship between the speech parameters and visual parameters for visual speech synthesis, and therefore requires less computational overhead than the prior art methods. So in the approach taken by the invention the recognition of spoken words or sentences need not be performed at all.
Earlier approaches solved the problem of lip synthesis by first recognizing the speech using phoneme based HMMs and then converting these phonemes into corresponding lip shapes (visemes). The present approach uses viseme based training systems instead of phoneme based training systems. In this approach it is not necessary to differentiate among those phonemes which look similar visually. Since the number of visemes is much smaller than the number of phonemes, the dimensionality of the space in which the system works is reduced. This results in reduced requirements for computation and training data.
The method of the invention synthesizes lip movements from speech acoustics by first grouping phonemes into a sequence of distinct visemes, and then applying this correspondence to new audio data to generate an output viseme sequence. The grouping can be accomplished by generating visemes from video data and then grouping audio data according to each viseme. The grouping can also be accomplished by generating phonemes from audio data and creating a mapping of phonemes to visemes.
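Both grouping routes can be illustrated with a short sketch. The phoneme-to-viseme table below is an assumption chosen for illustration (the viseme labels and class membership are not taken from this specification), and the function names are hypothetical.

```python
# Hypothetical many-to-one table: several phonemes share one lip shape.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "n": "V_alveolar", "t": "V_alveolar", "d": "V_alveolar",
    "aa": "V_open",
}

def to_visemes(phonemes):
    """Second route: map a phoneme sequence to a viseme sequence."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]

def group_frames_by_viseme(viseme_labels, audio_frames):
    """First route: collect audio frames under the viseme seen in the video."""
    groups = {}
    for label, frame in zip(viseme_labels, audio_frames):
        groups.setdefault(label, []).append(frame)
    return groups

print(to_visemes(["m", "aa", "p"]))

# Frame-level viseme labels from video, paired with toy audio feature frames.
labels = ["V_bilabial", "V_open", "V_open", "V_bilabial"]
frames = [[0.1], [0.7], [0.8], [0.2]]
print(group_frames_by_viseme(labels, frames))
```

With this table, "p" and "m" need not be told apart because they share a viseme, while "m" and "n" fall into different classes.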
In the HMM implementation (which can be used with each grouping methodology), HMM state probabilities are generated from input speech which has been aligned according to the viseme sequence. These HMM state probabilities are applied to acoustic speech input, aligning the input with a most likely viseme HMM state sequence, which is then used to animate lip movements. In the neural network implementation, a neural network is trained to recognize a correspondence between visemes and the underlying audio data, and then the network is used to produce viseme output from new audio input.
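The alignment of acoustic input with a most likely viseme HMM state sequence can be sketched with a small log-domain Viterbi decoder. This is a simplified illustration under stated assumptions: each viseme is modeled by a single state, each state emits a one-dimensional acoustic feature from a single Gaussian rather than a Gaussian mixture, and the state names, probabilities, and observations are all invented.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(obs, states, log_start, log_trans, emit):
    """Most likely state sequence for obs; emit[s] = (mean, variance)."""
    score = [{s: log_start[s] + log_gauss(obs[0], *emit[s]) for s in states}]
    back = []
    for x in obs[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in states:
            # Best predecessor state for landing in s at this frame.
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            cur[s] = prev[best] + log_trans[best][s] + log_gauss(x, *emit[s])
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical two-viseme model: "closed" and "open" lips.
states = ["closed", "open"]
log_start = {s: math.log(0.5) for s in states}
log_trans = {
    "closed": {"closed": math.log(0.9), "open": math.log(0.1)},
    "open": {"closed": math.log(0.1), "open": math.log(0.9)},
}
emit = {"closed": (0.0, 1.0), "open": (4.0, 1.0)}

path = viterbi([0.1, -0.2, 3.9, 4.2, 0.3], states, log_start, log_trans, emit)
print(path)  # ['closed', 'closed', 'open', 'open', 'closed']
```

The decoded state sequence would then index the lip shapes to animate, one per frame.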