An audio-to-video engine is a software program that generates a video of facial movements (e.g., a virtual talking head) from input speech audio. An audio-to-video engine may be useful in multimedia communication applications, such as video conferencing, as it can generate video in environments where direct video capture is either unavailable or places an undesirable burden on the communication network. The audio-to-video engine may also be useful for increasing the intelligibility of speech.
In prior implementations, audio-to-video methods generally apply maximum likelihood estimation (MLE)-based conversion processes to a Gaussian Mixture Model (GMM) to estimate video feature vectors given a set of audio feature vectors. However, the MLE-based conversion processes typically introduce conversion errors, since an audiovisual GMM trained to maximize likelihood on the training data does not necessarily yield converted visual trajectories that minimize the error perceived by human viewers.
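As a rough illustration of GMM-based audio-to-video conversion, the following minimal sketch maps a single audio feature vector to a video feature vector using the conditional mean under a joint audiovisual GMM. It is a simplified per-frame stand-in, not the trajectory-level MLE conversion referred to above, and it assumes the GMM parameters have already been trained. The function names, argument layout, and the convention that the audio features occupy the first dx dimensions of each joint vector are assumptions made for illustration only.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate normal evaluated at x."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def convert_frame(x, weights, means, covs, dx):
    """Estimate a video feature vector from one audio feature vector x.

    Hypothetical layout (an assumption, not from the source):
      weights: (M,)               mixture weights
      means:   (M, dx + dy)       joint means, audio part first
      covs:    (M, dx+dy, dx+dy)  joint covariances
    """
    num_mix = len(weights)
    log_post = np.empty(num_mix)
    cond_means = []
    for m in range(num_mix):
        mu_x, mu_y = means[m, :dx], means[m, dx:]
        s_xx = covs[m, :dx, :dx]   # audio covariance block
        s_yx = covs[m, dx:, :dx]   # video/audio cross-covariance block
        # Unnormalized log posterior of component m given the audio frame.
        log_post[m] = np.log(weights[m]) + gaussian_logpdf(x, mu_x, s_xx)
        # Conditional mean of the video features for component m.
        cond_means.append(mu_y + s_yx @ np.linalg.solve(s_xx, x - mu_x))
    # Normalize posteriors in a numerically stable way.
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Posterior-weighted average of the per-component conditional means.
    return post @ np.array(cond_means)
```

Because the mixture parameters in such a scheme are fitted to maximize likelihood rather than to minimize conversion error, the output of a mapping like this can deviate from the target video features even when the model fits the training data well, which is the limitation described above.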