1. Field
Aspects of the present innovations relate to systems and methods of audio-visual communication using portable devices. For example, communication may be facilitated via a virtual 3D animated humanoid driven by voice pattern recognition techniques applied to an audio channel.
2. Description of Related Information
Today, many interactive systems use graphical implementations of talking faces to communicate with the end user. For example, applications such as answering machines, electronic storytellers and virtual reality have gained more attention by playing the voice synchronized with realistic facial motion.
Computer-animated characters can be represented in two or three dimensions. They are known as virtual humanoids or avatars, and can be controlled by different techniques. For example, an avatar can be animated by means of commands found in graphical interfaces, in which the user must choose the commands from a finite set of buttons or via a mouse or keyboard.
The MPEG4 encoding provides means for implementing a virtual humanoid. In this coding, there are special parameters that allow the generation and transmission of video of a synthetic “talking head” for multimedia communication.
The MPEG4 encoding includes a set of Facial Animation Parameters (FAPs). These parameters were designed based on the study of small facial actions, and are related to the motion performed by the muscles of the face. This encoding is capable of reproducing facial expressions and head movements made by a person.
These expressions can be grouped into two categories: simple and complex. Examples of simple expressions are blinking, opening and closing the mouth, and raising the eyebrows. Complex expressions represent emotions such as happiness, sadness and fear.
The visual representation of a phoneme is a viseme, that is, the shape of the lips while the phoneme is pronounced. Visemes are used for facial animation synchronized with speech.
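The relationship between phonemes and visemes can be illustrated with a minimal sketch. The phoneme symbols, viseme labels and the table itself are hypothetical and chosen only for illustration; they are not taken from any of the cited documents. The sketch shows that several phonemes may share one lip shape, so the set of visemes is smaller than the set of phonemes.

```python
# Hypothetical phoneme-to-viseme table (illustrative labels only).
# Bilabial consonants such as "m", "b" and "p" share a closed-lips viseme.
VISEME_FOR_PHONEME = {
    "a": "open_wide",
    "e": "spread_mid",
    "i": "spread_narrow",
    "o": "rounded_mid",
    "u": "rounded_narrow",
    "m": "closed",
    "b": "closed",
    "p": "closed",
}


def visemes_for(phonemes):
    """Map a phoneme sequence to lip shapes; unknown phonemes map to 'neutral'."""
    return [VISEME_FOR_PHONEME.get(p, "neutral") for p in phonemes]


if __name__ == "__main__":
    print(visemes_for(["m", "a", "p", "a"]))
```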
Several methods of artificial vision use the features of lip pigmentation to detect the lips and, from the segmentation, assess the shape of the lips to recognize the viseme.
However, the contrast between the color of undecorated lips and that of the facial region is too small. This hinders the segmentation of the lips and makes their contour very inaccurate, so that extracting the characteristics of the lips is inefficient. For this reason, recognizing the shape of the mouth through computer vision techniques is a complex task. Moreover, when the lips are adorned (for example, with lipstick), the task becomes even more complex because of the variety of colors available, further complicating the design of an automated system for identifying visemes.
There are additional difficulties related to the quality of the image acquired by the digital camera. In the particular case of cameras integrated into portable devices such as mobile phones, smartphones and PDAs, the exposure time of the sensing elements makes the image blurred due to motion. Therefore, to achieve a good definition of the movements of the mouth, the mouth must occupy a large portion of the image to allow an efficient estimation of the shape of the lips. In doing so, the camera does not capture other parts of the face that are very important for communication.
Therefore, a system for automatic recognition of lip shape incurs a high computational cost to perform the steps of detection and identification of shapes. In any electronic device, a high computational cost causes an increase in energy consumption and in heat production.
In portable devices, high energy consumption causes the battery to discharge faster, and prolonged use shortens the battery lifetime, since a battery withstands only a finite number of recharges. For example, the battery of a portable device can last about 300 hours in standby and 7 hours of talk time.
Because the computational cost of processing video is much higher than that of a conventional call, the battery is expected to last much less time per charge, reaching a maximum of two hours of use.
Because of the problems described above, methods based on artificial vision concentrate only on detecting the state of the mouth, for example, open or closed. However, speech perception does not depend on acoustic information alone, and the shape of the mouth also aids speech intelligibility. For example, in noisy environments, the shape of the mouth can compensate for the loss of a syllable in the audio channel.
Thus, a more realistic way to communicate through a virtual humanoid is to use the voice to animate the motion of the mouth, leaving the other facial gestures (blinking, changing the gaze and moving the eyebrows) to be driven by the recognition of DTMF tones.
An efficient visual animation of the motion of the mouth is useful for many applications, for example, speech training for people with hearing difficulties, the production of movies and games, forms of interaction through virtual agents, and electronic commerce.
The methods for developing this type of animation are based on mathematical parameters, on physical characteristics of the face obtained through artificial vision, and on audio processing.
An example of a methodology for tracking lip movements through computer vision was proposed by A. W. Senior in the work titled “Face and Feature Finding for a Face Recognition System,” published in the International Conference on Audio and Video-based Biometric Person Authentication, p. 154-159, in March 1999. In this paper, the face region is searched for using a set of template windows of face and facial feature candidates. By means of a pyramidal (multi-resolution) analysis, obtained by scaling the template windows, the face is located, and the process is then repeated to find the facial elements (eyes, mouth, nose and ears). The information extracted using this method is a set of four points at the corners of the mouth. From these points, the width and the height of the mouth are identified; these can be used as parameters to define its shape and to animate a virtual humanoid. However, this technique is not advantageous because of the number of window combinations needed to find the face and the facial elements, which makes the method computationally complex and therefore difficult to implement in portable devices with limited processing power.
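As a minimal sketch of how four mouth points could yield shape parameters, the following assumes the points are the left corner, right corner, top of the upper lip and bottom of the lower lip, each given as (x, y) pixel coordinates. This point layout, the function name and the openness ratio are assumptions made for illustration and are not taken from the cited paper.

```python
import math


def mouth_shape(left, right, top, bottom):
    """Return (width, height, openness) of the mouth from four (x, y) points.

    The point layout is an assumption: left/right mouth corners and the
    top/bottom extremes of the lips.
    """
    width = math.hypot(right[0] - left[0], right[1] - left[1])
    height = math.hypot(bottom[0] - top[0], bottom[1] - top[1])
    # Openness is a hypothetical normalized parameter (height relative to width).
    openness = height / width if width > 0 else 0.0
    return width, height, openness


if __name__ == "__main__":
    print(mouth_shape((10, 20), (50, 20), (30, 10), (30, 30)))
```

A normalized value such as the height-to-width ratio has the advantage of not depending on how large the face appears in the image.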
The Brazilian patent document PI 9909611-0, applicant: Eyematic Interfaces, Inc, published on Oct. 21, 1999, describes a method for recognizing features for animating an avatar based on wavelets. This document uses a wavelet series to detect the end points of the mouth and, from these points, the lip motion is tracked. Each end point of the mouth is found by applying a wavelet transform with a specific characteristic. As known by an expert, applying a wavelet requires several convolutions during the step of identifying important points of the face. Computing the convolution at each point of the image requires a large number of multiplications and additions. This makes the method too complex to be used in portable devices, given their limited memory and processing power.
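The cost of dense convolution can be estimated with simple arithmetic. The sketch below assumes a direct (non-separable, non-FFT) convolution evaluated at every pixel, ignoring border handling; the image size, kernel size and number of filters are illustrative assumptions and do not come from the cited document.

```python
def convolution_multiplications(width, height, kernel_w, kernel_h, num_filters=1):
    """Multiplications for direct 2D convolution evaluated at every pixel.

    Each output pixel needs kernel_w * kernel_h multiplications per filter
    (and roughly as many additions).
    """
    return width * height * kernel_w * kernel_h * num_filters


if __name__ == "__main__":
    # Assumed example: one 320x240 frame, 9x9 kernels, 8 filters.
    one_filter = convolution_multiplications(320, 240, 9, 9)
    eight_filters = convolution_multiplications(320, 240, 9, 9, num_filters=8)
    print(one_filter, eight_filters)
```

Under these assumptions a single filter already needs about 6.2 million multiplications per frame, and the count grows linearly with the number of filters and with the frame rate.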
The article by M-T Yang et al. titled “Lip Contour Extraction for Language Learning in VEC3D,” published in the Journal of Machine Vision and Applications in April 2008, segments the lips by means of active contours. Although this method is quite robust, the initial search for the active contour and the subsequent iterations can take a long time. In applications such as video calls, in which the motion of the avatar must be synchronized with the sound, this approach should not be used because of the long duration of the search and of the subsequent iterations.
Because the lip shape is chiefly responsible for the formation of vowels, and vowels are the main components of the syllable, vowel recognition through voice processing can efficiently identify the shape of the lips and therefore animate the virtual humanoid.
A study of speech recognition related to facial motion was proposed by D. V. McAlister et al. in the work entitled “Lip Synchronization for Animation,” published in SIGGRAPH in January 1997. This method applies the Fast Fourier Transform (FFT) to extract the features of the voice and, from these features, animates the lip motion. Depending on the acquisition time and the sampling rate of the signal, this method can become computationally expensive, and is therefore not advantageous for portable devices with low computing power, such as the devices for which the present innovations are ideally suited.
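To illustrate FFT-based feature extraction in general terms, the following is a minimal sketch of a recursive radix-2 FFT applied to one frame of audio, returning the magnitude spectrum and the dominant frequency. It is not the cited method: the frame length, sampling rate and the choice of the dominant frequency as a feature are assumptions made for illustration.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 FFT; the input length must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(frame, sample_rate):
    """Return the frequency (Hz) of the strongest bin below the Nyquist limit."""
    spectrum = fft(frame)
    magnitudes = [abs(c) for c in spectrum[: len(frame) // 2]]
    peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return peak_bin * sample_rate / len(frame)


if __name__ == "__main__":
    rate, size = 8000, 256  # assumed telephone-band sampling rate and frame size
    tone = [math.sin(2 * math.pi * 500 * i / rate) for i in range(size)]
    print(dominant_frequency(tone, rate))
```

The work per frame grows as N log N with the frame length N, and the number of frames per second grows with the sampling rate, which is why acquisition time and sampling rate drive the overall cost.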
A similar method was proposed by G. Zorica and I. S. Pandzic in the paper entitled “Language Independent Real-Time Lip Synchronization Method Using Genetic Algorithm,” published in the Journal of Signal Processing, p. 3644-3656, in December 2006. In this paper, the result of the Fast Fourier Transform (FFT) is converted into a new scale. The discrete cosine transform (DCT) is then applied to the converted signal and, after all these steps, the coefficients that represent the lip motion are extracted. For applications with dedicated processors or in a personal computer environment, the method is able to operate in real time. However, the number of operations needed to perform this procedure is much greater than in the method proposed by McAlister, making it impractical for portable devices because of the computational cost of all these operations.
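The chain of FFT, scale conversion and DCT can be sketched as follows. The source does not name the scale used in the cited paper; the mel scale is assumed here purely for illustration, and the band energies are assumed to have been computed already from the FFT magnitudes. The function names are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to the mel scale (a common perceptual scale)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def dct2(values):
    """Unnormalized DCT-II of a sequence."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def cepstral_coefficients(band_energies, count):
    """Log-compress the band energies and keep the first `count` DCT terms."""
    logs = [math.log(e + 1e-12) for e in band_energies]  # offset avoids log(0)
    return dct2(logs)[:count]


if __name__ == "__main__":
    energies = [4.0, 2.5, 1.0, 0.5, 0.25, 0.1]  # assumed per-band energies
    print(cepstral_coefficients(energies, 3))
```

Each extra stage (rescaling, logarithm, DCT) adds operations on top of the FFT itself, which is consistent with the higher cost noted for this approach.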
U.S. Pat. No. 6,735,566, granted on May 11, 2004, provides a method that uses speech recognition to produce a realistic facial animation. This method uses training that associates a video of the mouth with the voice in order to model the lip movements, and uses a Hidden Markov Model to extract the lip features of each spoken sound. This approach has high recognition rates and high reliability; however, it is a computationally complex method of pattern recognition, which makes it impractical because of its high computational cost.
Another example of facial animation from the voice is described in U.S. Pat. No. 6,665,643, granted on Dec. 16, 2003, owner: Telecom Italia Lab SPA. According to its teachings, the recognition of spoken phonemes (vowels and consonants) is performed to animate a virtual model. In that patent, each spoken word is transformed into text, and phonemes are identified from the text. This solution is quite efficient, but requires the recognition of many phonemes. The best performance is obtained by identifying the content of the complete spoken message, which makes the method suitable for off-line communication.
The article by S. Kshiragas and N. Magnenat-Thalmann entitled “Lip Synchronization Using Linear Predictive Analysis,” published in the IEEE in July 2000, carries out the recognition of vowels using linear predictive coding (LPC) for the extraction of features, which are then processed by a neural network.
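For readers unfamiliar with linear prediction, a minimal sketch of LPC coefficient estimation by the autocorrelation method with the Levinson-Durbin recursion is given below. It illustrates the general technique only; the prediction order and the test signal are assumptions, and nothing here reproduces the feature set or the neural network of the cited article.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation of the signal for lags 0..max_lag."""
    n = len(signal)
    return [
        sum(signal[i] * signal[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc(signal, order):
    """Return (coefficients, residual_energy) via Levinson-Durbin.

    coefficients[j] weights the sample (j + 1) steps in the past, so that
    x[n] is predicted as sum(coefficients[j] * x[n - 1 - j]).
    """
    r = autocorrelation(signal, order)
    if r[0] == 0.0:
        raise ValueError("signal has no energy")
    coeffs = []
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            # Signal already perfectly predicted; pad remaining terms with zeros.
            coeffs += [0.0] * (order - len(coeffs))
            break
        acc = r[i] - sum(coeffs[j] * r[i - 1 - j] for j in range(i - 1))
        k = acc / error  # reflection coefficient for this order
        coeffs = [coeffs[j] - k * coeffs[i - 2 - j] for j in range(i - 1)] + [k]
        error *= 1.0 - k * k
    return coeffs, error


if __name__ == "__main__":
    decaying = [0.9 ** n for n in range(200)]  # assumed first-order test signal
    print(lpc(decaying, 2))
```

For the decaying test signal, each sample is 0.9 times the previous one, so the first coefficient is close to 0.9 and the second is close to zero.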
A similar method was proposed by O. Farooq and S. Datta in their paper entitled “Phoneme Recognition using Wavelet Based Features,” in the Journal of Information Sciences, vol. 150, p. 5-15, Mar. 2003. This method uses the Fast Wavelet Transform to extract the characteristics of the audio signal and also uses a neural network to recognize phonemes in English.
Feature extraction by linear prediction or by wavelets, followed by classification in a neural network, has low computational complexity. In both methods, the recognition of vowels is performed for English speakers. However, it is important to note that pronunciation in other languages, for example, Portuguese, has a much greater variety of phonemes. This is because a single vowel may have several tonic and nasal variations owing to the different accents of different regions. Consequently, methods based on linear prediction and wavelets have the disadvantage of generating false recognitions because of this variety.
The patent document U.S. 20090158184, applicant: AOL LLC, published on Jun. 18, 2009, claims a system and method for animating an avatar based on an animation perceived in a second avatar, the method comprising the steps of graphically representing a first user with a first avatar capable of being animated; graphically representing a second user with a second avatar capable of being animated, in which communication messages are sent between the first and second user; receiving an indication of an animation of the first avatar; accessing information associating animations of avatars; identifying, based on the accessed information, an animation for the second avatar that is responsive to the indicated animation of the first avatar; and, in response to the received indication, animating the second avatar based on the identified responsive animation. According to the teachings of this patent document, the avatar is animated through an online messaging application (such as MSN or Skype). The avatar moves in accordance with the words written in the system. Thus, there is no recognition of sounds.
U.S. Pat. No. 7,176,956, granted on Feb. 13, 2007, owner: MOTOROLA INC, relates to the animation of avatars in communication between portable devices (video call). The avatars are animated by changing parameters obtained through image recognition techniques applied to images provided by the camera of the mobile phone.
U.S. Pat. No. 7,231,205, granted on Jun. 12, 2007, holder: Telefonaktiebolaget LM Ericsson, relates to the animation of avatars in communication between portable devices. The system is connected to a server that establishes the link between the devices, and this server is the element that provides the avatar service. The state of the avatars can be changed via the keypad of the phone, but the system does not provide for the recognition of sounds.
U.S. Pat. No. 6,665,640, granted on Dec. 16, 2003, owner: Phoenix Solutions, Inc, presents an avatar animated using speech. The avatar uses FAPs as motion parameters. The FAPs are obtained directly from an MPEG4 stream. This system does not simplify the visemes, nor is it optimized for devices with low processing power such as the mobile phones of today.
U.S. Pat. No. 7,123,262, granted on Oct. 17, 2006, owner: Telecom Italia Lab SpA, uses visemes and generates FAPs over a face previously parameterized with an Active Shape Model. According to the document, voice and image are combined to move the model face. This does not constitute an avatar, but rather a technique for animating a modeled face. Such techniques are generally robust and complex, rendering their implementation impossible in portable devices.
The document WO 2008031955, published on Mar. 20, 2008, describes a method and system for the animation of an avatar on a mobile device based on the sound signal corresponding to the voice of a caller in a telephone conversation. This method provides the look and motion of avatars in real time or quasi-real time, the avatar being chosen and/or configured through an online service over the network. The system of document WO 2008031955 comprises a mobile communication device, a signal reception server, and means for calculating and analyzing the sound signal to move the avatar and simulate real-time conversation.
In sum, there is a need for systems and methods that overcome the drawbacks of such disclosures.