It is desirable to show movement of the face during speech, for example on the face of an avatar or of a character in a game, such as a game played on a game console, for example Sony's PS3. In such applications, the speech could be that of the user, or it could be prerecorded speech. Unfortunately, when facial animation is driven by speech recognition, the appearance of the face tends to be unnatural.
A technique referred to as lip-synch is often used in animated movies, such as those produced by Pixar and DreamWorks. In such movies, it is necessary to move the mouth and/or other facial features of a character in accordance with that character's speech. Animation of the mouth and other facial features is typically accomplished by drawing those features by hand.
The current state of the art also uses a speech recognition engine to analyze the phonemes contained in the character's speech. This results in a sequence of phonemes. Each phoneme in such a sequence can be bound to a typical face shape for that phoneme. For purposes of the discussion herein, these face shapes are referred to as visemes (see, for example, Information on converting phonemes to visemes (visual display), Annosoft, LLC, www.annosoft.com/phonemes.html). Thus, a particular viseme is the shape of the face for a particular phoneme. For example, if a character utters the phoneme corresponding to the “ee” sound, then the character's face is given the specific, corresponding “ee” shape. Animated speech involves displaying a sequence of visemes that correspond to the phonemes contained in such speech.
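The phoneme-to-viseme binding described above can be sketched as a simple lookup table. The table entries and names below are illustrative assumptions, not a standard mapping; real systems use larger tables such as the Annosoft mapping cited above.

```python
# Illustrative phoneme-to-viseme table (entries are examples only,
# not a standard mapping).
PHONEME_TO_VISEME = {
    "iy": "ee",   # as in "see"
    "uw": "oo",   # as in "too"
    "eh": "eh",   # as in "bed"
    "m": "mbp",   # bilabials typically share one closed-lip viseme
    "b": "mbp",
    "p": "mbp",
}

def phonemes_to_visemes(phonemes):
    """Map a recognized phoneme sequence to viseme names,
    falling back to a neutral face shape for unknown phonemes."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]
```

Animated speech then amounts to displaying, frame by frame, the face shape named by each viseme in the resulting sequence.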
Face shapes for these visemes can be created in various ways. One such technique is mesh-morphing, in which a mesh of 3-D points is controlled by a set of conformation and expression parameters. The conformation parameters control the relative locations of facial feature points, such as the corners of the eyes and lips; changing these parameters can reshape a base model to create new heads. The expression parameters describe facial actions that can be performed on a face, such as stretching the lips or closing the eyes. Another such technique is bone skinning, an animation approach in which a character is represented in two parts: a surface representation used to draw the character, called the skin, and a hierarchical set of bones used for animation only, called the skeleton. To animate a 3-D model, one or both of these techniques is used to move the facial mesh.
If mesh-morphing is used, each key frame specifies how the morph targets are blended. Thus, if the current facial expression is the “ee” face and the next key frame calls for “oo,” then the facial expression is animated to transition from one to the other to give the impression of speech. In mesh-morphing, the weighting factor for each morph target is modified to make the transition from “ee” to “oo.” There may be a base target of, for example, a character's neutral face, and one or more morph targets, each a mesh that is blended with the base target to produce a facial expression. If the weighting factors used to blend the morph targets with the base target are set at one for “ee” and zero for “oo,” then the final face shows “ee.” If the weighting factors are set at zero and one, the final face shows “oo.” If the weighting factors are set at one half each, then the facial expression is in between “ee” and “oo,” for example approximating “eh.” Thus, it is possible to modify the weighting factors for each key frame, i.e. the designated pose or shape of the face, and make a smooth transition between facial expressions in accordance with speech.
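The weighted blending described above can be sketched as follows. This is a minimal illustration in one scalar coordinate per vertex (real meshes use 3-D points); the function name and the convention of storing morph targets as per-vertex offsets from the base are assumptions for this sketch.

```python
def blend_face(base, targets, weights):
    """Blend a base face mesh with several morph targets.

    base    -- list of vertex positions (scalars here for brevity)
    targets -- list of morph targets, each a list of per-vertex
               offsets from the base mesh
    weights -- one weighting factor per morph target
    """
    out = list(base)
    for target, w in zip(targets, weights):
        for i, offset in enumerate(target):
            out[i] += w * offset
    return out

# Weights (1, 0) yield the "ee" face; (0, 1) yield "oo";
# (0.5, 0.5) yield a face in between, per the discussion above.
```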
The weight could be modified gradually, or it could be modified more quickly, depending on how quickly the transition should occur. One difficulty is that it is possible to change the weights, for example, from zero-one to one-zero in a single frame, so that the face changes from one shape to another in only one frame. However, changing the face in this way produces a change of expression that is jerky and unnatural. Thus, the weights should be chosen to make a smooth transition. While linear interpolation produces a smooth transition, the quality of the transition is not natural, such that there is presently no practical technique that allows for smooth, natural transitions in connection with mesh-morphing.
Bone skinning uses joints to move portions of the mesh to form a particular facial expression, and it can also be used to create visemes. In this case, every joint has weighted parameters that are used for transitions, and a set of such parameters can create a viseme, e.g. for “ee” or “oo.” As the joints move, they pull on the mesh and change the shape of the face.
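The way joints pull on the mesh can be sketched with the standard linear blend skinning formula, reduced here to one scalar coordinate and to joint translations only; the function name and this extreme simplification are assumptions for illustration.

```python
def skin_vertex(rest, joint_offsets, joint_weights):
    """Linear blend skinning for a single (1-D) vertex.

    rest          -- vertex position in the rest pose
    joint_offsets -- displacement each joint would apply to the vertex
    joint_weights -- influence of each joint on this vertex
                     (assumed to sum to 1)
    """
    return rest + sum(w * off for off, w in zip(joint_offsets, joint_weights))
```

A vertex influenced half-and-half by a jaw joint that drops and a lip joint that stays put moves half the jaw's distance, which is how joint motion smoothly deforms the surrounding mesh.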
Another basic problem with the use of facial animation to show speech concerns the speech recognition results themselves, which are a time series of phonemes and, as such, are somewhat compressed in time. For example, if a character says “I,” then ideally the time series of speech recognition results would be “a-a-a-a,” followed at some point by “e-e-e-e.” In the real world, the result is different. A person would say “a-a-a-a,” then some “oo,” then some “e-e-e-e,” with other sounds in between, each lasting only a very short time. If the facial animation follows the speech recognition result exactly, the face moves unnaturally fast and sometimes jumps, for example because there is an unintended phoneme in the middle of the transition from one intended phoneme to the other that the face passes through during the transition. This is an artifact of speech, but the system tracks it nonetheless and moves the face for that phoneme as well, creating unnatural movement in the face.
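One naive remedy, given only as an illustrative sketch and not as the solution proposed here, is to drop phoneme runs that are too short to be intentional, holding the previous stable phoneme in their place; the function name and the frame-count threshold are assumptions.

```python
def drop_short_phonemes(frames, min_run=3):
    """Treat phoneme runs shorter than min_run frames as recognition
    artifacts: replace each such run with the preceding stable phoneme.

    frames -- per-frame phoneme labels, e.g. ["a", "a", "oo", "e", ...]
    """
    out = []
    i = 0
    while i < len(frames):
        j = i
        while j < len(frames) and frames[j] == frames[i]:
            j += 1                      # find the end of the current run
        if j - i >= min_run or not out:
            out.extend(frames[i:j])     # long enough (or first run): keep
        else:
            out.extend([out[-1]] * (j - i))  # too short: hold previous phoneme
        i = j
    return out
```

This removes the single-frame jumps described above, at the cost of also discarding genuinely short phonemes, which is one reason such filtering alone does not yield natural animation.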
One approach to solving this problem is taught in Shin-ichi Kawamoto et al., Key-frame Removal Method for Blendshape-based Cartoon Lip-sync Animation, SIGGRAPH 2006. The basic idea of this teaching is to compute the average vertex movement speed and remove key frames on that basis; that is, which key frames are removed depends on the vertex movement speed. Whether this technique improves the appearance of an animated face during phoneme transitions is not known at this time, although it is thought that a naive implementation would require substantial time to calculate the average vertex movement speed.
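A greatly simplified reading of such speed-based key-frame removal can be sketched as follows. This is not the cited paper's actual algorithm, only an assumed illustration of the idea: keep a key frame only when the face has moved far enough, on average per vertex, since the last kept frame.

```python
def prune_keyframes(keyframes, threshold):
    """Simplified sketch of speed-based key-frame removal.

    keyframes -- list of frames, each a list of vertex positions
                 (scalars here for brevity)
    threshold -- minimum mean per-vertex displacement from the last
                 kept frame required to keep an intermediate frame
    First and last frames are always kept.
    """
    kept = [keyframes[0]]
    for frame in keyframes[1:-1]:
        prev = kept[-1]
        mean_move = sum(abs(a - b) for a, b in zip(frame, prev)) / len(frame)
        if mean_move > threshold:
            kept.append(frame)
    kept.append(keyframes[-1])
    return kept
```

Even in this toy form, the per-frame pass over every vertex hints at why a naive implementation of average-vertex-speed computation could be costly for dense facial meshes.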
Clearly, there is a need in the art for a technique that provides natural looking facial animation driven by speech recognition.