This invention concerns audio-visual or multimedia communication systems and in particular a method and an apparatus for the animation, driven by parameters derived from audio sources, of a synthesized human face model.
At present, development activities for multimedia applications are considering the integration of natural and synthetic audio-visual objects with increasing interest, in order to facilitate and improve the interaction between user and application. In this area, the adoption of anthropomorphic models to facilitate man-machine interaction is envisaged. This interest has also been perceived by international standardization bodies, and the ISO/IEC standard 14496, "Generic Coding of Audio-Visual Objects", has at present entered its definition phase. Said standard, which is commonly known as the "MPEG-4 standard" and is hereinafter referred to by such term, is aimed among other things at providing a reference framework for said applications.
Regardless of the specific solutions given by the MPEG-4 standard, the anthropomorphic models are thought of as an ancillary means to other information streams and are seen as objects capable of animation, where the animation is driven, by way of example, by audio signals such as the voice. In that case it is necessary to develop animation systems that, in synchronism with the voice itself, can deform the geometry and the look of the models in such a way that the synthetic faces take up typical countenances related to speech. The target is a talking head or face whose look is as close as possible to reality.
The application contexts of animated models of that kind may range from Internet applications, such as welcome messages or on line assistance messages, to co-operative work applications (for instance, electronic mail readers), as well as to professional applications, such as the implementation of post-production effects in the film and TV industry, to video games, and so on.
The models of human faces are generally implemented starting from a geometric representation formed by a 3-D mesh structure or "wire frame". The animation is based on the application, in sequence and without interruption, of appropriate deformations of the polygons forming the mesh structure (or of a subset of such polygons) in such a way as to achieve the required effect during the display phase, in the specific case the movement of the jaw and lip region.
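By way of illustration only, the deformation of a subset of the mesh can be sketched as follows. The `Mesh` type, the `deform` function, the displacement values and the linear scaling by a weight are all assumptions made for this sketch and are not taken from any specific model.

```python
from dataclasses import dataclass


@dataclass
class Mesh:
    vertices: list  # 3-D vertex positions as (x, y, z) tuples
    polygons: list  # each polygon is a tuple of indices into `vertices`


def deform(mesh, displacements, weight):
    """Return a new mesh in which a subset of vertices is displaced.

    `displacements` maps a vertex index to the (dx, dy, dz) offset that
    applies at weight 1.0; vertices not listed are left unchanged.
    """
    new_vertices = list(mesh.vertices)
    for index, (dx, dy, dz) in displacements.items():
        x, y, z = new_vertices[index]
        new_vertices[index] = (x + weight * dx, y + weight * dy, z + weight * dz)
    return Mesh(new_vertices, mesh.polygons)


# One triangle standing in for the jaw region (hypothetical geometry).
mesh = Mesh([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [(0, 1, 2)])
jaw_open = {0: (0.0, -2.0, 0.0)}  # only vertex 0 moves

# Deformations applied in sequence, one per displayed frame.
frames = [deform(mesh, jaw_open, w) for w in (0.0, 0.5, 1.0)]
```

Each frame is derived from the undeformed mesh, so the connectivity of the polygons is never altered; only vertex positions change.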
The solution defined by the MPEG-4 standard envisages for this purpose the use of a set of face animation parameters, defined independently of the model, so as to ensure the interworking of the systems. This set of parameters is organized on two layers: the upper layer is formed by the so-called "visemes", which represent the positions of the speaker's mouth in correspondence with the phonemes (i.e. the elementary sound units); the lower layer represents instead the elementary deformations to be applied in correspondence with the different visemes. The standard precisely defines how lower layer parameters must be used, whereas it does not set constraints on the use of upper layer parameters. The standard defines a possible association between phonemes and visemes for voice-driven animation; the related parameters must then be applied to the model adopted.
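The association between phonemes and visemes mentioned above is a many-to-one lookup, which can be sketched as follows. The phoneme symbols, the viseme names and the grouping are hypothetical placeholders chosen for this sketch; they are not the association defined by the standard.

```python
# Hypothetical phoneme-to-viseme table: several phonemes share one mouth position.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "a": "open_vowel",
}
NEUTRAL_VISEME = "neutral"  # assumed fallback for phonemes missing from the table


def visemes_for(phonemes):
    """Map a sequence of phoneme symbols to the corresponding viseme names."""
    return [PHONEME_TO_VISEME.get(p, NEUTRAL_VISEME) for p in phonemes]


sequence = visemes_for(["m", "a", "p", "x"])
```

The lower layer is then reached by looking up, for each viseme, the elementary deformations to be applied to the model.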
Different methods of achieving animation are known from the literature. By way of example, one can mention the following papers: "Converting Speech into Lip Movements: A Multimedia Telephone for Hard of Hearing People", by F. Lavagetto, IEEE Transactions on Rehabilitation Engineering, Vol. 3, No. 1, March 1995; DIST, University of Genoa, "Description of algorithms for Speech-to-Facial Movements Transformations", ACTS "SPLIT" Project, November 1995; TUB, Technical University of Berlin, "Analysis and Synthesis of Visual Speech Movements", ACTS "SPLIT" Project, November 1995.
The first document describes the possibility of implementing animation starting from phonemes, by identifying the associated visemes and transforming the visemes into articulatory parameters to be applied to a model; alternatively it suggests the direct transformation of spectral information into articulatory parameters through an adequately trained neural network. However, the adopted articulatory parameters are not the facial animation parameters envisaged by the MPEG-4 standard and therefore the suggested method is not flexible. The two papers presented at the ACTS "SPLIT" Project likewise do not describe the use of the facial animation parameters foreseen by the MPEG-4 standard; further, the obtained parameters are only aimed at choosing an image from a database containing images of lips in different positions (corresponding to the various visemes).
According to this invention, a method and an apparatus for animation are provided that are able to receive visemes and to apply the appropriate geometric deformations to any facial model complying with the MPEG-4 standard. Besides assuring a much higher quality, this allows the user to observe the synthetic speaker from positions other than the frontal one, to move closer to or away from it, etc.
More particularly, the invention provides a method wherein the driving audio signal is converted into phonetic data readable by a machine and such data are transformed into parameters representative of elementary deformations to be applied to such model, and wherein the transformation of phonetic data includes the following steps:
associating individual items of phonetic information or groups of phonetic information items with respective information items (visemes) representative of a corresponding position of the speaker's mouth, said visemes being selected within a set which comprises visemes independent of the language of the driving audio signal and visemes specific for such a language;
splitting each viseme into a group of macroparameters characterizing the mouth shape and the positions of lips and jaw, and associating each of the macroparameters of a given viseme with an intensity value representative of a displacement from a neutral position and selected within an interval determined in an initialization phase so as to guarantee a good naturalness of the animated model;
splitting the macroparameters into said parameters representative of deformations to be applied to a face model, which parameters are selected within a group of standard facial animation parameters relating to the mouth movements, and associating said parameters with intensity values which depend on the intensity values of macroparameters and which are also selected within an interval designed to guarantee the naturalness of the animated model.
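The three steps above can be sketched as a chain of table lookups. This is a minimal sketch under stated assumptions: every table entry, name, interval and weight below is a hypothetical placeholder, the parameter names are not the identifiers of the standard, and the linear dependence of the parameter intensities on the macroparameter intensities is an assumption of the sketch, since the text only states that one depends on the other.

```python
# Step 1 tables (hypothetical): language-independent and language-specific visemes.
COMMON_VISEMES = {"p": "v_bilabial", "a": "v_open"}
LANGUAGE_VISEMES = {"it": {"gl": "v_palatal"}}

# Step 2 tables (hypothetical): intensities as displacements from neutral (0.0),
# and the intervals that would be fixed in the initialization phase.
VISEME_MACROS = {"v_open": {"lip_height": 0.5, "jaw": 1.5, "lip_width": 0.0}}
MACRO_INTERVALS = {"lip_height": (-1.0, 1.0), "jaw": (0.0, 1.0), "lip_width": (-1.0, 1.0)}

# Step 3 tables (hypothetical): each macroparameter feeds mouth-related
# animation parameters with an assumed linear weight.
MACRO_TO_PARAMS = {
    "jaw": {"param_jaw_open": 1.0},
    "lip_height": {"param_upper_lip": -0.5, "param_lower_lip": 0.5},
    "lip_width": {"param_corner_left": 0.5, "param_corner_right": 0.5},
}
PARAM_INTERVAL = (-1.0, 1.0)


def clamp(value, interval):
    """Keep an intensity inside the interval chosen for naturalness."""
    low, high = interval
    return max(low, min(high, value))


def viseme_table(language):
    """Step 1: merge language-independent and language-specific visemes."""
    table = dict(COMMON_VISEMES)
    table.update(LANGUAGE_VISEMES.get(language, {}))
    return table


def macros_for(viseme):
    """Step 2: split a viseme into macroparameters with bounded intensities."""
    return {name: clamp(value, MACRO_INTERVALS[name])
            for name, value in VISEME_MACROS[viseme].items()}


def params_for(macros):
    """Step 3: split macroparameters into bounded deformation parameters."""
    params = {}
    for name, intensity in macros.items():
        for param, weight in MACRO_TO_PARAMS[name].items():
            params[param] = clamp(params.get(param, 0.0) + weight * intensity,
                                  PARAM_INTERVAL)
    return params


viseme = viseme_table("it")["a"]
deformations = params_for(macros_for(viseme))
```

In this example the requested jaw intensity of 1.5 lies outside its interval and is limited to 1.0 before being passed on to the deformation parameters.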
The invention also concerns the apparatus for the implementation of the method, comprising:
means for generating phonetic information representative of the driving audio signal, readable by a machine;
means for converting the phonetic information into parameters representative of elementary deformations to be applied to such a model, said conversion means being capable of: associating individual phonetic information items or groups of phonetic information items with respective information items (visemes) representative of a corresponding mouth position in the synthesized model, the visemes being read from a memory containing visemes independent of the language of the driving audio signal and visemes specific for such a language; splitting each viseme into a group of macroparameters characterizing mouth shape and positions of lips and jaw in the model; associating each of the macroparameters of a given viseme with an intensity value representative of a displacement from a neutral position and selected within a given interval in an initialization phase so as to guarantee a good naturalness of the animated model; splitting the macroparameters into parameters representative of deformations to be applied to such a model, which parameters are selected within a group of standard facial animation parameters relating to mouth movements; associating said parameters with intensity values which depend on the intensity values of the macroparameters and which also are selected within an interval designed to guarantee the naturalness of the animated model; and
means for applying the parameters to the model, under control of the means for the generation of phonetic information.
In the paper "Lips and Jaw Movements for Vowels and Consonants: Spatio-Temporal Characteristics and Bimodal Recognition Applications" by P. Cosi and E. Magno Caldognetto, presented at the NATO-ASI Workshop on Speech Reading (Bonas, France, Aug. 28 to Sep. 10, 1995) and published in "Speech Reading by Human Machines" edited by D. G. Stork, M. E. Henneke, NATO-ASI Series 150, Berlin, Springer-Verlag, 1996, pages 291 to 314, the possibility is mentioned of characterizing a viseme through four macro-parameters, namely:
mouth width (hereinafter referred to as LOW from the initials of Lip Opening Width)
vertical distance between lips (hereinafter referred to as LOH, from the initials of Lip Opening Height)
jaw opening (hereinafter indicated as JY)
lip protrusion (hereinafter indicated as LP)
The paper states in general that each of those macro-parameters is associated with an intensity value. Nevertheless, the above-cited paper essentially concerns the study of interactions between voice and facial movements and does not envisage the application of its results to facial animation, for which actual knowledge of the intensity values is an essential condition for achieving an animated model that is as natural as possible.
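A viseme characterized by the four macro-parameters LOW, LOH, JY and LP can be represented as a simple record, sketched below. The `MacroParameters` type, the convention that 0.0 denotes the neutral position and the numeric values are assumptions made for this sketch; the cited paper does not supply them.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MacroParameters:
    """Intensities of the four macro-parameters; 0.0 is assumed to be neutral."""
    LOW: float = 0.0  # mouth width (Lip Opening Width)
    LOH: float = 0.0  # vertical distance between lips (Lip Opening Height)
    JY: float = 0.0   # jaw opening
    LP: float = 0.0   # lip protrusion

    def is_neutral(self):
        """True when no macro-parameter is displaced from the neutral position."""
        return all(getattr(self, f.name) == 0.0 for f in fields(self))


NEUTRAL = MacroParameters()
# Hypothetical intensities for a rounded, protruded mouth shape.
rounded = MacroParameters(LOW=-0.5, LOH=0.25, JY=0.25, LP=0.75)
```

Without concrete intensity values such a record cannot drive an animation, which is the gap noted above.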