There is growing interest in integrating natural or synthetic objects into multimedia applications to facilitate and increase user-application interaction. In this context, the use of anthropomorphic models intended to facilitate the man-machine relationship is being envisaged. This interest has recently also been acknowledged by international standardization organizations. ISO/IEC standard 14496 “Generic Coding of Audio-Visual Objects” (commonly known as the “MPEG-4 standard” and hereinafter referred to as such), among other things, aims at establishing a general framework for such applications.
In such applications in general, regardless of the specific solutions indicated in the MPEG-4 standard, anthropomorphic models are conceived to assist other information flows and are seen as objects which can be animated, where animation is driven by audio signals such as speech. These signals can also be considered as phonetic sequences, i.e. as sequences of “phonemes”, where a “phoneme” is the smallest linguistic unit (corresponding to the idea of a distinctive sound in a language).
In this case, animation systems need to be developed that deform the geometry and the appearance of the models in synchronism with the voice, so that the synthetic faces assume the typical expressions of speech. The final result toward which development tends is a talking head, or face, which appears as natural as possible.
The application contexts of animated models of this kind can range from Internet applications, such as welcome or help-on-line messages, to co-operative work applications (e.g. e-mail browsers), to professional applications, such as the creation of cinema or television post-production effects, to video games, etc.
The models of human faces commonly used are, in general, based on a geometrical representation consisting of a three-dimensional mesh structure (known as a “wire-frame”). Animation is based on applying, in succession, suitable transforms to the polygons forming the wire-frame (or to a subset of them) to reproduce the required effect, i.e. in this specific case, the movements related to speech.
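By way of illustration only, the successive application of transforms to a subset of a wire-frame can be sketched as follows. This is a minimal sketch under stated assumptions: the names `WireFrame`, `translate_subset` and `animate` are hypothetical, the mesh is assumed to be made of triangles, and a plain translation of selected vertices stands in for whatever transforms a real animation system would use.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vertex = Tuple[float, float, float]


@dataclass
class WireFrame:
    """Hypothetical wire-frame: vertices plus polygons given as vertex index triples."""
    vertices: List[Vertex]
    polygons: List[Tuple[int, int, int]]


def translate_subset(mesh: WireFrame, vertex_ids: Sequence[int], offset: Vertex) -> WireFrame:
    """Return a new mesh in which only the selected vertices are moved by `offset`."""
    moved = list(mesh.vertices)
    for i in vertex_ids:
        x, y, z = moved[i]
        moved[i] = (x + offset[0], y + offset[1], z + offset[2])
    # The polygon connectivity is unchanged; only vertex positions differ.
    return WireFrame(moved, mesh.polygons)


def animate(mesh: WireFrame, steps: Sequence[Tuple[Sequence[int], Vertex]]) -> List[WireFrame]:
    """Apply the transforms in succession and keep one mesh per animation step."""
    frames = []
    for vertex_ids, offset in steps:
        mesh = translate_subset(mesh, vertex_ids, offset)
        frames.append(mesh)
    return frames


# Illustrative data only: one triangle whose third vertex is lowered in two steps.
rest = WireFrame(vertices=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
                 polygons=[(0, 1, 2)])
frames = animate(rest, [([2], (0.0, -0.5, 0.0)), ([2], (0.0, -0.5, 0.0))])
```

Each step returns a new mesh rather than modifying the previous one, so the sequence of frames can be inspected or rendered independently.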
The solution envisaged by the MPEG-4 standard for this purpose describes the use of a set of facial animation parameters, defined independently with respect to the model, to ensure interoperability of systems.
This set of parameters is organized on different levels: the highest level consists of the so-called “visemes” and “expressions”, while the lowest level consists of the elementary transforms that allow the face to assume a generic posture. According to the MPEG-4 standard, a viseme is the visual equivalent of one or more similar phonemes.
In this invention, the term viseme is used to indicate a shape of the face, associated with the utterance of a phoneme and obtained by means of the application of low-level MPEG-4 parameters, and does not therefore refer to high-level MPEG-4 parameters.
Various systems for animating facial models driven by voice are known in the literature. For example, the following documents can be cited: “Converting Speech into Lip Movements: A Multimedia Telephone for Hard of Hearing People”, by F. Lavagetto, IEEE Transactions on Rehabilitation Engineering, Vol. 3, No. 1, March 1995; DIST, Genoa University, “Description of Algorithms for Speech-to-Facial Movements Transformation”, ACTS “SPLIT” Project, November 1995; TUB, Technical University of Berlin, “Analysis and Synthesis of Visual Speech Movements”, ACTS “SPLIT” Project, November 1995. These systems, however, do not implement MPEG-4 standard compliant parameters and, for this reason, are not very flexible.
An MPEG-4 standard compliant animation method is described in Italian Patent Application no. TO98A000842 by the Applicant. This method associates phonemes or groups of phonemes with visemes selected from a set comprising the visemes defined by the MPEG-4 standard and visemes specific to a particular language. According to this method, visemes are split into a group of macro parameters, which characterize the shape and/or position of the labial area and of the jaw of the model and are associated with respective intensity values, representing the deviation from a neutral position and ensuring adequate naturalness of the animated model.
Furthermore, the macro parameters are split into the low-level facial animation parameters defined in the MPEG-4 standard, which are also associated with intensity values linked to the macro parameter values, ensuring adequate naturalness of the animated model.
Said method can be used for different languages and provides a degree of naturalness in the resulting synthetic model. However, the method is not based on the analysis of motion data tracked on the face of a real speaker. For this reason, the animation result is not very realistic or natural.
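Purely as an illustration of the two-level decomposition just described, the chain from phoneme to viseme, from viseme to macro parameters, and from macro parameters to low-level parameters can be sketched as follows. This is a sketch under stated assumptions and not the method of the cited application: the viseme and macro parameter names, all numeric values, and the linear weighting used to derive low-level intensities are hypothetical, and the low-level parameter names are merely modeled on MPEG-4 facial animation parameter names.

```python
from typing import Dict

# Hypothetical association of phonemes (or groups of phonemes) with visemes.
PHONEME_TO_VISEME: Dict[str, str] = {
    "p": "viseme_pbm", "b": "viseme_pbm", "m": "viseme_pbm",
    "a": "viseme_a",
}

# Hypothetical visemes expressed as macro parameters with intensity values,
# each intensity being a deviation from the neutral position (0).
VISEME_TO_MACROS: Dict[str, Dict[str, float]] = {
    "viseme_pbm": {"jaw_opening": 0.0, "lip_width": 0.0},
    "viseme_a": {"jaw_opening": 100.0, "lip_width": 40.0},
}

# Hypothetical split of each macro parameter into low-level parameters.
# The weights are an assumption: the source only states that low-level
# intensities are linked to the macro parameter values.
MACRO_TO_FAPS: Dict[str, Dict[str, float]] = {
    "jaw_opening": {"open_jaw": 1.0},
    "lip_width": {"stretch_l_cornerlip": 0.5, "stretch_r_cornerlip": 0.5},
}


def low_level_parameters(phoneme: str) -> Dict[str, float]:
    """Expand one phoneme into low-level parameter intensities."""
    macros = VISEME_TO_MACROS[PHONEME_TO_VISEME[phoneme]]
    faps: Dict[str, float] = {}
    for macro, intensity in macros.items():
        for fap, weight in MACRO_TO_FAPS[macro].items():
            # Contributions from different macro parameters add up.
            faps[fap] = faps.get(fap, 0.0) + intensity * weight
    return faps


faps_a = low_level_parameters("a")
faps_p = low_level_parameters("p")
```

With these illustrative tables, the phonemes "p", "b" and "m" share one viseme, which shows how a group of phonemes can be associated with a single face shape.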