1. Field of the Invention
The present invention relates to facial animation accompanying synthetically generated speech and to the movement patterns belonging to that speech. In particular, it relates to animation where a person speaks a first language which is translated into a second language, the second language replacing the speaker's original language at reproduction.
2. Discussion of the Background
Patent application Se 9504367-5 describes how visual synthesis can be made with movement patterns directly connected to recorded “sounds”, that is, polyphones determined by half-syllables. These movement patterns are recorded in nonsense syllables and in a frame phrase with the main stressed word after the test syllable, in order to obtain as even a fundamental tone (f0) of the test syllable as possible. This is a condition for being able to change f0 artificially as much as possible at synthesis. The visual synthesis will certainly be fully synchronized with the sound wave and can be extended or “shortened” in time depending on the manipulation of the segments of the sound wave. On the other hand, the movement patterns will not be able to signal prosodic information, which is connected with larger deviations of movement in stressed than in unstressed syllables, and at information focus, where the deviations of movement are normally also larger than in non-focal position. The task of the present invention is to indicate a solution to the above-mentioned problem.
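The extension or shortening of a movement pattern in step with its sound segment can be illustrated by a minimal sketch. It assumes that a movement pattern for one facial point is a list of per-frame displacement values and that linear interpolation is an acceptable way to change its duration; neither the representation nor the function name `resample_track` is taken from the cited application.

```python
def resample_track(track, new_length):
    """Stretch or shorten one movement track to new_length frames.

    track: per-frame displacement values for a single facial point
           (an assumed representation, not one given in the source).
    Linear interpolation is used between neighbouring frames.
    """
    if not track:
        raise ValueError("track must contain at least one frame")
    if new_length < 1:
        raise ValueError("new_length must be at least 1")
    if len(track) == 1 or new_length == 1:
        return [track[0]] * new_length

    # Map each output frame index onto a fractional input position.
    scale = (len(track) - 1) / (new_length - 1)
    out = []
    for i in range(new_length):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(track) - 1)
        frac = pos - lo
        out.append(track[lo] * (1.0 - frac) + track[hi] * frac)
    return out
```

If the sound segment is lengthened from two frames to three, the associated track would be resampled with the same target length, so that sound and movement remain synchronized.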
Swedish patent application No 9504367-5 discloses how facial expressions can be associated with produced sounds. It describes how polyphones are stored together with movement patterns in connection with the sounds in question. The patent document, however, does not disclose how stresses and accentuations in the speech are to be reproduced in the animation.
The present invention relates to a device for prosody generation in visual synthesis. A number of half-syllables are stored together with registered movement patterns in a face. In concatenation synthesis of speech, a number of half-syllables are put together to form words and sentences. The words and sentences are given a stress and an intonation pattern corresponding to the intended language. Further, a number of points in the face and their movement patterns have been registered. In connection with the generation of words and sentences, the movement patterns of the different points are amplified depending on a given stress and sentence intonation. The resulting movement patterns are then applied to the face, whereby a lifelike animation is obtained, for instance when a person's speech in a first language is translated into a second language.
In a first embodiment, the invention includes means for storing and reproducing sound. Further, movement patterns in a face associated with the sounds are registered. Said movement pattern is represented by a number of points in the face. The sounds in question chiefly consist of a number of half-syllables, for instance “spru”. Movement patterns, in the model, for the respective half-syllables are further registered and stored in said means or in a database accessible from the means. In connection with the production of words and sentences, said half-syllables are put together. A sequence is thereby obtained which corresponds to a line of words intended by the speaker. In order to produce a natural spoken sequence, the speech is given a stress and sentence intonation corresponding to the meaning of the speech. The movements of the different points are further put together and applied to the facial model, whereby a movement pattern corresponding to the speech is obtained. A facial texture is then applied to the model.
To make the movement pattern in the face lifelike, the movement patterns are amplified in relation to stresses. Said stresses in the speech are applied to a facial model.
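The concatenation of half-syllables and the amplification of their movement patterns in relation to stress can be sketched as follows. This is an illustration under stated assumptions only: the class `HalfSyllable`, the function `synthesize_movement`, the three stress levels and the gain values in `STRESS_GAIN` are hypothetical, and every stored half-syllable is assumed to carry tracks for the same set of facial points.

```python
from dataclasses import dataclass


@dataclass
class HalfSyllable:
    label: str
    # Facial point name -> per-frame displacement from the rest position.
    # This layout is an assumption made for the sketch.
    tracks: dict


# Hypothetical amplification factors; the source gives no numeric values.
STRESS_GAIN = {"unstressed": 1.0, "stressed": 1.5, "focus": 2.0}


def synthesize_movement(units, stress_levels):
    """Concatenate half-syllable tracks, scaling each unit by its stress gain.

    units: HalfSyllable objects in spoken order.
    stress_levels: one key of STRESS_GAIN per unit.
    Returns a dict mapping each facial point to its full displacement track.
    """
    if len(units) != len(stress_levels):
        raise ValueError("one stress level is required per half-syllable")
    result = {}
    for unit, level in zip(units, stress_levels):
        gain = STRESS_GAIN[level]
        for point, track in unit.tracks.items():
            # Larger deviations of movement for stressed or focused units.
            result.setdefault(point, []).extend(v * gain for v in track)
    return result
```

For example, a neutrally recorded unit labelled “spru” with jaw displacements `[0.0, 2.0]` marked as stressed, followed by an unstressed unit with jaw displacement `[1.0]`, would give the jaw track `[0.0, 3.0, 1.0]`, which is then applied to the facial model.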
The movement pattern of the face is represented by a number of stored half-syllables. These half-syllables and the associated movement patterns are recorded in a neutral frame of mind and with neutral stress. A set of half-syllables is stored in this way together with the movement patterns in question. When speech in a first language is translated into a second language, the fundamental stresses and movement patterns are transferred to the second language. A movement pattern reproduced in the face will thereby be reflected in the reproduced speech.
The device further determines stress positions for sentences and/or separate words. This can be done by already known methods; see for instance patent application No 9504367-5, which deals with speech synthesis. Accentuations and stresses are transferred to the corresponding movements in the movement pattern of the face. On receiving speech in a first language, the device is further arranged to translate the speech into a second language. Stresses in the first speech are registered and transferred to stresses in the corresponding parts of the second language. The movement patterns in the face are thereby adapted to the second language with regard to stresses, sentence accentuations, and intonations. The speaker is in this way given a movement pattern, applied to the face, which corresponds to the speech produced in the second language.
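The transfer of stresses from the first language to the corresponding parts of the second language can be illustrated by a short sketch. It assumes that a word alignment between the source and the translated sentence is already available from the translation step; the text does not state how corresponding parts are identified, so the alignment format and the function name `transfer_stress` are hypothetical.

```python
def transfer_stress(source_stress, alignment, target_length):
    """Carry word-level stress marks over to a translated sentence.

    source_stress: list of booleans, True where the source word is stressed.
    alignment: list of (source_index, target_index) pairs; assumed to be
               supplied by the translation step.
    target_length: number of words in the translated sentence.
    Returns a list of booleans marking the stressed target words.
    """
    target_stress = [False] * target_length
    for src_idx, tgt_idx in alignment:
        if source_stress[src_idx]:
            target_stress[tgt_idx] = True
    return target_stress
```

If the second word of a three-word source sentence is stressed and the alignment maps it to the third word of the translation, the third target word is marked as stressed. The resulting marks could then select the amplification applied to the movement patterns of the half-syllables in that word.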
The points registered in the face are recorded, for instance, using marked face points which are tracked by laser light or the like. The selection of points in the face depends on the extent to which the animation is to correspond, or is required to correspond, to a real movement pattern. The invention further relates to a method for visual speech synthesis. Words and sentences are created by putting together polyphones and stresses in words and clauses.
The present invention makes it possible to reproduce a speaker's presentation in a second language with a movement pattern reproduced in the face which corresponds to the movement pattern in the second language. The invention is of importance, for instance, in telephony where the speaker is represented in picture. In a future when telecommunication systems will be responsible for translation between different languages, the risk of misunderstanding will, if not eliminated, at least be considerably reduced. The fields in telephony which are of interest today are broadband transmissions applied in conference telephony. The invention can also be expected to be of importance for video-telephony between individuals in future telecommunication systems, as households are gradually getting broadband connections with the possibility of using video-telephony.