The present invention relates to character animation in image synthesis systems. In particular, though not exclusively, the invention relates to a method and system which operate in real-time to animate an image of a head of a character, including the character's face, so that the character appears to speak.
WO97/36288 describes an image synthesis method for synthesis of a moving picture of a face to accompany speech, for example synthetic speech. This method is based on three distinct steps being carried out, these being: (1) text-to-phoneme generation; (2) phoneme-to-viseme generation; (3) viseme-to-animation generation. In step (2), a viseme selected from a specific list is associated with each phoneme. To achieve a smooth transition, the new phoneme comes on while the old one goes off; this process can take 100 milliseconds or more, and during this time both visemes are active. Additionally, in this method each animation frame is operated on by matching a single target face image to an adjusted wireframe, the wireframe being adjusted according to which visemes are required to represent desired phonemes.
It is an aim of the present invention to improve synchronization of lip movement to words being spoken, as well as incorporating facial expressions and/or head movements appropriate to the speech into the animated face, so as to closely simulate a human face speaking.
According to a first aspect of the present invention we provide a method of generating an animated image of at least a head of a character which is speaking, the character's face having visible articulation matching words being spoken, the method comprising:
(a) processing an input stream of marked up text comprising text to be spoken by an animated character and a plurality of mark up instructions representing behavioural features to be implemented in the animated character, so as to replace recognisable ones of said mark up instructions with predefined modified mark up instructions, so as to convert said input stream of marked up text into a modified output stream of marked up text;
(b) processing said modified output stream of marked up text using randomising means so as to insert additional randomly selected mark up instructions into said modified output stream of marked up text, said randomly selected mark up instructions representing random behavioural features to be implemented in the animated character;
(c) processing said modified output stream of marked up text, with said additional randomly selected mark up instructions inserted therein, so as to produce: an audio signal stream for use in generating an audio signal representing said text being spoken; a phoneme stream representing a sequence of phonemes corresponding to successive portions of the audio signal; and a mark up instruction stream comprising mark up instructions for use in other processing phases of the method;
(d) processing said phoneme stream using phoneme-to-viseme mapping means so as to produce a morph target stream representing a series of morph targets, where a morph target comprises a predefined shape of the head of the character, wherein each said morph target in the series comprises a viseme, where a viseme comprises a predefined shape of the face containing a mouth of the character in a predetermined mouth shape matching a said phoneme;
(e) modifying said morph target stream using mark-up instructions contained in said mark-up instruction stream so as to produce a modified morph target stream representing a modified series of morph targets comprising said series of visemes having at least one further morph target inserted therein;
(f) processing said modified morph target stream, and said mark-up instruction stream, so as to generate an animated image of at least the character's head, said animated image comprising a sequence of image frames including image frames showing the character's face in said predefined shapes corresponding to the morph targets in said modified morph target stream; and
(g) displaying said animated image on a display means synchronously with generating said audio signal from said audio signal stream so that the animated character appears to speak, the movement of the mouth portion of the character's face matching the phonemes in the audio signal.
In one preferred embodiment, the method operates in real-time in the sense that said animated image of the character speaking audible words is generated on-the-fly from said modified output stream of marked up text, the image frames of the animated image being generated at a rate of at least 15 frames per second, most preferably at least 25 frames per second.
The modified mark up instructions which replace the recognisable mark up instructions in the input stream of marked up text, in step (a) of the method, comprise "expanded" mark up instructions which are specified by templates or "macros" stored in a memory means of a computer system means in which the method is implemented. In step (a), said mark up instructions in said input stream of marked up text are compared with these templates.
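The template comparison of step (a) can be pictured with a minimal sketch. The tag syntax (`<name>`), the contents of the `MACROS` table and the function name `expand_macros` are all hypothetical and chosen only for illustration; the text does not specify a mark up syntax. The sketch also assumes that expansion is repeated until no recognisable tag remains.

```python
import re

# Hypothetical template table: a recognisable tag maps to its "expanded" form.
MACROS = {
    "<greet>": "<smile><nod>",
    "<sad>": "<frown><look_down>",
}

# Assumed tag syntax: anything between angle brackets.
TAG = re.compile(r"<[^<>]+>")


def expand_macros(text, macros=MACROS, max_passes=8):
    """Replace recognised tags with their templates until the text is stable.

    Unrecognised tags are passed through unchanged. max_passes guards
    against templates that refer to each other in a cycle.
    """
    for _ in range(max_passes):
        expanded = TAG.sub(lambda m: macros.get(m.group(0), m.group(0)), text)
        if expanded == text:
            break
        text = expanded
    return text
```

For instance, `expand_macros("<greet>Hello there")` yields `"<smile><nod>Hello there"` with the table above, while a tag absent from the table is left untouched for later stages.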
Each morph target preferably comprises a three-dimensional model mesh of a head of a character, including all the facial features of the character. Where the morph target is a viseme, these facial features include said mouth portion in a predetermined mouth shape. Alternatively, each morph target may comprise an image of only a portion of a head or face of the character, which portion, in the case of a viseme, includes said mouth portion in a predetermined mouth shape.
The predetermined mouth shape of the mouth portion in each viseme is preferably unique to that viseme.
Conveniently, the further morph target inserted into the series of visemes is a morph target in which the face has a predefined facial expression, such as a smile or a frown, or is otherwise configured to provide additional expression in the final animated face. More than one further morph target may be inserted; for example, a sequence of morph targets may be inserted so as to add more complicated additional expression to the face. For example, an additional sequence of morph targets, including visemes, may be inserted to add particular words or phrases to the character's speech, such as making the character say goodbye, where several expressions are combined in a short timed image sequence. The visemes in the original series of visemes may therefore relate to image frames which are more than one image frame apart in the final animation.
The method advantageously further includes inserting additional randomly selected morph targets into said modified series of morph targets, said additional morph targets providing random movements/expressions in the animated face to add authenticity to the face. For example, where the character is a human being, such randomly selected additional morph targets help to make the animated face more closely resemble a human being speaking. The images represented by said additional morph targets may, for example, show the eyes blinking or the head moving slightly from side to side or the face effecting a nervous twitch.
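One way such random insertion might look is sketched below. The `TimedTarget` record, the pool of target names, the fixed 0.15 second duration and the exponentially distributed gaps between insertions are assumptions made for this example; the text only requires that the additional morph targets be randomly selected.

```python
import random
from dataclasses import dataclass


@dataclass
class TimedTarget:
    name: str        # hypothetical morph target identifier
    start: float     # start time in seconds
    duration: float  # how long the target is held, in seconds


def insert_random_targets(targets, total_time, rng, mean_gap=3.0,
                          pool=("blink", "head_tilt", "twitch")):
    """Return the targets plus randomly timed extras, ordered by start time.

    Gaps between extras are drawn from an exponential distribution with
    the given mean (an assumption; any random schedule would do).
    """
    extras = []
    t = rng.expovariate(1.0 / mean_gap)
    while t < total_time:
        extras.append(TimedTarget(rng.choice(pool), t, 0.15))
        t += rng.expovariate(1.0 / mean_gap)
    return sorted(list(targets) + extras, key=lambda m: m.start)
```

Passing in a seeded `random.Random` instance keeps a given animation reproducible, which can help when comparing renders.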
In step (d) of the method, preferably each phoneme represented in said phoneme stream is mapped to a respective viseme. However, at least one of said phonemes, and the respective viseme allocated therefor, is then removed so that the removed visemes are not included in said series of visemes, or said modified series of visemes. The phoneme(s) which are removed are those deemed unnecessary in order to sufficiently synchronize the mouth/lip movement of the final animated face to the text being spoken. Predetermined criteria are used to assess whether any particular phoneme and its respective viseme are unnecessary and should therefore be removed. Removal of unnecessary phonemes in this manner has the advantage of performing a smoothing action on the series of visemes such that the mouth in the final animated face will move with less discontinuity, in a more authentic fashion. Without this feature the mouth would be seen to "pop" about a lot.
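The mapping and removal can be sketched as follows. The text does not state the predetermined criteria, so this example assumes two purely illustrative ones: a phoneme shorter than `min_duration` is dropped, and a phoneme whose viseme repeats the previous kept viseme is dropped. The phoneme symbols, viseme names and the `PHONEME_TO_VISEME` table are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Phoneme:
    symbol: str
    start: float     # start time in seconds
    duration: float  # length in seconds


# Hypothetical lookup table; a real system would cover a full phoneme set.
PHONEME_TO_VISEME = {
    "h": "open_small",
    "eh": "open_mid",
    "l": "tongue_up",
    "ow": "round",
    "sil": "rest",
}


def map_and_prune(phonemes, min_duration=0.04):
    """Map phonemes to (viseme, start) pairs, dropping unnecessary ones.

    Assumed criteria: too short to be seen, or same mouth shape as the
    previously kept viseme.
    """
    kept = []
    for p in phonemes:
        viseme = PHONEME_TO_VISEME.get(p.symbol, "rest")
        if p.duration < min_duration:
            continue  # too brief to register visually
        if kept and kept[-1][0] == viseme:
            continue  # mouth is already in this shape
        kept.append((viseme, p.start))
    return kept
```

Because each kept viseme retains its own start time, dropping a neighbour does not shift the remaining visemes relative to the audio.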
Preferably step (f) of the method includes processing said mark-up instruction stream so as to generate an animation stream comprising animation instructions for generating at least one predetermined animation. For the avoidance of doubt, in the context of the present invention an animation is a sequence of changes in an image, showing what appears to be a movement in an object or objects. The animation instructions in the generated animation stream comprise one or more pregenerated sets of keyframes. A keyframe is a value or set of values that defines the position or shape of models of the said object(s). Preferably the animation instructions comprise sequences of positional keyframes defining gross head and body movement of the character having the animated face, relative to the space in which the character is located.
In a similar fashion to the way in which random facial movement and expression can be added to the animated image by adding randomly selected morph targets to the modified morph target series, the method may include introducing random animations to the animated image by adding to the animation stream additional, randomly selected, animation instructions for generating random movement in the animated character, for example random head or body movements.
Step (f) preferably further includes processing the modified morph target stream so as to animate or "morph" each morph target therein. It will be appreciated that each morph target comprises a set of predetermined keyframes ("morph target keyframes") whereby each morph target can be animated individually. Step (f) advantageously further includes interpolating positions between morph target keyframes specified by the modified morph target stream so as to achieve smooth and realistic mouth movement in the character's face in the final animated image. Preferably, step (f) also includes interpolating between the positional keyframes in the animation stream so as to achieve smooth motion in the final animated image, for example smooth face, head and body movement in the animated character.
The interpolation between keyframes may be carried out using one or more suitable smoothing processes to remove or reduce any discontinuities in the animated image. In the present invention, the preferred interpolation technique uses spline-based smoothing, particularly Hermite splines. This is particularly effective in simulating authentic human motion.
Steps (f) and (g) of the method include processing said modified morph target stream, said animation stream, said audio signal stream, and said mark-up instruction stream using renderer means which generates and displays, on a display means, the animated image, based on predefined data describing a scene to be animated, said animated image being synchronized to the audio signal which the renderer also generates from the audio signal stream.
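As an illustration of Hermite-spline interpolation between keyframes, the sketch below evaluates a cubic Hermite curve through scalar keyframes given as `(time, value)` pairs, for example the weight of one morph target. The text does not say how tangents are chosen; the finite-difference (Catmull-Rom style) tangents used here are an assumption, as are the function names.

```python
from bisect import bisect_right


def hermite(p0, p1, m0, m1, s):
    """Cubic Hermite curve from p0 to p1 with tangents m0, m1, s in [0, 1]."""
    s2, s3 = s * s, s * s * s
    return ((2 * s3 - 3 * s2 + 1) * p0 + (s3 - 2 * s2 + s) * m0
            + (-2 * s3 + 3 * s2) * p1 + (s3 - s2) * m1)


def tangent(keys, i):
    """Finite-difference slope at keyframe i (one-sided at the ends)."""
    lo, hi = max(i - 1, 0), min(i + 1, len(keys) - 1)
    return (keys[hi][1] - keys[lo][1]) / (keys[hi][0] - keys[lo][0])


def sample(keys, t):
    """Interpolate time-sorted (time, value) keyframes at time t.

    Requires at least two keyframes; times outside the range are clamped.
    """
    if t <= keys[0][0]:
        return keys[0][1]
    if t >= keys[-1][0]:
        return keys[-1][1]
    times = [k[0] for k in keys]
    i = bisect_right(times, t) - 1          # segment containing t
    (t0, v0), (t1, v1) = keys[i], keys[i + 1]
    dt = t1 - t0
    # Tangents are scaled by dt because s runs over the unit interval.
    return hermite(v0, v1, tangent(keys, i) * dt, tangent(keys, i + 1) * dt,
                   (t - t0) / dt)
```

The curve passes exactly through every keyframe and has a continuous first derivative at interior keyframes, which is what removes the visible jumps that straight-line interpolation would leave at each keyframe.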
According to another aspect of the invention there is provided a system for generating an animated image of at least a head of a character which is speaking, the character's face having visible articulation matching words being spoken, the system comprising:
a first processing module for processing a first stream of marked up text comprising text to be spoken by an animated character and a plurality of mark up instructions representing behavioural features to be implemented in the animated character, so as to convert said first stream of marked up text into a second, modified, stream of marked up text by replacing recognisable ones of said mark up instructions in said first stream with predefined modified mark up instructions, wherein said first processing module includes mark up instruction recognition means comprising comparing means for comparing mark up instructions in said first stream with predetermined mark up instructions stored in a memory means accessible to the system;
a second processing module comprising randomiser means for processing said second stream of marked up text so as to insert randomly selected mark up instructions into said second stream of marked up text, said randomly selected mark up instructions representing random behavioural features to be implemented in the animated character;
a third processing module comprising text-to-speech converter means for processing said second stream of marked up text so as to produce: an audio signal stream for use in generating an audio signal representing said text being spoken; a phoneme stream representing a sequence of phonemes corresponding to successive portions of the audio signal; and a mark up instruction stream comprising mark up instructions for use in other processing modules of the system;
a fourth processing module comprising phoneme-to-viseme mapping means for processing said phoneme stream so as to produce a morph target stream representing a series of morph targets, where a morph target comprises a predefined shape of at least a portion of the face of the character, wherein each said morph target in the series comprises a viseme, where a viseme comprises a predefined shape of at least a mouth portion of the face containing a mouth of the character in a predetermined mouth shape matching a said phoneme;
a fifth processing module comprising morph target insertion means for modifying said morph target stream, using mark up instructions contained in said mark up instruction stream, so as to produce a modified morph target stream representing a modified series of morph targets comprising at least one further morph target inserted therein; and
a sixth processing module comprising rendering means for processing said modified morph target stream and said mark up instruction stream, so as to generate an animated image of at least the character's head, said animated image comprising a sequence of image frames including image frames showing the character's face in said predefined shapes corresponding to the morph targets in said modified morph target stream; and for displaying, on a display means accessible to the system in use thereof, said animated image synchronously with generating the audio signal from said audio signal stream so that the character appears to speak, the movement of the mouth portion of the face matching the phonemes in the audio signal.
The first stream of marked up text iterates through the first processing module a number of times, being further modified after each pass, before producing the second, modified stream of marked up text.
The six processing modules are incorporated in a framework means of the system. Each said module has interface means via which said module is interfaced with a respective complementary interface means provided in said framework means. Preferably, said framework means is configured to control all transfers of data signals from any one said module to another said module and to ensure that all such transfers take place via said framework means. The framework means is configured to control the operation of the system such that each said processing module is operated in turn.
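The framework arrangement described above might be sketched as follows. The `Framework` class, its `register` and `run` methods, and the representation of streams as a dictionary are hypothetical; the point illustrated is only that modules never call one another, every transfer passes through the framework, and the modules are operated in turn.

```python
class Framework:
    """Minimal sketch of a framework that mediates all data transfers."""

    def __init__(self):
        self._modules = []  # (name, process) pairs in execution order

    def register(self, name, process):
        """Attach a module; process takes a dict of streams, returns updates."""
        self._modules.append((name, process))

    def run(self, streams):
        """Operate each module in turn, passing data only via the framework."""
        streams = dict(streams)
        for _name, process in self._modules:
            # Each module sees a copy, so it cannot alter shared state directly.
            updates = process(dict(streams))
            streams.update(updates)
        return streams
```

A toy usage with two stand-in modules:

```python
fw = Framework()
fw.register("macro_expander", lambda s: {"text": s["text"].upper()})
fw.register("counter", lambda s: {"length": len(s["text"])})
result = fw.run({"text": "hello"})
```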
The phonemes in said phoneme stream are time-marked such that the phoneme stream includes timing information for each phoneme in the phoneme stream, said timing information identifying the length of each phoneme (in the final audio signal). The morph targets in said modified morph target stream are similarly time-marked. The mark up instruction stream includes timing data for specifying the correct times at which morph targets and animation instructions are to be inserted into the morph target stream and animation stream respectively. Thus, each tag in the mark up instruction stream is preferably marked as occurring at a particular time at which the event specified in the tag will take place in the final animation.
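The role of the timing information can be illustrated with a short sketch. It assumes that each phoneme carries only its length, so that start times are obtained by accumulation, and that each stream entry is a `(time, name)` pair; these representations and the function names are illustrative assumptions.

```python
import heapq


def start_times(durations):
    """Turn per-phoneme lengths into start times measured from zero."""
    starts, t = [], 0.0
    for d in durations:
        starts.append(t)
        t += d
    return starts


def merge_streams(visemes, tagged_targets):
    """Merge two time-sorted lists of (time, name) entries by time.

    When times are equal, entries from the viseme list come first.
    """
    return list(heapq.merge(visemes, tagged_targets, key=lambda e: e[0]))
```

For example, three visemes lasting 0.25, 0.5 and 0.125 seconds start at 0.0, 0.25 and 0.75 seconds, so a tag time-marked at 0.5 seconds lands between the second and third visemes in the merged stream.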
It will be appreciated that the above system, including all said processing modules thereof, can be implemented entirely in software. The software may be distributed as a computer program product, for example, or may be provided as firmware in a hardware-based system.
Thus, according to another aspect of the present invention there is provided a computer program product comprising:
a computer usable medium having computer readable code means embodied in said medium for generating an animated image of at least the head of a character which is speaking, the character's face having visible articulation matching words being spoken, said computer readable code means comprising:
program code for processing a first stream of marked up text comprising text to be spoken by an animated character and a plurality of mark up instructions representing behavioural features to be implemented in the animated character, so as to replace recognisable ones of said mark up instructions with predefined modified mark up instructions, so as to convert said first stream of marked up text into a second, modified, stream of marked up text;
program code for processing said second stream of marked up text using randomising means so as to insert additional randomly selected mark up instructions into said second stream of marked up text, said randomly selected mark up instructions representing random behavioural features to be implemented in the animated character;
program code for processing said second stream of marked up text so as to produce: an audio signal stream for generating an audio signal representing said text being spoken; a phoneme stream comprising a sequence of phonemes corresponding to successive portions of the audio signal; and a mark up instruction stream comprising mark up instructions for use in other processing phases of the method;
program code for processing said phoneme stream using phoneme-to-viseme mapping means so as to produce a morph target stream representing a series of morph targets, where a morph target comprises a predefined shape of at least a portion of the face of the character, wherein each said morph target in the series comprises a viseme, where a viseme comprises a predefined shape of at least a mouth portion of the face containing a mouth of the character in a predetermined mouth shape matching a said phoneme;
program code for modifying said morph target stream using mark-up instructions contained in said mark-up instruction stream so as to produce a modified morph target stream representing a modified series of morph targets comprising said series of visemes having at least one further morph target inserted therein; and
program code for: processing said modified morph target stream, using mark-up instructions contained in said mark-up instruction stream, so as to generate an animated image of at least the head of the character, said animated image comprising a sequence of image frames including image frames showing the character's face in said predefined shapes corresponding to the morph targets in the modified morph target stream; and displaying said animated image on a display means synchronously with generating said audio signal from said audio signal stream so that the character appears to speak, the movement of the mouth portion of the face matching the phonemes in the audio signal.