The present invention relates to a method of generating an animated character representation using a processing system and apparatus for generating an animated character representation. In particular, the present invention relates to a system that uses input data, comprising content data and presentation data, to animate a character representative of a person, such as a news reader, announcer, presenter, or the like.
Character animation has previously been achieved in a number of ways. The most basic system is standard hand drawn animation, achieved by drawing a number of sequential frames and then displaying the frames at a high rate to generate the appearance of character movement. However, the production of such frames is time consuming and requires great skill in producing the desired appearance.
More recently, character animation has been achieved using computer-based systems. However, in such systems the animation is again predetermined by an artist, and considerable skill and effort are required to produce the desired appearance.
Automated character animation has also been produced which operates by converting a standard text file into speech and then using visemes to animate a character. However, these systems suffer from the drawback that the range of movement presented by the character is limited and in particular is normally limited to the movement required to present the visemes. Any additional character movement must be added in manually at a later date and cannot be incorporated automatically. Furthermore, the characters can only demonstrate a very limited linear response to the text. Accordingly, each time the character reads the text the appearance of the character is identical. An example of such a system is described in U.S. Pat. No. 5,657,426.
This therefore does not present a very human appearance in which the specific movement of the character would vary each time the text is read out. Furthermore, when no text is being read the character is motionless, again contributing to the lack of human appeal or characterization of the character.
In accordance with a first aspect of the present invention, we provide an apparatus for generating an animated character representation, the apparatus comprising a processing system having:
an input for receiving marked-up input data including:
i content data representing speech to be presented; and,
ii presentation data representing the manner in which the speech is presented;
a processor coupled to the input for generating data according to a defined time-base, the data including:
i phoneme data generated in accordance with the content data; and,
ii viseme data generated in accordance with the phoneme data and the presentation data;
the processor being further adapted to:
iii generate audio data in accordance with the phoneme data;
iv generate image data in accordance with the viseme data; and,
v synchronise the output of the audio and image data in accordance with the defined time-base.
In accordance with a second aspect of the present invention, we provide a method of generating an animated character representation using a processing system, the method comprising:
receiving marked-up input data including:
i content data representing speech to be presented; and,
ii presentation data representing the manner in which the speech is presented;
generating data according to a defined time-base, the data including:
i phoneme data generated in accordance with the content data; and,
ii viseme data generated in accordance with the phoneme data and the presentation data;
generating audio data in accordance with the phoneme data;
generating image data in accordance with the viseme data and the presentation data; and,
synchronising the output of the audio and image data in accordance with the defined time-base.
The present invention provides a method and apparatus for generating an animated character representation. This is achieved by using marked-up data including both content data and presentation data. The system then uses this information to generate phoneme and viseme data representing the speech to be presented by the character. By providing the presentation data this ensures that at least some variation in character appearance will automatically occur beyond that of the visemes required to make the character appear to speak. This contributes to the character having a far more lifelike appearance.
The marked-up data input to the system may be manually entered, for instance by typing text at a terminal, or may be derived from a data source. This allows the system to be used for the automated presentation of information from news and data sources and the like.
The processor usually includes:
a text-to-speech processor for generating the phoneme data and the audio data;
an animation processor for generating the viseme data and the image data; and,
a parser for:
parsing the received marked-up data;
detecting predetermined content data which is to be presented in a predetermined manner;
generating presentation data representative of the predetermined manner; and,
modifying the received marked-up data with the generated presentation data.
The use of specialised text-to-speech and animation processors allows the system to generate the audio and image data in real time, thereby speeding up the character animation process. The audio and image data can be generated at the same time or at different times, and/or in different locations, as required. It will be appreciated that the text-to-speech and animation processors may be implemented as software within a single processor, or may alternatively be implemented as separate hardware components.
Parsing the received marked-up data allows presentation data to be added, which in turn allows data which has only minimal or no mark-up to be processed by the present invention. This also allows predetermined content to be represented in a predetermined manner. Furthermore, this allows the animated character to stress certain words, such as numbers, names, nouns and negatives, although this is not essential to the present invention.
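By way of illustration only, the parser's step of detecting predetermined content and adding presentation data to it might be sketched as follows. The pattern list stands in for the contents of the store, and the tag name "emphasis" and the function name are assumptions made for this example; they are not taken from the specification. The sketch handles plain text only.

```python
import re

# Hypothetical store contents: patterns for content that is to be presented
# in a predetermined manner, paired with the (assumed) tag to apply.
PREDETERMINED = [
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "emphasis"),                    # numbers
    (re.compile(r"\b(?:not|no|never)\b", re.IGNORECASE), "emphasis"),  # negatives
]

def add_presentation_markup(text: str) -> str:
    """Wrap each piece of predetermined content in a presentation tag."""
    for pattern, tag in PREDETERMINED:
        # Bind the tag as a default argument so each pattern uses its own tag.
        text = pattern.sub(lambda m, t=tag: f"<{t}>{m.group(0)}</{t}>", text)
    return text
```

In this sketch, input with no mark-up at all leaves the parser with stress marked on numbers and negatives, which later stages can treat like any other presentation data.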
The processing system will usually include a store for storing data, the parser being coupled to the store to obtain an indication of the predetermined content data therefrom. This allows information concerning the mark-up to be added to be stored centrally so that it can be accessed directly by the parser. Alternatively, the information may be obtained via a communications system, such as a LAN (Local Area Network) or the like, from a remote store.
Typically the apparatus includes a linguistic processor adapted to:
parse the content data;
determine the phonemes required to represent the content data; and,
generate phoneme time references for each of the phonemes, the phoneme time reference indicating the time at which the respective phoneme should be presented with respect to the time base.
It is preferable to use phonemes to generate the audio data to be presented by the animated character as this allows a small number of elemental sound units to represent the majority of sounds that would need to be made by the character to present the speech. Additionally, processing systems for determining phonemes from text are well known and readily implementable.
Furthermore, the generation of the phoneme time references allows the temporal location of each of the phonemes to be maintained as well as enabling the synchronization of remaining steps in the procedure.
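A minimal sketch of how phoneme time references might be generated is given below. It assumes a fixed duration for every phoneme and a fixed pause between words, neither of which is specified above; a practical linguistic processor would be expected to derive durations from the speech itself. All names are hypothetical.

```python
from dataclasses import dataclass

PHONEME_MS = 80   # assumed fixed duration of every phoneme
WORD_GAP_MS = 40  # assumed pause inserted between words

@dataclass(frozen=True)
class TimedPhoneme:
    phoneme: str
    time_ms: int  # phoneme time reference, measured from the start of the time base

def phoneme_time_references(words):
    """Assign a time reference to each phoneme.

    `words` is a list of words, each already resolved to a list of phonemes.
    """
    timeline, t = [], 0
    for phonemes in words:
        for p in phonemes:
            timeline.append(TimedPhoneme(p, t))
            t += PHONEME_MS
        t += WORD_GAP_MS  # leave a gap before the next word
    return timeline
```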
Typically the linguistic processor is further adapted to:
parse the presentation data;
generate a number of tags representing the presentation data; and,
generate tag time references for each of the tags, the tag time reference indicating the time at which the respective tag should modify the manner of presentation with respect to the time base.
The use of tag time references allows the temporal position of the presentation data to be maintained relative to the phoneme data. Alternatively, other synchronisation techniques could be used.
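The generation of tag time references could, for example, be sketched as below. The sketch assumes that presentation data appears as simple angle-bracket tags without attributes and that every word occupies the same amount of time; both are simplifications adopted for illustration, and the names are hypothetical.

```python
import re

WORD_MS = 300  # assumed uniform duration of each spoken word

# A token is either a tag such as <happy> or </happy>, or a run of word characters.
TOKEN = re.compile(r"<(/?)(\w+)>|([^\s<>]+)")

def tag_time_references(marked_up):
    """Return (words, tags).

    words: list of (time_ms, word)
    tags:  list of (time_ms, tag_name, is_opening)
    """
    t, words, tags = 0, [], []
    for m in TOKEN.finditer(marked_up):
        closing, name, word = m.groups()
        if name:
            # A tag takes effect at the current position on the time base.
            tags.append((t, name, closing == ""))
        else:
            words.append((t, word))
            t += WORD_MS
    return words, tags
```

Because the words and the tags are stamped against the same clock, a tag that opens before a word carries the same time reference as that word.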
Usually the linguistic processor is coupled to the store to obtain an indication of the phonemes required to represent respective words. In this case, the indication is usually in the form of a set of rules specifying how the phonemes should be determined from the text. A dictionary can also be provided for exceptions that do not fit within these more general rules. This provides a simple technique for obtaining the phonemes based on the received data. However, any technique used in the art could be employed.
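The combination of general rules with an exception dictionary might be sketched as follows. The rule table is a deliberately tiny, invented letter-to-sound mapping and the phoneme symbols are illustrative only; neither represents the rule set actually held in the store.

```python
# Hypothetical exception dictionary for words the general rules get wrong.
EXCEPTIONS = {"one": ["W", "AH", "N"], "two": ["T", "UW"]}

# Hypothetical letter-to-sound rules; multi-letter rules are listed first
# so that they take priority over single letters.
RULES = [
    ("sh", "SH"), ("th", "TH"), ("ee", "IY"),
    ("b", "B"), ("d", "D"), ("e", "EH"), ("i", "IH"),
    ("n", "N"), ("p", "P"), ("s", "S"), ("t", "T"),
]

def to_phonemes(word):
    """Return the phonemes for a word, consulting exceptions before rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    out, i = [], 0
    while i < len(word):
        for letters, phoneme in RULES:
            if word.startswith(letters, i):
                out.append(phoneme)
                i += len(letters)
                break
        else:
            # No rule matched at this position.
            raise ValueError(f"no rule for {word[i]!r} in {word!r}")
    return out
```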
The text-to-speech processor preferably includes a concatenation processor adapted to:
determine phoneme data representing each of the phonemes; and,
concatenate the phoneme data in accordance with the phoneme time references to generate audio data representing the speech.
The use of a specialised concatenation processor ensures that the phoneme data, which is usually obtained from the store, can be readily combined to form the required audio data.
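As an illustrative sketch, concatenation in accordance with the phoneme time references could amount to placing stored sample data for each phoneme at the sample offset given by its time reference. The sample rate, the representation of phoneme data as lists of samples, and the summing of overlapping samples are all assumptions made for this example.

```python
SAMPLE_RATE = 8000  # assumed samples per second

def concatenate(timed_phonemes, units):
    """Build audio samples from timed phonemes.

    timed_phonemes: list of (phoneme, time_ms)
    units: mapping from phoneme to its stored list of samples
    """
    placed, total = [], 0
    for phoneme, time_ms in timed_phonemes:
        start = time_ms * SAMPLE_RATE // 1000  # time reference -> sample index
        samples = units[phoneme]
        placed.append((start, samples))
        total = max(total, start + len(samples))
    audio = [0.0] * total
    for start, samples in placed:
        for k, s in enumerate(samples):
            audio[start + k] += s  # overlapping units are simply summed
    return audio
```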
Furthermore, the concatenation processor is also adapted to modify the audio or phoneme data in accordance with the presentation data. This allows the audible voice of the character to be controlled in conjunction with the character's appearance. Thus, for example, a different tone, pitch and speed of speech can be used depending on whether the character is supposed to be happy, sad, serious or the like. Alternatively however, the audible voice may remain unchanged irrespective of the character appearance. A further alternative is for separate voice modifications to be specified in the data file, independent of the presentation data.
The animation processor preferably includes a phoneme processor adapted to:
obtain the determined phonemes, and the associated phoneme time references, from the linguistic processor;
determine visemes corresponding to each of the determined phonemes; and,
determine a viseme time reference for each viseme in accordance with the phoneme time reference of the corresponding phoneme.
As there are only a limited number of phonemes (approximately 48) and a limited number of visemes (approximately 20), it is relatively easy to convert each phoneme into a corresponding viseme. In this case, using viseme time references corresponding to the phoneme time references advantageously ensures the synchronisation of the visemes with the phonemes. This in turn ensures that lip motion is synchronised with the production of sound to achieve lip sync.
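The conversion performed by the phoneme processor can be sketched as a simple many-to-one table look-up in which each viseme inherits the time reference of its phoneme. The table entries and viseme names below are invented for illustration and cover only a handful of phonemes.

```python
# Hypothetical many-to-one mapping; several phonemes share one mouth shape.
PHONEME_TO_VISEME = {
    "P": "closed_lips", "B": "closed_lips", "M": "closed_lips",
    "F": "lip_to_teeth", "V": "lip_to_teeth",
    "AA": "open_jaw", "OW": "rounded_lips",
}

def visemes_for(timed_phonemes):
    """Map (phoneme, time_ms) pairs to (viseme, time_ms) pairs.

    The viseme time reference is copied from the phoneme time reference,
    which keeps the mouth shapes aligned with the audio.
    """
    return [(PHONEME_TO_VISEME.get(p, "neutral"), t) for p, t in timed_phonemes]
```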
The animation processor usually also includes a viseme processor coupled to the store, the viseme processor being adapted to obtain viseme data from the store in accordance with the determined visemes, the viseme data including a number of parameters representing the variation required from a base character image to represent the respective viseme. The use of data representing the variation from a base face allows a wide range of facial configurations to be implemented without requiring the amount of processing power required to generate the representation from scratch for each face. This helps speed up the processing time, allowing the image data to be generated in real time as the content data is “read” by the character.
Preferably, the animation processor includes at least one modification processor adapted to modify the viseme data in accordance with the presentation data. By modifying the viseme data, this helps vary the appearance of the character to make the character appear more lifelike. This is typically achieved by modifying the parameters of the viseme data in accordance with modification data obtained from the store.
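A minimal sketch of viseme data held as variations from a base face, and of its modification in accordance with presentation data, is given below. The parameter names, the numerical values and the additive way in which modifications are combined are assumptions for this example only.

```python
# Hypothetical base character parameters.
BASE_FACE = {"jaw_open": 0.0, "mouth_width": 0.5, "mouth_corner_lift": 0.0}

# Hypothetical viseme data: only the variation from the base face is stored.
VISEME_DELTAS = {
    "open_jaw": {"jaw_open": 0.5},
    "closed_lips": {"mouth_width": -0.125},
}

# Hypothetical modification data selected by the presentation data.
MODIFICATIONS = {
    "happy": {"mouth_corner_lift": 0.25, "mouth_width": 0.125},
}

def apply_deltas(base, *deltas):
    """Return a new parameter set: the base face plus each delta in turn."""
    face = dict(base)  # copy so the base face itself is never altered
    for delta in deltas:
        for name, change in delta.items():
            face[name] += change
    return face
```

For example, apply_deltas(BASE_FACE, VISEME_DELTAS["open_jaw"], MODIFICATIONS["happy"]) yields an open-jawed face with raised mouth corners, without any full face description having been stored for that combination.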
The animation processor usually includes at least one modification processor adapted to modify at least one of a specified expression, behaviour, and action. This allows different aspects of the character's appearance to be altered.
Preferably, a respective processor is implemented for modifying the behaviour, expression and actions separately. This allows more general appearances such as overall head movements to be controlled separately to specific appearances such as smiling, frowning or the like. Thus, the general appearance may be sad in which case the character may look generally upset with a down-turned mouth, or the like. A specific appearance however may be a laugh or smile and thus, even though the character has the overall appearance of being sad, this still allows a smile to be generated. Accordingly, this allows for detailed modification of the character's appearance as required thereby aiding the generation of a realistic image.
Achieving this by progressively modifying the parameters of the viseme data allows the action, expression and behaviour modifications to be implemented without undue complications. Alternatively however, separate image sequences representing the visemes, the expressions, the actions and the behaviours could be generated and then combined at a later stage.
Typically the or each modification processor is further adapted to modify the viseme data in accordance with pseudo-random data. This allows random head movement, or facial appearances to be included in the system thereby ensuring that the character animation would be non-identical even if based on the same input data file for any two successive animations. This helps reduce the repeating of certain word, phrase, appearance combinations, thereby helping to add to the naturalistic appearance of the animated character.
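Modification in accordance with pseudo-random data might be sketched as a small bounded offset applied to each parameter. The offset range and the parameter names are assumptions; the generator is passed in explicitly so that the example is repeatable, whereas in use a differently seeded generator would be supplied for each animation.

```python
import random

def randomise(params, rng, amount=0.05):
    """Return a copy of params with each value offset by at most `amount`."""
    return {name: value + rng.uniform(-amount, amount)
            for name, value in params.items()}
```

Two runs over the same input file with different generators then give slightly different head positions and facial appearances, so that successive animations are not identical.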
The animation processor usually further includes an interpolation processor for interpolating the viseme data to determine the appearance of the character at times between the specified visemes. This allows a continuous sequence of images to be generated.
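The interpolation step could be sketched as below. Linear interpolation is assumed purely for simplicity, since the form of interpolation is not specified above, and the keyframes are assumed to be sorted with strictly increasing times.

```python
def interpolate(keyframes, t):
    """Return the parameter set at time t (milliseconds).

    keyframes: list of (time_ms, params) sorted by strictly increasing time,
    where every params dict has the same keys.
    """
    if t <= keyframes[0][0]:
        return dict(keyframes[0][1])   # before the first viseme: hold it
    if t >= keyframes[-1][0]:
        return dict(keyframes[-1][1])  # after the last viseme: hold it
    for (t0, p0), (t1, p1) in zip(keyframes, keyframes[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)   # fraction of the way from p0 to p1
            return {k: p0[k] + w * (p1[k] - p0[k]) for k in p0}
```

Sampling this function once per output frame gives the render processor a parameter set for every image, not only for the instants at which a viseme is specified.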
A render processor is coupled to the interpolation processor for generating image data in accordance with the interpolated viseme data, the image data representing the character presenting the speech defined by the content data. In particular, if the processing system further includes a video processor, the render processor may form part of the video processor. This allows the image data to be rendered in real time without using up the resources of the main processor, thereby helping implement the invention in real time. The render processor may alternatively be implemented as software within the main processor itself, if sufficient resources are available.
Typically the video processor also generates video data representing the animated character sequence. This advantageously allows the animated character to be displayed either as image data or as video data allowing it to be displayed on a wide range of different display devices.
Optionally the system can further include a communications network interface, which in use couples the computing device to a communications network, thereby allowing the animated character representation to be transferred to other processing systems coupled to the communications network.
In this case, the input can be adapted to receive marked-up data from the communications network, allowing externally generated mark-up files to be used in the generation of an animated character sequence.
Typically the data file is an XML (eXtensible Markup Language) file. This is particularly advantageous as it allows presentation data to be specified within the XML file as XML mark-up. Accordingly, the content data which is used to control the appearance of the character can be annotated with appropriate elements defining presentation characteristics which should be implemented as the respective words are spoken.
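By way of example only, an XML data file in which the content is annotated with presentation elements could be read as sketched below, using the ElementTree parser from the Python standard library. The element names "speech" and "happy" are invented for this illustration and do not represent a defined schema.

```python
import xml.etree.ElementTree as ET

def words_with_presentation(xml_text):
    """Return a list of (word, active_tags) pairs in spoken order.

    active_tags is a tuple of the presentation elements enclosing the word.
    """
    root = ET.fromstring(xml_text)
    out = []

    def walk(elem, active):
        # Text directly inside this element is governed by `active`.
        if elem.text:
            out.extend((w, active) for w in elem.text.split())
        for child in elem:
            walk(child, active + (child.tag,))
            # Text following a child belongs to the parent, not the child.
            if child.tail:
                out.extend((w, active) for w in child.tail.split())

    walk(root, ())
    return out
```

Each word of content data is thereby paired with the presentation characteristics in force while it is spoken.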
The system may be implemented on a stand alone processing system. Alternatively the system may be implemented on a communications network, such as the Internet, a local or wide area network (LAN or WAN), or the like, so that the images can be generated centrally and viewed remotely.