It is known that the delivery of synthesised or recorded speech messages can be enhanced by the use of an animated picture of the sender, or by displaying at least the head of an avatar created to resemble the sender, in both cases with only the lips moving in synchrony with the reproduced speech. Where a picture of the sender is used, the impression of lip movement is created by displaying what is known as a “viseme”, which is an image of a human face (for example that of the message sender) whose lips are in one of a number of identifiable shapes, each representing a lip shape associated with one or more phonemes. Phonemes are, of course, well known in the art and are the individual discrete sounds used within a language. It is estimated that there are approximately 44 phonemes in the English language, but perhaps as few as twenty or so visemes. It is therefore possible to display the same viseme when reproducing any one of several phonemes.
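The many-to-one relationship between phonemes and visemes described above can be sketched as a simple lookup table. The phoneme symbols and viseme labels below are illustrative assumptions for the sketch, not a standardised inventory.

```python
# Illustrative many-to-one mapping from phonemes to visemes.
# Both the phoneme symbols (ARPAbet-like) and the viseme labels
# are assumptions for this sketch, not a standardised inventory.
PHONEME_TO_VISEME = {
    "p": "lips_closed", "b": "lips_closed", "m": "lips_closed",
    "f": "lip_teeth",   "v": "lip_teeth",
    "aa": "open_wide",  "ae": "open_wide",
    "uw": "rounded",    "ow": "rounded",
    "s": "narrow",      "z": "narrow",
}

def visemes_for(phonemes):
    """Map a phoneme sequence to the viseme displayed for each phoneme."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]
```

Note that distinct phonemes such as "p", "b" and "m" all map to the same `lips_closed` viseme, which is why a viseme set of only twenty or so shapes can cover all 44 or so English phonemes.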
In operation, a speech reproducer such as a speech synthesiser outputs acoustic waveforms corresponding to a sequence of phonemes, and at the same time a display means displays to the user the viseme associated with the particular phoneme being reproduced at that time. The user thereby obtains the illusion of an image of the sender whose lips appear to move in synchrony with the reproduced speech. It should be noted that here the visemes are two-dimensional images of the sender.
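The operation just described, in which the display is updated with each phoneme's viseme while the corresponding waveform plays, can be sketched as a simple timing loop. The callbacks `show_viseme` and `play_phoneme` are hypothetical stand-ins for the display means and the speech reproducer.

```python
import time

def lip_sync(timed_phonemes, viseme_of, show_viseme, play_phoneme):
    """Show the viseme for each phoneme while its waveform is reproduced.

    timed_phonemes: list of (phoneme, duration_in_seconds) pairs.
    viseme_of:      maps a phoneme to its viseme identifier.
    show_viseme / play_phoneme: hypothetical display and audio callbacks
    standing in for the display means and the speech reproducer.
    """
    for phoneme, duration in timed_phonemes:
        show_viseme(viseme_of(phoneme))   # update the 2D face image
        play_phoneme(phoneme)             # start the acoustic waveform
        time.sleep(duration)              # hold until the phoneme ends
```

Because each frame is a pre-rendered two-dimensional image selected by lookup, the per-phoneme cost is essentially constant, which is the source of the method's low computational intensity.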
The alternative method known in the prior art, as mentioned above, is to produce either a whole-body avatar, or at least a three-dimensional virtual model of the sender's head, which is then shaped and textured to look like the sender. The lips of the head model can then be controlled to move in synchrony with the reproduced speech, such that the lips of the model assume the appropriate shape for the particular phoneme being reproduced at any particular time. However, such systems involve complex head modelling, in which a virtual wire frame is reshaped by difficult image processing or invasive sensing, and require a process in which a still picture is accurately conformed to the given model. It therefore remains difficult to produce head models without invasive sensing or scanning of the person whose model is to be created, such as, for example, in a specialist avatar creation booth such as those provided by Avatar-Me Ltd, a United Kingdom limited company no. 03560745. Furthermore, once a 3D model has been obtained, the computation required to achieve the illusion of the model speaking to a user is high, and such systems are not presently suitable for implementation on mobile devices, such as mobile telephones, personal digital assistants, or the like.
The first of the aforementioned methods, that of displaying a sequence of two-dimensional visemes in synchrony with the reproduced speech, does not suffer from the computational intensity problems of the second, but does suffer from the problem that the displayed image can appear almost robotic to the viewer: stale, automated, and not life-like. This is because the only movement apparent to the viewer is the movement of the lips into the viseme shape corresponding to the phoneme presently being reproduced. Such movement does not correspond to the natural movement of a human being while talking, as it has been observed that most human beings also make very small head movements while speaking (see M. Gillies, N. Dodgson & D. Ballin, ‘Autonomous secondary gaze behaviour’, Proceedings of the AISB2002 symposium on Animating Expressive Characters for Social Interactions, ISBN 1902956256), but such head movements are difficult to recreate artificially. Whilst it would be possible to modify the second of the aforementioned methods (i.e. the 3D avatar model) to cause the model to move slightly in accordance with the observed human behaviour, such movement of course brings with it the same problems of high computational intensity already discussed. To address this problem, it would therefore be advantageous if the first of the aforementioned methods (i.e. the two-dimensional viseme method) could be modified to reproduce the observed behaviour.
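One way the two-dimensional method might be extended along these lines is to apply small, bounded pseudo-random offsets to the position at which each viseme image is drawn, imitating the small head movements observed in human speakers. The magnitudes below are illustrative assumptions, not measured values, and the drawing step is only indicated in a comment.

```python
import random

def head_jitter(step=0.3, limit=2.0):
    """Yield small (dx, dy) pixel offsets that drift within +/- limit,
    imitating the small head movements humans make while speaking.
    The step and limit magnitudes are illustrative assumptions."""
    dx = dy = 0.0
    while True:
        dx = max(-limit, min(limit, dx + random.uniform(-step, step)))
        dy = max(-limit, min(limit, dy + random.uniform(-step, step)))
        yield dx, dy

# Each viseme frame would then be drawn at (x0 + dx, y0 + dy) rather
# than at a fixed (x0, y0), at negligible extra computational cost.
```

Because the offsets drift gradually rather than jumping, successive frames remain coherent, while the cost of the technique is a handful of arithmetic operations per frame, preserving the low computational intensity of the two-dimensional method.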