The present invention relates to a method for synthesizing a picture through digital processing, and more particularly, to a system for synthesizing a (still or moving) picture of a face which represents changes in the shape of the mouth accompanying the production of a speech output.
When a man utters a vocal sound, vocal information is produced by an articulator, and at the same time, his mouth moves as he utters (i.e., the shape of the mouth changes in outward appearance). A method which converts a sentence input as text into speech information and outputs it is called speech synthesis, and such methods have achieved a fair success. In contrast thereto, few reports have been published on a method for producing a picture of a face which has mouth-shape variations in correspondence to an input sentence, except the following report by Kiyotoshi Matsuoka and Kenji Kurose.
The method proposed by Matsuoka and Kurose is disclosed in a published paper [Kiyotoshi Matsuoka and Kenji Kurose: "A moving picture program for a training in speech reading for the deaf," Journal of the Institute of Electronic Information and Communication Engineers of Japan, Vol. J70-D, No. 11, pp. 2167-2171 (November 1987)].
Besides, there has also been reported, as a related prior art, a method for presuming mouth-shape variations corresponding to an input text. This method is disclosed in a published paper [Shigeo Morishima, Kiyoharu Aizawa and Hiroshi Harashima: "Studies of automatic synthesis of expressions on the basis of speech information," 4th NICOGRAPH article contest, Collection of Articles, pp. 139-146, Nihon Computer Graphics Association (November 1988)]. This article proposes a method which calculates the logarithmic mean power of input speech information and controls the opening of the mouth accordingly, and a method which calculates a linear prediction coefficient corresponding to the formant characteristic of the vocal tract and presumes the mouth shape therefrom.
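The first of the two approaches mentioned above, controlling the mouth opening by the logarithmic mean power of the speech, can be illustrated by the following minimal sketch. The frame handling, the silence/full-power thresholds, and the linear mapping onto a normalized opening are illustrative assumptions of the present applicant, not values taken from the cited article.

```python
import math

def log_mean_power(frame):
    """Logarithmic mean power (in dB) of one frame of speech samples.

    A small floor term avoids log(0) on silent frames; the exact
    floor value is an assumption for this sketch.
    """
    mean_power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_power + 1e-12)

def mouth_opening(frame, silence_db=-60.0, full_db=0.0):
    """Map the frame's log mean power linearly onto an opening in [0, 1].

    silence_db and full_db are hypothetical calibration points:
    at or below silence_db the mouth is closed, at full_db fully open.
    """
    t = (log_mean_power(frame) - silence_db) / (full_db - silence_db)
    return min(1.0, max(0.0, t))

# A loud frame yields a wider mouth opening than a quiet one.
loud = [0.5, -0.5] * 80
quiet = [0.01, -0.01] * 80
assert mouth_opening(loud) > mouth_opening(quiet)
```

Such a power-based control captures only the degree of opening, not the articulated shape of the mouth, which is why the cited article also considers linear prediction coefficients related to the vocal-tract formants.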
The method by Matsuoka and Kurose, described above as a conventional method for producing pictures of a face with mouth-shape variations corresponding to a sentence (an input text) being input, poses the following problems. Although a vocal sound and the mouth shape are closely related to each other in utterance, the method basically syllabicates the sentence and selects mouth-shape patterns on the basis of character-level correspondence; consequently, the correlation between the speech generating mechanism and the mouth-shape generation is insufficient, which makes it difficult to produce the mouth shape correctly in correspondence to the speech output. Further, although the duration of a phoneme (a minimum unit of utterance, a syllable being composed of a plurality of phonemes) varies with its connection to the preceding and following phonemes, the method by Matsuoka and Kurose fixedly assigns four frames to each syllable; consequently, it is difficult to represent natural mouth-shape variations in correspondence to the input sentence. Moreover, when outputting both the sound and the mouth-shape picture in response to the input sentence, it is difficult to match them with each other.
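The timing problem noted above can be sketched as follows: a fixed scheme gives every syllable the same number of picture frames, whereas a duration-based scheme allots frames in proportion to each phoneme's duration. All syllables, phonemes, durations, and the frame period used here are made-up values for illustration only.

```python
FRAMES_PER_SYLLABLE = 4  # fixed assignment, as in the Matsuoka-Kurose method

def fixed_frames(syllables):
    """Assign four picture frames to every syllable, regardless of duration."""
    return {s: FRAMES_PER_SYLLABLE for s in syllables}

def duration_frames(phoneme_durations_ms, frame_ms=33):
    """Assign frames in proportion to each phoneme's duration.

    frame_ms is a hypothetical frame period (about 30 frames/second);
    each phoneme receives at least one frame.
    """
    return {p: max(1, round(d / frame_ms))
            for p, d in phoneme_durations_ms.items()}

print(fixed_frames(["ka", "i"]))                      # {'ka': 4, 'i': 4}
print(duration_frames({"k": 40, "a": 120, "i": 90}))  # {'k': 1, 'a': 4, 'i': 3}
```

Under the fixed scheme a short phoneme and a long one receive identical screen time, so the displayed mouth motion cannot follow the natural rhythm of the utterance, nor stay synchronized with a speech output whose phoneme durations vary.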
The method proposed by Morishima, Aizawa and Harashima is to presume the mouth shape on the basis of input speech information, and hence cannot be applied to the production of a moving picture which has mouth-shape variations corresponding to the input sentence.