1. Field of the Invention
The present invention relates to a voice analysis device, a voice analysis program and a voice analysis method. More specifically, it relates to an image generating device employing the voice analysis method according to the present invention and, in particular, to a lip synch animation image generating device that creates animation (lip synch animation) in which the shape of the mouth is changed in accordance with voice.
2. Description of the Related Art
Voice analysis technology is currently employed in various fields. Examples are identifying a speaker by voice, converting voice into text, and generating lip synch animation in which the shape of the mouth is varied in accordance with voice. In each of these cases, the voice analysis technology extracts phonemes, i.e. the units that are employed to distinguish the meanings of words, from voice. The subsequent processing respectively involves: in the case of speaker identification, identification of a speaker using the degree of similarity between the extracted phonemes and reference patterns registered beforehand; in the case of text conversion, displaying letters corresponding to the extracted phonemes on a display or the like; and, in the case of creation of lip synch animation, displaying an image corresponding to the extracted phoneme on a display or the like.
The prior art includes the following methods of extracting phonemes from voice. For example, in the speaker identification system disclosed in Published Japanese Patent No. H6-32007, phonemes are extracted by determining, for each vowel, intervals such that the distance between a previously input reference pattern and the voice of the speaker is less than a prescribed value and establishing a correspondence between these intervals and the vowels.
Such intervals for extracting phonemes are called segments. In the animation image generating device of Laid-open Japanese Patent Application No. 2003-233389, formant analysis using, for example, Composite Sinusoidal Modeling (CSM) is conducted, and phonemes are extracted based on the formant information that characterizes vowels.
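By way of illustration only, the threshold-based determination of segments described above for the speaker identification system may be sketched as follows. This is a minimal sketch and not the implementation of the cited publication: the function names `find_segments` and `label_vowel_segments`, the per-frame distance lists, and the threshold value are all assumptions introduced for the example, and the calculation of the distance between the voice and each reference pattern is assumed to have been performed beforehand.

```python
from typing import Dict, List, Sequence, Tuple


def find_segments(distances: Sequence[float], threshold: float) -> List[Tuple[int, int]]:
    """Return half-open frame ranges [start, end) in which distance < threshold.

    `distances` holds one (hypothetical) distance value per analysis frame
    between the input voice and a single vowel's reference pattern.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, d in enumerate(distances):
        if d < threshold:
            if start is None:
                start = i  # a new segment begins at this frame
        elif start is not None:
            segments.append((start, i))  # the segment ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(distances)))  # segment runs to the last frame
    return segments


def label_vowel_segments(
    distances_by_vowel: Dict[str, Sequence[float]], threshold: float
) -> List[Tuple[int, int, str]]:
    """Associate each detected segment with the vowel whose pattern it matched."""
    labelled: List[Tuple[int, int, str]] = []
    for vowel, distances in distances_by_vowel.items():
        for start, end in find_segments(distances, threshold):
            labelled.append((start, end, vowel))
    return sorted(labelled)


# Illustrative (made-up) distances for five frames and two vowels.
example = {
    "a": [0.9, 0.2, 0.1, 0.8, 0.9],
    "i": [0.9, 0.9, 0.9, 0.3, 0.2],
}
print(label_vowel_segments(example, threshold=0.5))
# prints [(1, 3, 'a'), (3, 5, 'i')]
```

In this sketch each segment is reported as a pair of frame indices together with the vowel concerned, which corresponds to establishing a correspondence between the intervals and the vowels. Note that such an approach presupposes a reference pattern input beforehand for each vowel.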