In speech synthesis, there is a growing need not only to select a voice for reading from a small number of candidates prepared in advance, but also to newly generate a speech synthesis dictionary of the voice of a specific speaker, such as a well-known person or a familiar person, for reading a variety of text contents. To satisfy this need, a technique has been proposed in which a speech synthesis dictionary is automatically generated from speech data of an object speaker, that is, the speaker for whom the dictionary is to be generated. In addition, as a technique for generating a speech synthesis dictionary from a small amount of speech data of an object speaker, there is a speaker adaptation technique in which a model prepared in advance to represent the average characteristics of a plurality of speakers is converted so as to approach the characteristics of the object speaker, thereby generating a model of the object speaker.
A main object of conventional techniques for automatically generating a speech synthesis dictionary is to reproduce the voice and speaking manner of an object speaker as closely as possible. However, the object speakers for whom dictionaries are generated include not only professional narrators and voice actors but also general speakers who have never received voice training. For this reason, when the utterance skill of an object speaker is low, the low skill is faithfully reproduced, resulting in a speech synthesis dictionary that is hard to use in some applications.
In addition, there is also a need to generate a speech synthesis dictionary with the voice of an object speaker not only in the native language of the object speaker but also in a foreign language. To satisfy this need, if speech of the object speaker reading the foreign language can be recorded, a speech synthesis dictionary of that language can be generated from the recorded speech. However, when a speech synthesis dictionary is generated from recorded speech that includes phonation that is incorrect for the language, or unnatural phonation with an accent, the characteristics of that phonation are reflected in the speech synthesis dictionary. Accordingly, native speakers who listen to speech synthesized with the speech synthesis dictionary may be unable to understand it.