Artificially generating a speech signal from arbitrary text is called “text to speech synthesis”. Text to speech synthesis is generally performed in three stages: a language processing section, a prosodic processing section, and a speech synthesis section.
An incoming text is first input to the language processing section, which performs morphological analysis, syntactic analysis, and the like. The analyzed text is then forwarded to the prosodic processing section, which processes accent and intonation and outputs a phonetic sequence and prosodic information, e.g., fundamental frequency and phonetic duration. Finally, the speech synthesis section generates speech waveforms from the phonetic sequence/prosodic information.
One speech synthesis method is of the unit selection type: with the provided phonetic sequence/prosodic information set as a target, a speech unit sequence is selected from a large number of previously stored speech units and used for speech synthesis. In one unit selection technique, the degree of distortion that the synthesis process introduces into the synthesized speech is defined as a cost function, and the unit sequence is selected so as to reduce the cost. For example, distortions are quantified as costs, and a speech unit sequence is selected based on these costs. Here, the distortions include a target distortion, representing the difference between the target speech and a candidate speech unit in terms of prosody, phoneme environment, and the like, and a concatenation distortion, caused by concatenating consecutive speech units. The speech unit sequence thus selected is used to generate synthesized speech. By selecting an appropriate speech unit sequence from a large number of speech units, unit selection type speech synthesis can generate synthesized speech with less of the loss of sound quality that modifying and concatenating speech units often causes.
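The cost-based selection described above can be illustrated with a minimal sketch. It assumes a toy representation that is not part of the original text: each target and each stored unit carries only a phoneme label, a fundamental frequency, and a duration, and the names `Target`, `Unit`, `target_cost`, `concat_cost`, `select_units`, and the weight `w_dur` are hypothetical. The target distortion is modeled as a prosodic mismatch, the concatenation distortion as a fundamental frequency jump at the join, and the sequence with the lowest total cost is found by dynamic programming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """One segment of the target phonetic sequence/prosodic information."""
    phoneme: str
    f0: float        # target fundamental frequency (Hz)
    duration: float  # target phonetic duration (s)


@dataclass(frozen=True)
class Unit:
    """A previously stored speech unit, reduced to toy features (assumption)."""
    phoneme: str
    f0_start: float  # fundamental frequency at the start of the unit (Hz)
    f0_end: float    # fundamental frequency at the end of the unit (Hz)
    duration: float  # duration of the unit (s)


def target_cost(t: Target, u: Unit, w_dur: float = 100.0) -> float:
    """Toy target distortion: F0 mismatch plus weighted duration mismatch."""
    mean_f0 = 0.5 * (u.f0_start + u.f0_end)
    return abs(t.f0 - mean_f0) + w_dur * abs(t.duration - u.duration)


def concat_cost(prev: Unit, cur: Unit) -> float:
    """Toy concatenation distortion: F0 jump at the join."""
    return abs(prev.f0_end - cur.f0_start)


def select_units(targets, inventory):
    """Return (unit sequence, total cost) minimizing the summed costs."""
    if not targets:
        return [], 0.0
    # Candidates for each segment are the stored units with a matching phoneme.
    candidates = [[u for u in inventory if u.phoneme == t.phoneme]
                  for t in targets]
    if any(not c for c in candidates):
        raise ValueError("no candidate unit for at least one target segment")

    # best[i][j] = (lowest cumulative cost ending in candidate j, backpointer)
    best = [[(target_cost(targets[0], u), -1) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total
```

Because the concatenation cost couples neighboring choices, the unit with the lowest target cost in a segment is not necessarily part of the selected sequence; the dynamic programming search weighs both distortions over the whole sequence.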
Another speech synthesis method selects a plurality of speech units for each segment (Tatsuya Mizutani, and Takehiko Kagoshima, “Speech synthesis based on selection and fusion of a multiple unit”, The Proceedings of 2004 Spring Meeting of the Acoustical Society of Japan, March 2004, Paper 1-7-3, pp. 217-218). That is, with the provided phonetic sequence/prosodic information set as a target, a plurality of speech units are selected, based on the degree of distortion observed in the synthesized speech, for each synthesis unit segment obtained by partitioning the phonetic sequence. The selected speech units are fused to generate a new speech unit, and the resulting speech units are concatenated for speech synthesis.
An exemplary technique of unit fusion is pitch-cycle waveform averaging. With this technique, the synthesized speech can be made more stable while still sounding like a human voice. This is because the technique reduces the loss of sound quality that often occurs in unit selection based speech synthesizers, caused by a mismatch between the targeted phonetic sequence/prosodic information and the selected speech unit sequence, or by a discontinuity between two consecutive speech units.
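As a rough illustration of pitch-cycle waveform averaging, the following sketch fuses several speech units into one. It rests on assumptions not stated in the text: each unit is already decomposed into a list of pitch-cycle waveforms (plain lists of samples), the number of cycles is equalized by repeating or dropping cycles, and shorter waveforms are zero-padded at the end before averaging. The function names `resample_cycles` and `fuse_units` are hypothetical, and a practical system would typically align the waveforms at their pitch marks rather than at their first sample.

```python
def resample_cycles(cycles, n):
    """Pick n pitch-cycle waveforms from `cycles`, repeating or dropping
    cycles as needed (simple index mapping, an assumption of this sketch)."""
    m = len(cycles)
    return [cycles[min(m - 1, (i * m) // n)] for i in range(n)]


def fuse_units(units, n_cycles):
    """Fuse speech units by averaging their pitch-cycle waveforms.

    units:    list of units; each unit is a non-empty list of pitch-cycle
              waveforms; each waveform is a list of float samples.
    n_cycles: number of pitch cycles in the fused unit.
    """
    if not units or any(not u for u in units):
        raise ValueError("every unit needs at least one pitch-cycle waveform")
    if n_cycles <= 0:
        raise ValueError("n_cycles must be positive")

    # Give every unit the same number of pitch cycles.
    aligned = [resample_cycles(u, n_cycles) for u in units]

    fused = []
    for i in range(n_cycles):
        group = [a[i] for a in aligned]
        length = max(len(w) for w in group)
        total = [0.0] * length  # shorter waveforms are implicitly zero-padded
        for w in group:
            for k, sample in enumerate(w):
                total[k] += sample
        fused.append([s / len(group) for s in total])
    return fused
```

For example, fusing a two-cycle unit `[[1.0, 1.0], [2.0, 2.0]]` with a one-cycle unit `[[3.0, 3.0, 3.0]]` into two cycles repeats the single cycle of the second unit and averages it with each cycle of the first.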
As a power control technique for synthesized speech, a speech synthesis method has been proposed (refer to JP-A-2001-282276) in which a speech unit is segmented at phoneme boundaries, the power is estimated for each segment, and the power of the speech unit is changed based on the estimated power. In the power estimation, a pre-calculated parameter, such as a coefficient of quantification theory type I, may be used to estimate the power.
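The segment-wise power change described above can be sketched as follows. This is only an illustration under stated assumptions, not the method of the cited publication: power is taken to be the mean square of the samples, the estimated power of each segment is supplied by the caller rather than computed from pre-calculated coefficients, and the names `mean_power` and `apply_segment_power` are hypothetical.

```python
import math


def mean_power(samples):
    """Mean-square power of a non-empty list of samples."""
    return sum(s * s for s in samples) / len(samples)


def apply_segment_power(samples, boundaries, target_powers):
    """Scale each segment of a speech unit to its estimated power.

    samples:       waveform of the speech unit (list of floats)
    boundaries:    sample indices of the phoneme boundaries inside the unit,
                   in increasing order
    target_powers: estimated power for each of the len(boundaries) + 1
                   segments (assumed to be given)
    """
    edges = [0, *boundaries, len(samples)]
    if len(target_powers) != len(edges) - 1:
        raise ValueError("need one target power per segment")

    out = []
    for start, end, target in zip(edges, edges[1:], target_powers):
        segment = samples[start:end]
        power = mean_power(segment) if segment else 0.0
        # Amplitude gain is the square root of the power ratio;
        # silent or empty segments are left unchanged.
        gain = math.sqrt(target / power) if power > 0 else 1.0
        out.extend(s * gain for s in segment)
    return out
```

In this sketch the gain changes abruptly at each phoneme boundary; a practical system would likely smooth the gain across the boundary to avoid introducing a new discontinuity.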
In unit selection based speech synthesizers, the optimum speech unit, i.e., the one that minimizes the cost function, is selected from a large number of speech units, but the power of the selected speech unit is not always appropriate. As a result, power discontinuities become noticeable, and the sound quality of the synthesized speech is degraded. In plural-unit-selection based speech synthesizers, increasing the number of speech units used for unit fusion stabilizes the power of the resulting synthesized speech. However, the fused speech unit is then generated from many speech units with differing sound quality characteristics, which increases sound distortion. Worse still, fusing speech units whose power differs considerably from the appropriate power may cause a loss of sound quality.
Furthermore, in a speech synthesis method that includes power estimation and uses a pre-calculated parameter for power control, it is difficult to perform power control while appropriately reflecting the power information of a large number of speech units. Such a method may produce a mismatch between the estimated power and the selected speech unit.
In consideration of the above problems, an object of the present invention is to provide a speech synthesis system and method that, in speech synthesis that selects one speech unit or a plurality of speech units, implement high-quality speech synthesis with natural and stable speech unit power in segments of a phonetic sequence, while appropriately reflecting the power information of a large number of speech units.