In speech synthesis, it is important to synthesize speech with an accurate and natural accent. Concatenative speech synthesis is one known speech synthesis technology that addresses this need. This technology generates synthesized speech by selecting, from a speech segment database, speech segments whose prosody is similar to the target prosody predicted using a prosody model, and concatenating them. The first advantage of this technology is that, in a portion where appropriate speech segments are selected, it can provide high sound quality and naturalness close to those of a recorded human voice. In particular, in a portion where speech segments that were originally continuous in the speaker's original speech (continuous speech segments) can be used for the synthesized speech directly in the concatenated sequence, fine tuning (smoothing) of the prosody is unnecessary, and therefore the best sound quality with a natural accent is achieved.
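The selection step described above can be illustrated with a minimal sketch. It assumes a conventional unit-selection formulation that the text does not spell out: a target cost (distance between a segment's F0 and the target F0), a concatenation cost (F0 jump at a join, zero when two segments were adjacent in the recording, that is, continuous speech segments), and a dynamic-programming search. The names `Segment`, `target_cost`, `concat_cost`, and `select_segments`, and all numeric values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    unit: str   # speech unit label (for example, a mora)
    f0: float   # mean fundamental frequency in Hz
    utt: int    # id of the recorded utterance the segment came from
    pos: int    # position of the segment inside that utterance


def target_cost(seg, target_f0):
    # Assumed cost: absolute F0 distance to the predicted target prosody.
    return abs(seg.f0 - target_f0)


def concat_cost(prev, cur):
    # Continuous speech segments (adjacent in the recording) join for free.
    if prev.utt == cur.utt and prev.pos + 1 == cur.pos:
        return 0.0
    # Assumed cost otherwise: the F0 jump at the join.
    return abs(prev.f0 - cur.f0)


def select_segments(targets, database):
    """targets: list of (unit, target_f0). Returns the minimum-cost segments."""
    cands = [[s for s in database if s.unit == unit] for unit, _ in targets]
    if any(not c for c in cands):
        raise ValueError("no candidate segment for some unit")
    costs = [target_cost(s, targets[0][1]) for s in cands[0]]
    back = []  # back[i-1][k]: best predecessor index for candidate k at step i
    for i in range(1, len(targets)):
        new_costs, pointers = [], []
        for cur in cands[i]:
            joined = [costs[k] + concat_cost(p, cur)
                      for k, p in enumerate(cands[i - 1])]
            best = min(range(len(joined)), key=joined.__getitem__)
            new_costs.append(joined[best] + target_cost(cur, targets[i][1]))
            pointers.append(best)
        costs = new_costs
        back.append(pointers)
    # Trace the cheapest path backwards.
    k = min(range(len(costs)), key=costs.__getitem__)
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [cands[i][k] for i, k in enumerate(path)]


# Hypothetical database: utterance 0 contains "a" followed by "me".
db = [
    Segment("a", 200.0, 0, 0), Segment("me", 220.0, 0, 1),
    Segment("a", 190.0, 1, 0), Segment("me", 205.0, 2, 0),
]
chosen = select_segments([("a", 198.0), ("me", 214.0)], db)
print([(s.unit, s.utt, s.pos) for s in chosen])
```

In this toy database the two segments taken from utterance 0 were adjacent in the recording, so they incur no concatenation cost and are selected together; no smoothing would be needed at that join.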
In waveform concatenation speech synthesis, however, accurate and natural prosody cannot always be produced. This is because the consistency of the prosody may be lost when speech segments selected by minimizing a cost are concatenated. In Japanese in particular, the relationship in pitch between moras is perceived as a pitch accent. Therefore, unless the prosody generated by concatenating the speech segments is consistent as a whole, the naturalness of the synthesized speech is lost. In addition, a highly natural accent cannot always be obtained even when continuous speech segments are used for the synthesized speech. This is because an accent depends on its context: the frequency of the speech may differ according to the context even if the accent is the same, and, when the continuous speech segments are poorly consistent with the portions outside them, the prosody of the accent as a whole may become unnatural at the connections.
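The loss of consistency can be seen in a small worked example. It is only an illustration under simplifying assumptions: each mora is chosen independently by the smallest F0 distance to its target, and all F0 values are made up. Every chosen segment is close to its own target, yet the pitch relationship between the two moras is reversed.

```python
def nearest(candidates, target_f0):
    # Pick the candidate F0 closest to the target for a single mora.
    return min(candidates, key=lambda f0: abs(f0 - target_f0))


# Hypothetical two-mora target: low then high (a rising pitch accent).
target = [200.0, 210.0]
# Hypothetical candidate F0 values (Hz) available for each mora.
candidates = [[206.0, 185.0], [204.0, 225.0]]

chosen = [nearest(c, t) for c, t in zip(candidates, target)]
target_rises = target[1] > target[0]
chosen_rises = chosen[1] > chosen[0]
print(chosen, target_rises, chosen_rises)
```

Each chosen value is within 6 Hz of its target, but the chosen sequence falls where the target rises, which a listener would hear as a different pitch accent.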
Japanese Unexamined Patent Publication (Kokai) No. 2005-292433 discloses a technology that: acquires a prosody sequence for target speech to be synthesized, with respect to each of a plurality of segments, each of which is a synthesis unit of speech synthesis; holds fused speech segments, each obtained by fusing a plurality of speech segments that are intended for the same speech unit and differ from each other in the prosody of that speech unit, in association with fused speech segment prosody information indicating the prosody of the fused speech segment; estimates a degree of distortion between segment prosody information, which indicates the prosody of the segments obtained by division, and the fused speech segment prosody information; selects a fused speech segment based on the estimated degree of distortion; and generates synthesized speech by concatenating the fused speech segments selected for the respective segments. This publication, however, does not suggest a technique for handling continuous speech segments.
The following document [1] discloses obtaining the speech segment sequence having the maximum likelihood by learning, in a prosody model for use in waveform concatenation speech synthesis, the distribution of absolute values and relative values of the fundamental frequency (F0). Even with the technique disclosed in this document, however, unnatural prosody is produced by the synthesis when no appropriate speech segments are available. Although the F0 curve having the maximum likelihood can be forcibly used as the prosody of the synthesized speech, doing so loses the naturalness that is possible only in waveform concatenation speech synthesis.
On the other hand, the following document [2] discloses using the speech segment prosody directly for continuous speech segments, since discontinuity never occurs within continuous speech segments. In this technique, in the portions other than the continuous speech segments, the speech segment prosody is smoothed before being used for the synthesized speech.
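The idea of keeping the prosody of continuous speech segments and smoothing only the remaining portions might be sketched as follows. This is not the procedure of document [2], whose details are not given here; the moving-average smoother, the `radius` parameter, and the function names `continuous_mask` and `smooth_outside_continuous` are assumptions made for illustration.

```python
def continuous_mask(origins):
    """origins: list of (utterance_id, position), one per selected segment.

    Marks a segment True when it was adjacent in the recording to its left
    or right neighbour in the selected sequence.
    """
    mask = [False] * len(origins)
    for i in range(len(origins) - 1):
        (u0, p0), (u1, p1) = origins[i], origins[i + 1]
        if u0 == u1 and p0 + 1 == p1:
            mask[i] = mask[i + 1] = True
    return mask


def smooth_outside_continuous(f0, origins, radius=1):
    """Keep F0 of continuous segments; moving-average the others (assumed)."""
    mask = continuous_mask(origins)
    out = []
    for i, value in enumerate(f0):
        if mask[i]:
            out.append(value)  # segment prosody used directly
        else:
            window = f0[max(0, i - radius):min(len(f0), i + radius + 1)]
            out.append(sum(window) / len(window))
    return out


# Hypothetical sequence: the first two segments come from one utterance.
f0 = [200.0, 220.0, 180.0, 210.0]
origins = [(0, 0), (0, 1), (3, 5), (7, 2)]
smoothed = smooth_outside_continuous(f0, origins)
print(smoothed)
```

The first two values are left untouched because those segments were adjacent in the recording, while the last two are replaced by local averages.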