The present invention relates to a speech synthesis apparatus, and a method therefor, which embed an arbitrary text segment specified by a user at the position of an unfixed form portion of an exemplary text segment, the exemplary text segment including a fixed form portion having fixed contents and the unfixed form portion having varying contents, and which generate synthesized speech of the exemplary text segment having the text segment embedded therein.
In recent years, a variety of speech synthesis apparatuses have been developed which analyze text written in a mixture of Japanese letters (kana) and Chinese characters (kanji), synthesize speech information of the text by rule, and output the synthesized speech.
The basic arrangement of a speech synthesis apparatus of this type employing the synthesis-by-rule method is as follows. Speech utterances are analyzed in predetermined units, e.g., in units of CVs (consonant/vowel), CVCs (consonant/vowel/consonant), VCVs (vowel/consonant/vowel), or VCs (vowel/consonant), by LSP (line spectrum pair) analysis or cepstrum analysis to obtain phonetic information. The phonetic information is registered in a speech segment file. On the basis of this speech segment file and the synthesis parameters (a phonetic string and prosodic information) obtained by analyzing text, voice source generation and synthesis filtering are performed to generate synthesized speech.
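The arrangement described above can be illustrated by the following minimal sketch. It is not the implementation of any particular apparatus. The sample rate, the contents of `SEGMENT_FILE`, the impulse-train voice source, and the one-pole synthesis filter are all assumptions chosen for brevity; an actual speech segment file would hold LSP or cepstrum parameters obtained by analysis.

```python
from typing import Dict, List, Tuple

SAMPLE_RATE = 8000  # Hz; assumed for illustration only

# Hypothetical speech segment file: one filter coefficient per CV unit.
# A real file would hold LSP or cepstrum parameters, not a single number.
SEGMENT_FILE: Dict[str, float] = {"ta": 0.5, "na": 0.7, "ka": 0.6}


def voice_source(pitch_hz: float, n_samples: int) -> List[float]:
    """Generate an impulse train with one pulse per pitch period."""
    period = max(1, round(SAMPLE_RATE / pitch_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def synthesis_filter(source: List[float], coeff: float) -> List[float]:
    """Apply a one-pole recursive filter y[n] = x[n] + coeff * y[n-1]."""
    out: List[float] = []
    prev = 0.0
    for x in source:
        prev = x + coeff * prev
        out.append(prev)
    return out


def synthesize(
    phonetic_string: List[str],
    prosody: List[Tuple[float, float]],
) -> List[float]:
    """Concatenate filtered source signals, one per unit.

    prosody holds one (pitch in Hz, duration in seconds) pair per unit.
    """
    speech: List[float] = []
    for unit, (pitch_hz, duration_s) in zip(phonetic_string, prosody):
        n_samples = round(duration_s * SAMPLE_RATE)
        source = voice_source(pitch_hz, n_samples)
        speech.extend(synthesis_filter(source, SEGMENT_FILE[unit]))
    return speech


# Synthesis parameters as they might result from text analysis (assumed values).
samples = synthesize(
    ["ta", "na", "ka"],
    [(120.0, 0.1), (110.0, 0.1), (100.0, 0.1)],
)
```

In this sketch the phonetic string selects entries of the segment file, while the prosodic information controls only the voice source, which mirrors the division of roles described above.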
In text-to-speech synthesis by rule, a phonetic string and prosodic information are generated by analyzing text. Since both the phonetic string and the prosodic information are generated by rule, the resultant speech inevitably contains unnatural portions because the rules are imperfect.
When the text to be spoken is determined in advance, a technique called analysis synthesis is used. In this technique, the text is actually uttered by a person and the utterance is analyzed to generate various parameters, and speech is synthesized using these parameters. Since parameters of higher quality than those available in synthesis by rule can be used for speech synthesis, more natural speech can be synthesized.
In some application fields, it is required to synthesize a variable part of the text by the synthesis-by-rule method and to synthesize the remaining portion using parameters generated by analysis. In this case, speech more natural than that obtained by synthesizing the full text by rule can be obtained while partially retaining the flexibility of synthesis by rule.
In this prior art, however, when speech is synthesized by rule from only the text to be embedded as the synthesis-by-rule portion, and the resultant portion is concatenated to the remaining portion based on analysis, a natural concatenation cannot be obtained.
For example, for a sentence "Mr. Tanaka is waiting" ("/ta/na/ka/sa/ma/ga/o/ma/chi/de/go/za/i/ma/su/" in Japanese), "Mr. Tanaka" ("/ta/na/ka/sa/ma/ga/" in Japanese) is synthesized by rule, and "is waiting" ("/o/ma/chi/de/go/za/i/ma/su/" in Japanese) is synthesized on the basis of analysis. If "/ta/na/ka/sa/ma/ga/" is synthesized by rule without considering that "/o/ma/chi/de/go/za/i/ma/su/" follows the portion, the synthesized speech sounds as if the sentence ended at that portion ("/ta/na/ka/sa/ma/ga/"). When "/o/ma/chi/de/go/za/i/ma/su/" is spoken after that portion, unnatural speech is obtained.
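The problem in the above example can be illustrated numerically by the following sketch. All values and the pitch rule itself are assumptions made only for illustration: `rule_pitch` is a hypothetical toy rule that lowers the pitch of the last mora when the portion is treated as the end of a sentence, and `ANALYZED_PITCH` stands in for parameters that would be obtained by analyzing a human utterance of the fixed form portion. The case in which the following portion is taken into account is included only for comparison and does not describe any particular method.

```python
from typing import List


def split_morae(phonetic: str) -> List[str]:
    """Split a string such as '/ta/na/ka/' into its units."""
    return [m for m in phonetic.split("/") if m]


def rule_pitch(morae: List[str], sentence_final: bool) -> List[float]:
    """Toy prosody rule (assumed): pitch declines 2 Hz per mora from 130 Hz.

    If the portion is treated as the end of a sentence, the last mora
    is lowered by a further 30 Hz.
    """
    contour = [130.0 - 2.0 * i for i in range(len(morae))]
    if sentence_final:
        contour[-1] -= 30.0
    return contour


# Hypothetical per-mora pitch (Hz) for "/o/ma/chi/de/go/za/i/ma/su/",
# standing in for parameters obtained by analysis.
ANALYZED_PITCH: List[float] = [
    120.0, 118.0, 116.0, 114.0, 112.0, 110.0, 105.0, 95.0, 85.0,
]


def boundary_jump(embedded: List[float], fixed: List[float]) -> float:
    """Pitch discontinuity (Hz) where the two portions are concatenated."""
    return abs(fixed[0] - embedded[-1])


embedded_morae = split_morae("/ta/na/ka/sa/ma/ga/")

# Synthesized by rule without considering that another portion follows.
isolated = rule_pitch(embedded_morae, sentence_final=True)

# Comparison case: the same rule, aware that the sentence continues.
in_context = rule_pitch(embedded_morae, sentence_final=False)

jump_isolated = boundary_jump(isolated, ANALYZED_PITCH)
jump_in_context = boundary_jump(in_context, ANALYZED_PITCH)
```

Under these assumed values, the portion synthesized in isolation leaves a 30 Hz discontinuity at the concatenation point, whereas the comparison case leaves none, which corresponds to the impression that the sentence ended at "/ta/na/ka/sa/ma/ga/".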