FIG. 11 is a block diagram showing the configuration of a general text-to-speech synthesizer. The text-to-speech synthesizer is mainly composed of a text input terminal 1, a text analyzer 2, a prosody generator 3, a speech segment selector 4, a speech segment database 5, a speech synthesizer 6, and an output terminal 7.
The operation of the conventional text-to-speech synthesizer is described below. When text information in mixed Japanese Kanji and Kana, such as words and sentences (e.g., the Kanji for “left”), is inputted from the input terminal 1, the text analyzer 2 converts the inputted text information “left” to reading information (e.g., “hidari”) and outputs it. It is noted that the input text is not limited to mixed Kanji and Kana text; reading symbols such as alphabetic characters may be inputted directly.
The prosody generator 3 generates prosody information (information on the pitch and volume of speech and on the speaking rate) based on the reading information “hidari” from the text analyzer 2. Here, the information on the pitch of speech is set by the pitch (fundamental frequency) of each vowel, so that in this example the pitches of the vowels “i”, “a”, “i” are set in time order. The information on the volume of speech and the speaking rate is set by the amplitude and duration of the speech waveform for each of the phonemes “h”, “i”, “d”, “a”, “r”, “i”. The prosody information thus generated is sent to the speech segment selector 4 together with the reading information “hidari”.
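The prosody information described above can be pictured as one record per phoneme, with a pitch value present only for vowels. The following Python sketch is illustrative only: the names `PhonemeProsody` and `generate_prosody`, the numeric defaults, and the treatment of each letter of the reading as one phoneme are assumptions made for this example and are not taken from the conventional apparatus.

```python
from dataclasses import dataclass
from typing import List, Optional

VOWELS = set("aiueo")


@dataclass
class PhonemeProsody:
    phoneme: str
    duration_ms: float         # relates to the speaking rate
    amplitude: float           # relates to the volume of speech
    pitch_hz: Optional[float]  # fundamental frequency, set for vowels only


def generate_prosody(reading: str, base_pitch_hz: float = 120.0) -> List[PhonemeProsody]:
    """Return one prosody record per phoneme (hypothetical constant values)."""
    records = []
    for ph in reading:  # assumption: one letter corresponds to one phoneme
        is_vowel = ph in VOWELS
        records.append(
            PhonemeProsody(
                phoneme=ph,
                duration_ms=90.0 if is_vowel else 50.0,
                amplitude=1.0 if is_vowel else 0.6,
                pitch_hz=base_pitch_hz if is_vowel else None,
            )
        )
    return records


prosody = generate_prosody("hidari")
pitched = [p.phoneme for p in prosody if p.pitch_hz is not None]
```

In this sketch, `pitched` lists the vowels that carry a pitch value in time order, while every phoneme carries an amplitude and a duration.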
Next, the speech segment selector 4 refers to the speech segment database 5 and selects the speech segment data necessary for speech synthesis based on the reading information “hidari” from the prosody generator 3. Widely used speech synthesis units include the Consonant+Vowel (CV) syllable unit (e.g., “ka”, “gu”) and the Vowel+Consonant+Vowel (VCV) unit (e.g., “aki”, “ito”), which holds the feature quantities of the transitional portion between concatenated syllables in order to achieve high sound quality. The following description assumes that the VCV unit is used as the basic unit of speech segment (speech synthesis unit).
The speech segment database 5 stores, as the speech segment data, waveforms and parameters obtained by extracting VCV units as appropriate from, for example, speech data spoken by an announcer, analyzing them, and converting the results to the form necessary for synthesis processing. General Japanese text-to-speech synthesis that uses the VCV speech segment as the synthesis unit stores approximately 800 VCV speech segment data sets. When the reading information “hidari” is inputted to the speech segment selector 4 as in this example, the speech segment selector 4 selects from the speech segment database 5 the speech segment data containing the VCV segments “*hi”, “ida”, “ari”, “i**”. It is noted that the symbol “*” denotes silence. The selection result information thus obtained is sent together with the prosody information to the speech synthesizer 6.
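The division of a reading into VCV segments can be sketched as follows. This is a minimal illustration that reproduces the “hidari” example above; the function name `vcv_units` is hypothetical, and the sketch assumes a romanized reading made only of the vowels a, i, u, e, o and single-letter consonants, which is far simpler than a practical selector.

```python
from typing import List

VOWELS = set("aiueo")
SILENCE = "*"


def vcv_units(reading: str) -> List[str]:
    """Split a romanized reading into VCV units, with '*' denoting silence."""
    vowel_positions = [i for i, ch in enumerate(reading) if ch in VOWELS]
    if not vowel_positions:
        raise ValueError("reading contains no vowel")

    # Leading unit: silence followed by everything up to the first vowel.
    units = [SILENCE + reading[: vowel_positions[0] + 1]]

    # Middle units: each span from one vowel to the next vowel, inclusive,
    # so that adjacent units share a vowel.
    for start, end in zip(vowel_positions, vowel_positions[1:]):
        units.append(reading[start : end + 1])

    # Trailing unit: from the last vowel onward, followed by silence.
    units.append(reading[vowel_positions[-1] :] + SILENCE * 2)
    return units


segments = vcv_units("hidari")
```

Because adjacent units share a vowel, the units can later be joined within that shared vowel section.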
Finally, the speech synthesizer 6 reads the corresponding speech segment data from the speech segment database 5 based on the inputted selection result information. Then, based on the inputted prosody information and the speech segment data thus read, the sequence of selected VCV speech segments is smoothly connected in the vowel sections, while the pitch and volume of speech and the speaking rate are controlled in accordance with the prosody information, and the result is outputted from the output terminal 7. Methods widely applied to the speech synthesizer 6 include a method generally called the waveform overlap-add technique (e.g., Japanese Patent Laid-Open Publication No. 60-21098) and a method generally called the vocoder technique or formant synthesis technique (e.g., “Basic Speech Information Processing” P76–77 published by Ohmsha).
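The smooth connection of adjacent segments in a shared vowel section can be illustrated by a simple linear cross-fade. The sketch below is a deliberately simplified stand-in and is not the waveform overlap-add technique of the cited publication; the function name `crossfade_concat` and the representation of a waveform as a list of floating-point samples are assumptions made for this example.

```python
from typing import List


def crossfade_concat(a: List[float], b: List[float], overlap: int) -> List[float]:
    """Join waveform a to waveform b, blending `overlap` samples linearly."""
    if overlap <= 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must be between 1 and the shorter waveform length")

    head = a[: len(a) - overlap]  # part of a that is not blended
    tail = b[overlap:]            # part of b that is not blended

    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b rises steadily toward 1
        blended.append((1.0 - w) * a[len(a) - overlap + i] + w * b[i])

    return head + blended + tail


# Two constant "vowel" stretches with different levels, joined over 4 samples.
joined = crossfade_concat([1.0] * 8, [0.0] * 8, overlap=4)
```

The joined waveform is shorter than the two inputs laid end to end by `overlap` samples, and the level moves gradually from the first segment to the second instead of jumping.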
The above-stated text-to-speech synthesizer can increase the number of speech qualities (speakers) by changing the voice pitch or the speech segment database. Also, separate signal processing is applied to the speech signal outputted from the speech synthesizer 6 so as to achieve sound effects such as echo. Further, it has been proposed that pitch conversion processing, which is also used in Karaoke and the like, be applied to the output speech signal from the speech synthesizer 6, and that the original synthetic speech signal and the pitch-converted speech signal be combined to implement simultaneous speaking by a plurality of speakers (e.g., Japanese Patent Laid-Open Publication No. 3-211597). Also, there has been proposed an apparatus in which the text analyzer 2 and the prosody generator 3 in the above text-to-speech synthesizer are driven by time sharing, and a plurality of speech output portions, each composed of the speech synthesizer 6 and the like, are provided for simultaneously outputting a plurality of voices corresponding to a plurality of texts (e.g., Japanese Patent Laid-Open Publication No. 6-75594).
In the above conventional text-to-speech synthesizer, changing the speech segment database makes it possible to switch speakers so that a specified text is spoken by various speakers. However, there is a problem that, for example, a plurality of speakers cannot speak the same speech content simultaneously.
Also, as disclosed in the Japanese Patent Laid-Open Publication No. 6-75594, the text analyzer 2 and the prosody generator 3 in the above text-to-speech synthesizer may be driven by time sharing, and a plurality of speech output portions, each composed of the speech synthesizer 6 and the like, may be provided for simultaneously outputting a plurality of voices corresponding to a plurality of texts. However, there is a problem that the pre-processing needs to be performed by time sharing, which complicates the apparatus.
Also, as disclosed in the above Japanese Patent Laid-Open Publication No. 3-211597, the pitch conversion processing may be applied to the output speech signal from the speech synthesizer 6, and the original synthetic speech signal and the pitch-converted speech signal may be combined so that a plurality of speakers speak simultaneously. However, the pitch conversion processing requires processing generally called pitch extraction, which involves a large processing amount, so that such an apparatus configuration has a problem of an increased processing amount and a large increase in cost.
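To give a sense of why pitch extraction involves a large processing amount, the following sketch estimates the pitch of one frame by a naive autocorrelation search. It is only one commonly used approach, chosen here for illustration, and is not taken from the cited publication; the function name `estimate_pitch_hz`, the search range of 60 Hz to 400 Hz, and the frame length are assumptions made for this example.

```python
import math
from typing import List, Optional


def estimate_pitch_hz(
    frame: List[float],
    sample_rate: int,
    min_hz: float = 60.0,
    max_hz: float = 400.0,
) -> Optional[float]:
    """Estimate the pitch of one frame by searching for the best autocorrelation lag."""
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(frame) - 1)

    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # One multiply-add per sample for every candidate lag: this nested
        # loop is where the processing amount accumulates.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score

    return sample_rate / best_lag if best_lag else None


# A 200 Hz sine wave sampled at 8000 Hz, 400 samples (50 ms) long.
SAMPLE_RATE = 8000
frame = [math.sin(2.0 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(400)]
pitch = estimate_pitch_hz(frame, SAMPLE_RATE)
```

Even in this small sketch, each frame requires on the order of the frame length multiplied by the number of candidate lags in multiply-add operations, and the search must be repeated for every frame of the output speech signal.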