Text-to-speech synthesis is a technique of artificially generating a speech signal from an arbitrary document (text). Text-to-speech synthesis is implemented by three steps, i.e., language processing, prosodic processing, and speech signal synthesis processing.
In language processing, which serves as the first step, an input text undergoes morphological analysis, syntax analysis, and the like. In prosodic processing, which serves as the second step, processing regarding accent and intonation is performed based on the language processing result, outputting a phoneme sequence (phoneme symbol sequence) and prosodic information (e.g., fundamental frequency, phoneme duration, and power). Finally, in speech signal synthesis processing, which serves as the third step, a speech signal is synthesized based on the phoneme sequence and the prosodic information.
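The three-step pipeline above can be sketched as follows. This is a minimal illustrative sketch, not a real synthesizer: the function names and the toy per-phoneme prosody values (fixed fundamental frequency, duration, and power) are assumptions introduced for illustration.

```python
# Illustrative sketch of the three TTS steps: language processing,
# prosodic processing, and speech signal synthesis processing.
# All function bodies are toy stand-ins, not a real system.

def language_processing(text):
    # Stands in for morphological/syntax analysis; here, naive tokenization.
    return text.lower().split()

def prosodic_processing(words):
    # Outputs a phoneme symbol sequence and toy prosodic information.
    phonemes = [ch for word in words for ch in word]
    prosody = [{"f0_hz": 120.0, "duration_ms": 80, "power": 1.0}
               for _ in phonemes]
    return phonemes, prosody

def speech_signal_synthesis(phonemes, prosody):
    # Would synthesize a waveform from the phoneme sequence and prosody;
    # here it just reports the total utterance duration in milliseconds.
    return sum(p["duration_ms"] for p in prosody)

words = language_processing("Hello world")
phonemes, prosody = prosodic_processing(words)
total_ms = speech_signal_synthesis(phonemes, prosody)
print(total_ms)  # 10 phonemes * 80 ms = 800
```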
One basic approach to text-to-speech synthesis is to connect feature parameters called speech segments. A speech segment is the feature parameter of a relatively short unit of speech such as a CV, CVC, or VCV unit (C denotes a consonant and V a vowel). Speech corresponding to an arbitrary phoneme symbol sequence can be synthesized by connecting prepared speech segments while controlling the pitch and duration. In text-to-speech synthesis, the quality of the available speech segments greatly influences the quality of the synthesized speech.
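A concatenative scheme of this kind can be sketched as below. The segment inventory, the frame values, and the crude duration-control rule (repeating or truncating frames) are all illustrative assumptions, not the method of any cited publication.

```python
# Hypothetical sketch: synthesizing a phoneme sequence by connecting
# prepared speech segments (short lists standing in for CV/VCV feature
# parameters) while controlling each segment's duration.

segment_inventory = {
    "ka": [0.1, 0.2, 0.3],  # toy feature-parameter frames for segment "ka"
    "sa": [0.4, 0.5],       # toy feature-parameter frames for segment "sa"
}

def stretch(frames, target_len):
    # Crude duration control: resample frames to the target frame count
    # by repeating or skipping entries.
    return [frames[min(int(i * len(frames) / target_len), len(frames) - 1)]
            for i in range(target_len)]

def synthesize(phoneme_seq, durations):
    # Connect the duration-adjusted segments into one parameter stream.
    signal = []
    for ph, dur in zip(phoneme_seq, durations):
        signal.extend(stretch(segment_inventory[ph], dur))
    return signal

out = synthesize(["ka", "sa"], [4, 3])
```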
A speech synthesis method described in Japanese Patent Publication No. 3732793 expresses a speech segment using, e.g., formant frequencies. In this speech synthesis method, a waveform representing one formant (referred to simply as a formant waveform) is generated by multiplying a sine wave having the same frequency as the formant frequency by a window function. A plurality of formant waveforms are then superposed (added) to synthesize a speech signal. Because the speech synthesis method in Japanese Patent Publication No. 3732793 can directly control the phoneme or voice quality, it can relatively easily implement flexible control such as changing the voice quality of synthesized speech.
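The windowed-sinusoid construction described above can be sketched in a few lines. The sampling rate, window choice (Hanning), and the formant frequencies and amplitudes are illustrative assumptions; they are not taken from the cited publication.

```python
import math

# Sketch of formant-waveform synthesis as described above: a sine wave
# at each formant frequency is multiplied by a window function, and the
# resulting formant waveforms are superposed (added).

FS = 16000  # assumed sampling rate in Hz

def formant_waveform(freq_hz, amplitude, n_samples):
    # Sine wave at the formant frequency multiplied by a Hanning window.
    return [amplitude
            * 0.5 * (1.0 - math.cos(2.0 * math.pi * n / (n_samples - 1)))
            * math.sin(2.0 * math.pi * freq_hz * n / FS)
            for n in range(n_samples)]

def synthesize_frame(formants, n_samples=256):
    # Superpose one windowed waveform per formant.
    frame = [0.0] * n_samples
    for freq, amp in formants:
        for i, sample in enumerate(formant_waveform(freq, amp, n_samples)):
            frame[i] += sample
    return frame

# Toy formant set for a vowel-like sound: (frequency in Hz, amplitude).
frame = synthesize_frame([(700.0, 1.0), (1200.0, 0.6), (2600.0, 0.3)])
```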
By converting all formant frequencies contained in the speech segments using a control function for changing the depth of a voice, the speech synthesis method described in Japanese Patent Publication No. 3732793 can shift the formants to the high-frequency side to make the voice of synthesized speech thinner, or shift them to the low-frequency side to make the voice deeper. However, the speech synthesis method described in Japanese Patent Publication No. 3732793 does not synthesize interpolated speech based on a plurality of speakers.
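In the simplest case, such a control function might uniformly scale every formant frequency. The uniform-scaling form and the specific scale factors below are assumptions for illustration; the actual control function of the cited publication may differ.

```python
# Hypothetical control function in the spirit of the voice-depth control
# described above: scale factors above 1.0 shift all formants toward the
# high-frequency side (thinner voice); factors below 1.0 shift them
# toward the low-frequency side (deeper voice).

def shift_formants(formant_freqs_hz, scale):
    return [f * scale for f in formant_freqs_hz]

original = [700.0, 1200.0, 2600.0]   # toy formant frequencies in Hz
thinner = shift_formants(original, 1.2)
deeper = shift_formants(original, 0.85)
```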
A speech synthesis apparatus described in Japanese Patent Publication No. 2951514 generates interpolated speech spectrum data by interpolating the speech spectrum data of a plurality of speakers using predetermined interpolation ratios. The speech synthesis apparatus described in Japanese Patent Publication No. 2951514 can control the voice quality of synthesized speech even with a relatively simple arrangement.
The speech synthesis apparatus described in Japanese Patent Publication No. 2951514 synthesizes interpolated speech based on a plurality of speakers, but because of its simple arrangement the quality of the interpolated speech is not always high. In particular, the speech synthesis apparatus described in Japanese Patent Publication No. 2951514 may fail to obtain interpolated speech of satisfactory quality when interpolating a plurality of speech spectrum data that differ in formant position (formant frequency) or in the number of formants.