A speech synthesizer that generates speech from, text includes three main processors, i.e., a text analyzer, a prosody generator, and a speech signal generator. The text analyzer performs a text analysis of input text (a sentence including Chinese characters and kana characters) using, for example, a language dictionary and outputs linguistic information (also referred to as linguistic features) such as phoneme strings, morphemes, readings of Chinese characters, positions of stresses, and boundaries of segments (stressed phrases). On the basis of the linguistic features, the prosody generator outputs prosody information including a time variation pattern (hereinafter referred to as a pitch envelope) of the pitch of the speech (basic frequency) and the length (hereinafter referred to as the duration) of each phoneme. The prosody generator is an important device that contributes to the quality and the overall naturalness of synthetic speech.
A technique is proposed in U.S. Pat. No. 6,405,169 in which a generated prosody is compared with the prosody of speech units used in a speech signal generator, and the prosody of speech units is used when the difference therebetween is small in order to reduce distortion of synthetic speech. A technique is proposed in “Multilevel parametric-base F0 model for speech synthesis” Proc. Interspeech 2008, Brisbane, Australia, pp. 2274-2277 (Latorre, J., Akamine, M.) in which pitch envelopes are modeled for phonemes, syllables, and the like and a pitch envelope pattern is generated from the plural pitch envelope models to thereby generate a natural pitch envelope that varies smoothly.
On the basis of the linguistic features from the text analyzer and the prosody information from the prosody generator, the speech signal generator generates a speech waveform. Currently, a method called a concatenative synthesis method is generally used, which can synthesize relatively high quality speech.
The concatenative synthesis method includes selecting speech units on the basis of linguistic features determined by the text analyzer and the prosody information generated by the prosody generator, modifying the pitches (basic frequencies) and the durations of the speech units on the basis of the prosody information, concatenating the speech units, and outputting synthetic speech. The speech quality is significantly reduced by the modification of the pitches and the durations of the speech units.
A method is known for coping with this problem in which a large-scale speech unit database is provided and speech units are selected from a large number of speech unit candidates with various pitches and durations. By using this method, modifications to pitch and duration can be minimized, the reduction in the speech quality due to the modifications can be suppressed, and speech synthesis with high quality can be achieved. However, this method requires an extremely large database for storing speech units.
There is also a method in which the pitches and the durations of selected speech units are used without modifying the pitches and the durations of the speech units. This method can avoid any reduction in the speech quality due to modifications to pitch and duration. However, the continuity of pitches of the selected and concatenated speech units is not necessarily guaranteed, and discontinuous pitches degrade the naturalness of synthetic speech. To improve the naturalness of the pitches and the durations of the speech units, the number of types of speech units needs to be increased, and this requires an extremely large database for storing the speech units.