Speech synthesizers that generate synthesized speech corresponding to an input text are known. One conventional type is the waveform-concatenation based speech synthesizer, which prepares a database containing a large amount of speech waveforms, selects speech segments from the speech waveform database in accordance with the input text, and concatenates the selected speech segments to synthesize speech. Also known is a multiple-segment selecting speech synthesizer, which enhances the sense of stability by selecting a plurality of speech segments for each section and generating a speech waveform from the selected speech segments. In such waveform-concatenation based speech synthesizers, a section in which the speech segments are properly selected yields high-quality synthesized speech that resembles recorded speech. However, naturalness decreases when a selected speech segment does not match the prosody, and distortion arises from discontinuity between adjacent speech segments.
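As a rough illustration of the segment selection described above, the following minimal sketch picks one speech segment per section by dynamic programming, trading off a mismatch with the prosody (target cost) against a discontinuity between adjacent segments (join cost). The segment representation as a (mean pitch, left-edge value, right-edge value) tuple, the absolute-difference cost functions, and the names select_segments and join_weight are all assumptions made for this example; they are not taken from any particular synthesizer.

```python
from typing import List, Sequence, Tuple

# Hypothetical segment representation: (mean pitch, left-edge value, right-edge value).
Segment = Tuple[float, float, float]


def target_cost(seg: Segment, target_pitch: float) -> float:
    """Mismatch between a candidate segment and the desired prosody (assumed metric)."""
    return abs(seg[0] - target_pitch)


def join_cost(left: Segment, right: Segment) -> float:
    """Discontinuity at the boundary between two adjacent segments (assumed metric)."""
    return abs(left[2] - right[1])


def select_segments(
    targets: Sequence[float],
    candidates: Sequence[Sequence[Segment]],
    join_weight: float = 1.0,
) -> Tuple[List[int], float]:
    """Return the candidate index chosen for each section and the total cost."""
    # best[i][j]: minimal accumulated cost of a path ending at candidate j of section i.
    best = [[target_cost(c, targets[0]) for c in candidates[0]]]
    back = []  # back[i-1][j]: best predecessor index for candidate j of section i
    for i in range(1, len(targets)):
        row, ptr = [], []
        for cand in candidates[i]:
            costs = [
                best[i - 1][k] + join_weight * join_cost(prev, cand)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_best] + target_cost(cand, targets[i]))
            ptr.append(k_best)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return path, total


# Two sections, two candidate segments each (toy values).
targets = [100.0, 110.0]
candidates = [
    [(100.0, 100.0, 100.0), (102.0, 100.0, 120.0)],
    [(110.0, 120.0, 110.0), (114.0, 100.0, 114.0)],
]
path, total = select_segments(targets, candidates)
```

In this toy data, the first candidate of the first section matches the prosody best, but it joins poorly with the best-matching candidate of the second section, so the search accepts a small prosody mismatch to avoid the discontinuity. Setting join_weight to 0 reduces the search to picking the best prosody match in each section independently.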
Meanwhile, as a statistical model-based speech synthesizer, the HMM-based speech synthesizer has been proposed and is widely used. It trains a hidden Markov model (HMM) from acoustic feature parameters obtained by analyzing the speech database, and synthesizes speech based on the trained HMM. In HMM-based speech synthesis, speech is synthesized by obtaining a distribution sequence in accordance with the input text and generating feature parameters from the obtained distribution sequence. However, because the speech is synthesized from averaged feature parameters, HMM-based speech synthesis suffers from over-smoothing, which results in synthesized speech of unnatural sound quality.
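The over-smoothing effect can be illustrated with a toy example. The sketch below is not HMM training or parameter generation; it only averages a few time-aligned feature trajectories frame by frame, as a stand-in for the means of the trained distributions, and compares the variance of the averaged trajectory with that of the original examples. The data values and the function name train_frame_means are assumptions made for this example.

```python
from statistics import mean, pvariance
from typing import List, Sequence


def train_frame_means(examples: Sequence[Sequence[float]]) -> List[float]:
    """Average aligned trajectories frame by frame (toy stand-in for distribution means)."""
    return [mean(frame) for frame in zip(*examples)]


# Three hypothetical, time-aligned trajectories of one acoustic feature parameter.
examples = [
    [1.0, 5.0, 1.0, 5.0],
    [5.0, 1.0, 5.0, 1.0],
    [1.0, 5.0, 5.0, 1.0],
]

# "Generating" from the means reproduces the averaged trajectory.
generated = train_frame_means(examples)

# Variation over time of the generated trajectory versus each natural example.
generated_variance = pvariance(generated)
example_variances = [pvariance(e) for e in examples]
```

Each example trajectory swings between 1 and 5, but the averaged trajectory stays close to the overall mean, so its variance over time is much smaller than that of any example. This loss of fine variation is what is perceived as over-smoothed, unnatural sound quality.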