A speech synthesis device has been known which generates a speech waveform from input text. The speech synthesis device generates synthesized speech corresponding to the input text mainly through a text analysis process, a prosody generation process, and a waveform generation process. As a speech synthesis method, there are speech synthesis based on unit selection and speech synthesis based on a statistical model.
In the speech synthesis based on unit selection, speech units are selected from a speech unit database and are then concatenated to generate a waveform. Furthermore, in order to improve stability, a plural-unit selection and fusion method is used, the method includes selecting a plurality of speech units for each synthesis unit, generating speech units from the plurality of selected speech units using, for example, a pitch-cycle waveform averaging method, and concatenating the speech units. As a prosody generating method, for example, the following methods may be used: a duration length generation method based on a sum-of-product model; and a fundamental frequency sequence generation method using a fundamental frequency pattern code book and offset prediction.
As the speech synthesis based on the statistical model, speech synthesis based on an HMM (hidden Markov model) has been proposed. In the speech synthesis based on the HMM, the HMM which corresponds to a synthesis unit is trained from a spectrum parameter sequence, a fundamental frequency sequence, or a band noise intensity sequence calculated from speech and parameters are generated from an output distribution sequence corresponding to input text. In this way, a waveform is generated. A dynamic feature value is added to the output distribution of the HMM, and a parameter generation algorithm considering the dynamic feature value is used to generate a speech parameter sequence. In this way, smoothly concatenated synthesized speech is obtained.
Converting the quality of an input voice into a target voice quality is referred to as voice conversion. The speech synthesis device can generate synthesized speech close to a target voice quality or prosody using the voice conversion. For example, it is possible to convert a large amount of voice data obtained from an arbitrary uttered voice so as to be close to the target voice quality or prosody using a small amount of voice data obtained from a target uttered voice and generate speech synthesis data used for speech synthesis from a large amount of converted voice data. In this case, when only a small amount of voice data is prepared as target voice data, it is possible to generate synthesized speech which reproduces the features of the target uttered voice.
However, in the speech synthesis device using conventional voice conversion, during speech synthesis, only voice data generated by the voice conversion is used, but voice data obtained from the target uttered voice is not used. Therefore, similarity to the target uttered voice is likely to be insufficient.