In speech synthesis apparatuses that produce synthetic speech on the basis of text data, a speech synthesis method that is becoming popular today modifies synthesis units by pasting their pitch waveforms at desired pitch intervals while copying and/or deleting those pitch waveforms (PSOLA: Pitch Synchronous Overlap and Add), and produces synthetic speech by concatenating the modified synthesis units.
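The pitch-synchronous pasting described above can be illustrated with a minimal sketch. It is not taken from any particular apparatus: it assumes that pitch marks are already known, that the source pitch period is constant, and that a two-period Hann window is used. All function and parameter names (`hann`, `extract_pitch_waveforms`, `psola`, `source_period`, `target_period`) are hypothetical.

```python
import math


def hann(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def extract_pitch_waveforms(signal, pitch_marks, period):
    """Cut one Hann-windowed, two-period-wide pitch waveform around each pitch mark."""
    window = hann(2 * period + 1)
    waveforms = []
    for mark in pitch_marks:
        frame = []
        for k in range(-period, period + 1):
            idx = mark + k
            # Samples outside the signal are treated as silence.
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            frame.append(sample * window[k + period])
        waveforms.append(frame)
    return waveforms


def psola(signal, pitch_marks, source_period, target_period):
    """Paste pitch waveforms at target_period intervals (assumed constant periods).

    Returns the overlap-added output and, for each pasted position, the index
    of the source pitch waveform that was used. An index that appears twice
    was copied; an index that never appears was deleted.
    """
    waveforms = extract_pitch_waveforms(signal, pitch_marks, source_period)
    output = [0.0] * len(signal)
    chosen = []
    t = pitch_marks[0]
    while t <= pitch_marks[-1]:
        # Use the pitch waveform whose original mark is nearest to position t.
        i = min(range(len(pitch_marks)), key=lambda j: abs(pitch_marks[j] - t))
        chosen.append(i)
        # Overlap-add the chosen waveform, centered on t.
        for k, value in enumerate(waveforms[i]):
            idx = t - source_period + k
            if 0 <= idx < len(output):
                output[idx] += value
        t += target_period
    return output, chosen
```

With a target interval shorter than the source period, some pitch waveforms are pasted more than once (copying), which raises the pitch; with a longer interval, some are skipped (deleting), which lowers it. In both cases the overlapping windows no longer sum to the original signal, which is one way to picture the distortion that modification introduces.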
Synthetic speech produced with this technique contains a distortion caused by modifying synthesis units (hereinafter referred to as a modification distortion) and a distortion caused by concatenating synthesis units (hereinafter referred to as a concatenation distortion). These two distortions seriously degrade the quality of synthetic speech. When the number of synthesis units that can be registered in a synthesis unit inventory is limited, it is nearly impossible to select synthesis units that reduce these distortions. In particular, when only one synthesis unit can be registered in the synthesis unit inventory for each phonetic environment, there are no alternative units to choose from, so selecting synthesis units that reduce the distortions is impossible. When such a synthesis unit inventory is used, the quality of synthetic speech inevitably deteriorates due to the modification and concatenation distortions.