An apparatus that generates a speech waveform from speech feature parameters is called a speech synthesizer. One example is the source-filter type speech synthesizer. The source-filter type speech synthesizer receives a sound source signal (excitation source signal), which is generated from a pulse source signal representing the sound source components produced by vocal cord vibrations and a noise source signal representing sound sources originating from turbulent air flow or the like, and generates a speech waveform by filtering that signal using spectrum envelope parameters representing vocal tract characteristics or the like. In the simplest case, the sound source signal is created by switching between a pulse signal and a Gaussian noise signal. The pulse signal is created according to pitch information obtained from a fundamental frequency sequence and is used in voiced sound intervals; the Gaussian noise signal is used in unvoiced sound intervals. As the vocal tract filter, an all-pole filter whose spectrum envelope parameters are linear prediction coefficients, a lattice-type filter for PARCOR coefficients, an LSP synthesis filter for LSP parameters, or a Log Magnitude Approximation (LMA) filter for cepstrum parameters is used. A mel all-pole filter for mel LPC, a Mel Log Spectrum Approximation (MLSA) filter for mel cepstrum, or a Mel-Generalized Log Spectrum Approximation (MGLSA) filter for mel generalized cepstrum is also used.
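The simple switching scheme and the all-pole vocal tract filter described above can be illustrated with the following minimal sketch. It is an illustration under stated assumptions, not a reference implementation: the function names `make_excitation` and `all_pole_filter`, the per-sample fundamental frequency input (0 marking an unvoiced sample), the pulse scaling, and the sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p for the linear prediction coefficients are all choices made here for illustration.

```python
import math
import random


def make_excitation(f0_per_sample, sample_rate, seed=0):
    """Switch between a pulse train (voiced) and Gaussian noise (unvoiced).

    f0_per_sample: fundamental frequency in Hz for each output sample;
    a value of 0 marks an unvoiced sample (assumed convention).
    """
    rng = random.Random(seed)
    excitation = []
    phase = 0.0  # fraction of the current pitch period that has elapsed
    for f0 in f0_per_sample:
        if f0 > 0.0:
            phase += f0 / sample_rate
            if phase >= 1.0:
                phase -= 1.0
                # Scale the pulse so that a pulse train has roughly the same
                # power as unit-variance noise (assumed normalization).
                excitation.append(math.sqrt(sample_rate / f0))
            else:
                excitation.append(0.0)
        else:
            phase = 0.0
            excitation.append(rng.gauss(0.0, 1.0))
    return excitation


def all_pole_filter(excitation, lpc, gain=1.0):
    """Apply y[n] = gain * x[n] - sum_k lpc[k-1] * y[n-k] for k = 1..p."""
    output = []
    for n, x in enumerate(excitation):
        y = gain * x
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                y -= a * output[n - k]
        output.append(y)
    return output


if __name__ == "__main__":
    fs = 8000
    # 640 voiced samples at 125 Hz followed by 160 unvoiced samples.
    f0 = [125.0] * 640 + [0.0] * 160
    source = make_excitation(f0, fs)
    speech = all_pole_filter(source, lpc=[-0.9])  # hypothetical one-pole envelope
    print(len(speech))
```

In a practical synthesizer the filter coefficients would change frame by frame along with the fundamental frequency; the sketch keeps them fixed for brevity.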
A sound source signal used for such a source-filter type speech synthesizer can be created, as described above, by switching between a pulse sound source signal and a noise source signal. However, when this simple switching is applied to a signal such as a voiced fricative, in which a noise component and a periodic component are mixed such that the higher frequency band is noise-like and the lower frequency band is periodic, the generated sound takes on an unnatural buzzing or rough quality.
To deal with this problem, technologies such as Mixed Excitation Linear Prediction (MELP) have been proposed, in which the band above a certain frequency is treated as a noise source and the band below it as a pulse sound source, thereby preventing the degradation caused by a buzz or buzzer-like sound. In addition, to create a mixed sound source appropriately, a technology is used that divides the signal into sub-bands and mixes a noise source and a pulse sound source in each sub-band according to a mixing ratio.
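The sub-band mixing described above can be sketched as follows. This is a minimal illustration and not the MELP algorithm as standardized: the windowed-sinc filter design, the function names `bandpass_fir`, `fir_filter`, and `mixed_excitation`, the band edges, and the mixing ratios are all assumptions introduced here. Note that every band requires one band-pass filtering pass over the pulse signal and another over the noise signal.

```python
import math
import random


def bandpass_fir(low_hz, high_hz, sample_rate, num_taps=65):
    """Hamming-windowed sinc band-pass filter (difference of two low-pass filters)."""
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        m = n - mid
        if m == 0:
            h = 2.0 * (high_hz - low_hz) / sample_rate
        else:
            h = (math.sin(2.0 * math.pi * high_hz * m / sample_rate)
                 - math.sin(2.0 * math.pi * low_hz * m / sample_rate)) / (math.pi * m)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(h * window)
    return taps


def fir_filter(signal, taps):
    """Direct-form FIR convolution, truncated to the input length."""
    output = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        output.append(acc)
    return output


def mixed_excitation(pulse, noise, band_edges_hz, pulse_ratios, sample_rate):
    """Mix pulse and noise per sub-band; pulse_ratios[i] is the pulse share of band i."""
    mixed = [0.0] * len(pulse)
    for (low, high), ratio in zip(band_edges_hz, pulse_ratios):
        taps = bandpass_fir(low, high, sample_rate)
        pulse_band = fir_filter(pulse, taps)  # one filtering pass per band
        noise_band = fir_filter(noise, taps)  # and a second pass per band
        for n in range(len(mixed)):
            mixed[n] += ratio * pulse_band[n] + (1.0 - ratio) * noise_band[n]
    return mixed


if __name__ == "__main__":
    fs = 8000
    rng = random.Random(0)
    pulse = [8.0 if n % 64 == 63 else 0.0 for n in range(640)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(640)]
    # Hypothetical bands and ratios: mostly periodic below 2 kHz, mostly noisy above.
    bands = [(0.0, 1000.0), (1000.0, 2000.0), (2000.0, 3000.0), (3000.0, 4000.0)]
    ratios = [1.0, 0.8, 0.3, 0.0]
    source = mixed_excitation(pulse, noise, bands, ratios, fs)
    print(len(source))
```

With B bands and filters of T taps, this direct approach costs on the order of 2 x B x T multiply-accumulate operations per output sample before the vocal tract filter is even applied.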
However, the conventional technologies have a problem in that a waveform cannot be generated at high speed, because band-pass filters must be applied to both the noise signal and the pulse signal when reproduced speech is generated.