This invention relates to a speech signal processing method employed for a speech synthesis system. More particularly, it relates to a speech signal processing method advantageously employed for a post-filter of a speech synthesis system of a multiband excitation (MBE) speech decoder.
A variety of encoding methods are known that compress signals by exploiting the statistical characteristics of speech signals in the time domain and in the frequency domain, together with human psychoacoustic characteristics. These speech encoding methods may be roughly divided into encoding in the time domain, encoding in the frequency domain and analysis-by-synthesis encoding.
Known practical examples of speech signal encoding include multiband excitation (MBE) coding, single band excitation (SBE) coding, harmonic coding, sub-band coding, linear predictive coding (LPC), and coding based on the discrete cosine transform (DCT), the modified DCT (MDCT) and the fast Fourier transform (FFT).
In a speech analysis-synthesis system centered on processing in the frequency domain, such as the above-mentioned MBE coding system, quantization error frequently produces spectral distortion, and signal deterioration becomes acute in the high frequency range, to which only a small number of bits is allocated. The result is a loss of clarity and nasalized speech, caused by power loss or disappearance of the high-range formants or by power loss over the entire high frequency range. This is particularly the case with the speech of a male speaker having a low pitch and a high content of harmonics: if the harmonics are added with zero phase during cosine synthesis, acute peaks are generated at each pitch period, thus producing nasalized speech.
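The peaking caused by zero-phase addition can be illustrated numerically. The following is a minimal sketch, not the synthesis routine of any particular MBE decoder: it assumes equal-amplitude harmonics, a sampling rate of 8000 Hz and a pitch of 100 Hz, all chosen only for illustration, and the function names are hypothetical. It compares the crest factor (peak divided by RMS) of one pitch period synthesized with zero phase against the same harmonics with random phases.

```python
import math
import random

def synthesize_period(f0_hz, fs_hz, phases):
    """Sum of unit-amplitude cosines at harmonics of f0 over one pitch period.

    phases[k] is the initial phase of harmonic k+1 (illustrative assumption:
    all harmonics have amplitude 1).
    """
    n_samples = int(round(fs_hz / f0_hz))
    return [
        sum(math.cos(2.0 * math.pi * (k + 1) * f0_hz * n / fs_hz + ph)
            for k, ph in enumerate(phases))
        for n in range(n_samples)
    ]

def crest_factor(x):
    """Peak absolute value divided by RMS value."""
    peak = max(abs(v) for v in x)
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return peak / rms

FS_HZ = 8000.0       # assumed sampling rate
F0_HZ = 100.0        # assumed low (male) pitch
NUM_HARMONICS = 39   # harmonics up to 3900 Hz, below the Nyquist frequency

# Zero-phase addition: every cosine reaches its maximum at n = 0.
zero_phase = synthesize_period(F0_HZ, FS_HZ, [0.0] * NUM_HARMONICS)

# Same harmonics with random phases (fixed seed for repeatability).
rng = random.Random(0)
random_phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(NUM_HARMONICS)]
random_phase = synthesize_period(F0_HZ, FS_HZ, random_phases)

crest_zero = crest_factor(zero_phase)
crest_random = crest_factor(random_phase)
```

Under these assumptions all K cosines align at the start of each pitch period, so the zero-phase crest factor equals sqrt(2K), whereas the random-phase waveform has the same RMS value but a lower peak. The more harmonics a low-pitched voice contains, the sharper the zero-phase peak becomes.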
To compensate for this drawback, a formant emphasis filter operating in the time domain, such as an infinite impulse response (IIR) filter, has been employed. In such case, however, the filter coefficients for formant emphasis must be calculated for each speech processing frame, thus rendering real-time processing difficult. In addition, filter stability must be taken into account, so that the effect obtained is not proportionate to the quantity of arithmetic-logic operations.
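The per-frame cost of such a time-domain filter can be made concrete with a sketch. The text does not specify the filter form. The example below assumes one common textbook pole-zero form, H(z) = A(z/beta) / A(z/alpha) with 0 < beta < alpha < 1, where A(z) = 1 + a_1 z^-1 + ... + a_p z^-p is a per-frame LPC polynomial. The function names and the values alpha = 0.8 and beta = 0.5 are illustrative assumptions.

```python
import cmath
import math

def bandwidth_expand(lpc, gamma):
    """Coefficients of A(z/gamma) for A(z) = 1 + sum_i a_i z^-i, lpc = [a_1..a_p]."""
    return [1.0] + [a * gamma ** (i + 1) for i, a in enumerate(lpc)]

def postfilter_frame(x, lpc, alpha=0.8, beta=0.5):
    """Filter one frame with H(z) = A(z/beta) / A(z/alpha).

    Both coefficient sets are recomputed for every frame, which is the
    per-frame overhead referred to in the text. Filter memory is reset to
    zero at the frame start to keep the sketch short.
    """
    num = bandwidth_expand(lpc, beta)
    den = bandwidth_expand(lpc, alpha)
    y = []
    for n in range(len(x)):
        acc = sum(num[i] * x[n - i] for i in range(len(num)) if n - i >= 0)
        acc -= sum(den[i] * y[n - i] for i in range(1, len(den)) if n - i >= 0)
        y.append(acc)
    return y

def gain_at(lpc, omega, alpha=0.8, beta=0.5):
    """Magnitude of H(e^{j omega}) for the same pole-zero filter."""
    z_inv = cmath.exp(-1j * omega)
    num = sum(c * z_inv ** i for i, c in enumerate(bandwidth_expand(lpc, beta)))
    den = sum(c * z_inv ** i for i, c in enumerate(bandwidth_expand(lpc, alpha)))
    return abs(num / den)

# Illustrative single "formant": pole pair of radius 0.95 at angle pi/4.
R, THETA = 0.95, math.pi / 4.0
example_lpc = [-2.0 * R * math.cos(THETA), R * R]

gain_formant = gain_at(example_lpc, THETA)    # gain at the formant frequency
gain_far = gain_at(example_lpc, math.pi)      # gain far from the formant
impulse_response = postfilter_frame([1.0] + [0.0] * 199, example_lpc)
```

In this form the poles of H(z) are the roots of A(z) scaled by alpha, so the filter is stable only if the roots of A(z) already lie inside the unit circle. This is one reason stability has to be checked whenever the coefficients change from frame to frame.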
If the spectral valleys in the low frequency range are suppressed at all times, a modulated noise, heard as a "shuru-shuru" sound, is produced in the unvoiced (UV) portions. On the other hand, if formant emphasis is performed at all times, spectral distortion arises as a side effect, giving the impression that two speakers are talking simultaneously.