This invention relates to a speech signal processor.
Attention has been drawn to techniques for extracting feature parameters, such as spectral information and excitation source information, from the speech signal and transmitting them at a reduced bit rate. Of these techniques, the linear predictive coding (LPC) technique is extensively used because of its simple processing. The LPC technique involves, on the transmission side, extracting linear predictive coefficients as spectral information and the predictive residual as excitation source information from the speech signal and, on the receiver side, determining the weight coefficients of a synthesizing filter from the spectral information and exciting the filter with the excitation source information to synthesize reproduced speech. The speech synthesizer for such an LPC technique is usually provided with a synthesizing filter including a feedback loop. This makes the circuit construction complex and reduces the stability of the synthesizing filter in the presence of transmission errors and other disturbances.
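The feedback loop of such a synthesizing filter, and its sensitivity to a corrupted coefficient, can be illustrated with a minimal sketch of an all-pole filter. The function name `lpc_synthesize`, the sign convention of the coefficients, and the numeric values are assumptions chosen for illustration and are not taken from any particular prior-art synthesizer.

```python
def lpc_synthesize(coeffs, excitation):
    """All-pole synthesis: s[n] = e[n] + sum_k coeffs[k-1] * s[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        # Feedback term: weighted sum of previously synthesized samples.
        feedback = sum(a * out[n - k]
                       for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(e + feedback)
    return out


# Hypothetical first-order example driven by a unit impulse.
impulse = [1.0] + [0.0] * 7
stable = lpc_synthesize([0.5], impulse)      # impulse response decays
corrupted = lpc_synthesize([1.1], impulse)   # impulse response grows
```

With a coefficient of 0.5 the impulse response decays, whereas a coefficient pushed to 1.1, as might result from a transmission error, makes the output grow from sample to sample. This is the kind of instability that a filterless synthesizer avoids.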
Under the circumstances, Sagayama et al. proposed a structurally very simple synthesizer needing no filter. Reference is made, for example, to "Composite Sinusoid Modeling Applied to Spectrum Analysis of Speech", Data S79-06 (May 1979), and "Speech Synthesis by Composite Sinusoidal Wave", Data S79-39 (Oct. 1979), Laboratory of Speech, The Acoustical Society of Japan. This technique is termed CSM (an acronym for Composite Sinusoid Model).
The CSM represents the speech signal as the sum of a set of sinusoidal waves, each having a freely selectable amplitude and frequency as parameters. The number of sinusoidal waves is predetermined and is at most 4 to 6. For CSM analysis, the frequency and amplitude (the CSM parameters) of each sinusoidal wave are determined for every analysis frame so that the lowest N orders of autocorrelation coefficients calculated directly from the speech signal are equal to the lowest N orders of autocorrelation coefficients of the corresponding synthesized wave.
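The matching criterion can be sketched as follows. The sketch does not solve for the CSM parameters; it only checks, for known parameters, that the autocorrelation of a sum of sinusoids agrees with the closed form r[k] = sum_i (a_i^2 / 2) cos(k w_i), which is the standard identity for the time-averaged autocorrelation of sinusoids. The function names, frequencies, and amplitudes are hypothetical.

```python
import math


def model_autocorrelation(omegas, amps, max_lag):
    """Closed form r[k] = sum_i (a_i^2 / 2) * cos(k * w_i)."""
    return [sum(0.5 * a * a * math.cos(k * w) for w, a in zip(omegas, amps))
            for k in range(max_lag + 1)]


def measured_autocorrelation(x, max_lag):
    """Time-averaged autocorrelation computed directly from the samples."""
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]


# Hypothetical two-sinusoid model; frequencies are in radians per sample.
omegas = [0.3, 1.1]
amps = [1.0, 0.6]
wave = [sum(a * math.cos(w * t) for w, a in zip(omegas, amps))
        for t in range(20000)]

model = model_autocorrelation(omegas, amps, 3)
measured = measured_autocorrelation(wave, 3)
```

In an actual analysis the direction is reversed: `measured` would come from a frame of speech, and the frequencies and amplitudes would be chosen so that `model` equals it at the lowest lags.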
Simple summation (combination) of the CSM sinusoids of every frequency cannot reproduce the corresponding original speech. To reproduce the original speech, it is necessary to attach a pitch structure and impart a pitch-synchronous envelope to the summed CSM signal. The term "attachment of the pitch structure" means that the phase of each sinusoidal wave is initialized to "0" every pitch period for voiced speech. This is done to spread the line spectrum structure so that it approaches the natural speech spectrum. For unvoiced speech as well, the line spectrum structure is spread, in this case by random phase initialization. The signal imparted with pitch structure in this way yields synthesized sound resembling speech. Initialization of the sinusoidal wave phase to zero is accompanied by discrete jumps in the waveform. To smooth out such jumps, the synthesized speech signal is multiplied by an envelope synchronous with the pitch of the speech signal, for example an attenuation curve following an exponential function.
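The voiced-speech case described above can be sketched as follows, assuming sine waves whose phase restarts at zero every pitch period and an exponentially decaying envelope that also restarts every pitch period. The function name `synthesize_voiced`, the decay constant, the sampling rate, and the example frequencies are illustrative assumptions, not values from the cited CSM reports.

```python
import math


def synthesize_voiced(freqs_hz, amps, pitch_period, n_samples,
                      sample_rate=8000, decay=3.0):
    """Sum of sinusoids with pitch-synchronous phase reset and envelope."""
    out = []
    for t in range(n_samples):
        tau = t % pitch_period  # samples since the last phase reset
        # Every sinusoid restarts from phase 0 at the start of each period.
        s = sum(a * math.sin(2 * math.pi * f * tau / sample_rate)
                for f, a in zip(freqs_hz, amps))
        # Exponential attenuation within the period reduces the jump
        # that occurs when the phase is reset.
        envelope = math.exp(-decay * tau / pitch_period)
        out.append(envelope * s)
    return out


# Hypothetical frame: three sinusoids, 100 Hz pitch at 8 kHz sampling.
wave = synthesize_voiced([500.0, 1500.0, 2500.0], [1.0, 0.5, 0.25],
                         pitch_period=80, n_samples=240)
```

Because both the phase and the envelope restart every 80 samples, the output repeats exactly with the pitch period, which is what produces the harmonic spreading of the line spectrum around each sinusoid frequency.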
Additionally, the interval for the phase initialization mentioned above poses a problem if it is too narrow or too wide. Too narrow an initialization interval causes whitening, so that no spectrum envelope appears, while too wide an interval gives insufficient frequency spread to obtain an appropriate spectral envelope. The conventional CSM technique has a further problem: because random phase initialization is applied to produce unvoiced sound, initialization is inevitably performed at both too narrow and too wide intervals, with the result that good unvoiced speech is not obtained.
In the conventional CSM technique, the CSM parameters yielded by the analysis, such as the frequency and amplitude characterizing the individual sinusoidal waves, are quantized separately, leaving the relationships between parameters out of consideration. As a result, the quantization does not adequately exploit the characteristics of the CSM parameters, which impairs quantization efficiency.
At present, digital privacy telephone systems are widely used. In such systems, the analog speech signal is generally converted into digital codes and subjected to a specified coding before transmission to keep the information of the original speech secret; the received signals are decoded by the inverse of the coding and then D/A converted to reproduce the original speech signal. Such a digital communication system has the disadvantage of demanding high performance from the transmission line, in terms of transmission capacity and error rate.
There are also analog privacy telephone systems, which, for example, subject the speech signal to spectral inversion, or to spectral division and interchange of the relative positions of the bands, before transmission. Such a system generally requires a low transmission rate, but the spectrum envelope of the original speech signal remains in some form, which tends to defeat the privacy of the system.
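Spectral inversion can be illustrated with a discrete-time sketch in which each sample is multiplied by (-1)^n, so that a component at frequency f moves to fs/2 - f. This is an assumed digital stand-in for illustration only; the analog systems referred to above would perform the inversion with modulation and filtering circuits. The function names and test frequencies are hypothetical.

```python
import math


def invert_spectrum(samples):
    """Multiply by (-1)^n, mirroring frequency f to fs/2 - f."""
    return [-s if n % 2 else s for n, s in enumerate(samples)]


def tone(freq_hz, sample_rate, n_samples):
    """Cosine test tone at the given frequency."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


fs = 8000
original = tone(1000, fs, 64)
scrambled = invert_spectrum(original)   # 1000 Hz tone appears at 3000 Hz
mirrored = tone(3000, fs, 64)
restored = invert_spectrum(scrambled)   # applying the inversion twice undoes it
```

The inversion only mirrors the spectrum, so the shape of the spectrum envelope survives in reversed form, and a second inversion recovers the original signal.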