This invention relates to low-bit-rate digital encoding procedures for vocoder speech input. Such procedures do not reproduce the waveform of the speech signal itself. Instead, over successive time frames, they determine parameters that define an excitation signal and the characteristics of a filter which, driven by that excitation, produces a synthetic speech signal audibly resembling the original speech. Specifically, the invention concerns a multipulse method of generating the filter excitation signal.
The filter models the vocal tract, which is assumed to be invariant over short time spans on the order of 20 ms. It reproduces the short-term frequency spectrum of the speech signal, especially its maxima, or formants, which the human ear perceives more readily than the spectral minima. This filter can be implemented by various analog or digital means, providing channel, formant, or linear-prediction coding.
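For the linear-prediction case, the synthesis filter is commonly an all-pole filter 1/A(z) with A(z) = 1 - Σ a_k z^-k. The sketch below makes that assumption. The sign convention and the function name `lpc_synthesize` are illustrative choices, not taken from this document. The filter memory is carried between frames, so each 20 ms frame can use its own coefficients without restarting from silence.

```python
def lpc_synthesize(excitation, a, state=None):
    """Filter an excitation signal through the all-pole filter 1/A(z).

    Assumed convention: A(z) = 1 - sum_{k=1..p} a_k z^-k, so that
        s[n] = x[n] + sum_{k=1..p} a_k * s[n-k].
    a[k-1] holds a_k. `state` holds the last p outputs of the previous
    frame, most recent first, so the filter memory survives frame changes.
    Returns (output samples, new state).
    """
    p = len(a)
    history = list(state) if state is not None else [0.0] * p
    out = []
    for x in excitation:
        s = x + sum(a[k] * history[k] for k in range(p))
        out.append(s)
        history = ([s] + history)[:p]  # shift the output memory by one sample
    return out, history
```

For example, a single-pole filter with a_1 = 0.5 driven by a unit impulse gives `lpc_synthesize([1, 0, 0, 0], [0.5])[0] == [1.0, 0.5, 0.25, 0.125]`, a decaying exponential.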
The excitation signal that the vocal-tract modeling filter, or synthesis filter, requires in order to synthesize a speech signal must simulate the vocal excitation signal. The oldest known way of generating this signal is to switch between two sources:
a source of periodic pulses at the fundamental frequency (pitch) of the original speech signal, used for voiced sounds (vowels),
and a noise source, used for unvoiced sounds (fricatives).
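This classic two-source scheme can be sketched for a single frame as follows. The function name, the unit-pulse shape, and the use of Gaussian white noise are illustrative assumptions. For simplicity, the pulse train restarts at the beginning of each frame rather than keeping pitch phase across frame boundaries.

```python
import random


def classic_excitation(frame_len, voiced, pitch_period=None, gain=1.0, seed=None):
    """Excitation for one frame under the classic voiced/unvoiced switch.

    voiced=True : periodic pulse train, one pulse every `pitch_period` samples.
    voiced=False: white Gaussian noise (the "noise source").
    The voiced/unvoiced decision itself is assumed to be made elsewhere.
    """
    if voiced:
        if not pitch_period or pitch_period <= 0:
            raise ValueError("voiced frames need a positive pitch_period")
        return [gain if n % pitch_period == 0 else 0.0 for n in range(frame_len)]
    rng = random.Random(seed)  # seeded generator for reproducible noise
    return [gain * rng.gauss(0.0, 1.0) for _ in range(frame_len)]
```

Everything rests on the binary `voiced` flag. A wrong decision, or speech that is partly voiced and partly noisy, produces an excitation that matches the real vocal signal poorly.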
This approach requires a reliable distinction between voiced and unvoiced sounds. It also yields an excitation signal that is only loosely related to the vocal excitation signal. As a result, the synthesis filter produces a low-fidelity synthetic speech signal that is sometimes barely intelligible.
Another known way of generating the excitation signal for the synthesis filter, taught in particular in U.S. Pat. No. 4,472,832, gives this signal a waveform closer to that of the vocal excitation signal in order to obtain a synthetic speech signal of greater fidelity. In this method, the vocal-tract modeling synthesis filter is excited by a signal made up of pulses whose positions and amplitudes in each time frame are adjusted to minimize, within that frame, the difference between the synthesized speech signal and the speech signal to be encoded. The minimization follows a mean-squared-error criterion over the time frame under consideration. The error is given a so-called perceptual weighting that accounts for the human ear's lower sensitivity to distortions in the formant regions of the speech spectrum, where energy is relatively concentrated.
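The passage above does not specify the weighting filter. A common choice in multipulse and CELP coders is W(z) = A(z) / A(z/γ) with 0 < γ < 1, which attenuates the error near the formant peaks of 1/A(z). The sketch below assumes that filter. The names `weighting_filter` and `weighted_squared_error` and the default γ = 0.8 are illustrative, not taken from the patent.

```python
def weighting_filter(signal, a, gamma=0.8):
    """Apply the assumed weighting filter W(z) = A(z) / A(z/gamma).

    With A(z) = 1 - sum_k a_k z^-k (a[k-1] = a_k):
      numerator   (FIR): y[n] = x[n] - sum_k a_k x[n-k]
      denominator (IIR): w[n] = y[n] + sum_k a_k gamma^k w[n-k]
    """
    p = len(a)
    x_hist = [0.0] * p  # past inputs, most recent first
    w_hist = [0.0] * p  # past outputs, most recent first
    out = []
    for x in signal:
        y = x - sum(a[k] * x_hist[k] for k in range(p))
        w = y + sum(a[k] * gamma ** (k + 1) * w_hist[k] for k in range(p))
        out.append(w)
        x_hist = ([x] + x_hist)[:p]
        w_hist = ([w] + w_hist)[:p]
    return out


def weighted_squared_error(original, synthetic, a, gamma=0.8):
    """Energy of the perceptually weighted error over one frame."""
    err = [s - s_hat for s, s_hat in zip(original, synthetic)]
    return sum(e * e for e in weighting_filter(err, a, gamma))
```

With γ = 1 the numerator and denominator cancel and W(z) reduces to the identity. Smaller γ values broaden the bandwidths of the denominator poles, which strengthens the de-emphasis of error in the formant regions.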
The mean-squared-error minimization must be achieved with as few pulses as possible, to keep the bit rate required for transmitting the coded speech as low as possible. Because this problem has no direct solution, a set of discrete positions at which pulses may be placed is chosen, and the pulses are found by successive approximation. At each stage, the weighted mean-squared error is expressed for the pulse signal adopted at the previous stage plus a new pulse of unknown amplitude and position. For each possible position of the new pulse, the amplitude that cancels the partial derivative of this weighted mean-squared error with respect to that amplitude, taken as an independent variable, is determined. The position yielding the smallest weighted mean-squared error is then chosen, and the pulse signal for the current stage is taken to be that of the previous stage plus the new pulse thus defined.
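The stage described above can be written out as a short derivation. The notation is our own: e_{i-1}(n) is the weighted error remaining after i-1 pulses over a frame of N samples, h(n) is the impulse response of the weighted synthesis filter (with h(n) = 0 for n < 0), and the new pulse has position m and amplitude β.

```latex
E_i(m,\beta) = \sum_{n=0}^{N-1} \bigl[e_{i-1}(n) - \beta\, h(n-m)\bigr]^2

\frac{\partial E_i}{\partial \beta}
  = -2 \sum_{n} e_{i-1}(n)\, h(n-m) + 2\beta \sum_{n} h^2(n-m) = 0
\;\Longrightarrow\;
\beta^{*}(m) = \frac{\sum_{n} e_{i-1}(n)\, h(n-m)}{\sum_{n} h^2(n-m)}

E_i\bigl(m, \beta^{*}(m)\bigr)
  = E_{i-1} - \frac{\bigl[\sum_{n} e_{i-1}(n)\, h(n-m)\bigr]^2}{\sum_{n} h^2(n-m)},
\qquad E_{i-1} = \sum_{n} e_{i-1}^2(n)

m_i = \arg\max_{m} \frac{\bigl[\sum_{n} e_{i-1}(n)\, h(n-m)\bigr]^2}{\sum_{n} h^2(n-m)},
\qquad e_i(n) = e_{i-1}(n) - \beta^{*}(m_i)\, h(n - m_i)
```

Minimizing the error over m is therefore the same as maximizing the normalized squared correlation between the current error and the shifted impulse response. Only β*(m_i) is kept. The amplitudes of earlier pulses are never revisited.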
The successive approximation process is stopped after a number of iterations set according to the available computing capacity and the encoding bit rate.
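The full successive-approximation loop, stopped after a fixed number of pulses, might be sketched as follows. This sketch rests on several assumptions. `target` is the perceptually weighted speech for the frame, with the ringing of the filter from the previous frame already removed. `h` is the zero-state impulse response of the weighted synthesis filter. Every sample of the frame is a candidate position. `num_pulses` stands for the iteration budget. The function name is hypothetical.

```python
def multipulse_search(target, h, num_pulses):
    """Greedy multipulse search: add one pulse per stage.

    target: weighted speech to be matched over the frame (length N).
    h: impulse response of the weighted synthesis filter (zero state).
    num_pulses: iteration budget (set by computing capacity and bit rate).
    Returns ([(position, amplitude), ...], remaining weighted error).
    """
    n = len(target)
    residual = list(target)
    pulses = []
    for _ in range(num_pulses):
        best = None  # (error reduction, position, amplitude)
        for m in range(n):
            hm = h[: n - m]  # h delayed by m samples, truncated to the frame
            energy = sum(v * v for v in hm)
            if energy == 0.0:
                continue
            corr = sum(residual[m + j] * hj for j, hj in enumerate(hm))
            reduction = corr * corr / energy  # drop in squared error at position m
            if best is None or reduction > best[0]:
                best = (reduction, m, corr / energy)  # amplitude zeroing dE/d(beta)
        if best is None or best[0] <= 0.0:
            break  # no position reduces the error any further
        _, m, amp = best
        pulses.append((m, amp))
        # Subtract the new pulse's filtered contribution from the error signal.
        for j, hj in enumerate(h[: n - m]):
            residual[m + j] -= amp * hj
    return pulses, residual
```

Each pulse is fixed once it has been chosen, which keeps the cost per stage at one correlation per candidate position. It is also why errors made at early stages are never corrected later.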
The drawback of this approach is that errors accumulate from one stage to the next. This degrades the signal-to-noise ratio of the synthetic speech signal, and the degradation is particularly evident when synthesizing high-pitched voices.
To overcome this drawback, it has been proposed to recalculate the optimal amplitudes of all the pulses jointly (reoptimization) once their positions have been determined. However, this solution requires solving a system of linear equations. That substantially increases the number of computations needed to determine the excitation signal and makes the solution rather impractical.
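The joint reoptimization mentioned above amounts to solving the normal equations Φβ = c for fixed positions m_1, ..., m_K. Here Φ[i][j] is the inner product of h delayed by m_i with h delayed by m_j, and c[i] is the inner product of the target with h delayed by m_i. The sketch below assumes the same `target` and `h` conventions as a greedy multipulse search. It uses plain Gaussian elimination for clarity. The function names are hypothetical, and a practical coder might instead exploit the symmetry of Φ, for example with a Cholesky factorization.

```python
def shifted(h, m, n):
    """h delayed by m samples, truncated or zero-padded to n samples."""
    out = [0.0] * n
    for j, hj in enumerate(h[: n - m]):
        out[m + j] = hj
    return out


def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def reoptimize_amplitudes(target, h, positions):
    """Jointly re-solve all pulse amplitudes for fixed pulse positions.

    Solves Phi @ beta = c by Gaussian elimination with partial pivoting,
    costing O(K^3) operations for K pulses, on top of forming Phi.
    """
    n = len(target)
    basis = [shifted(h, m, n) for m in positions]
    k = len(basis)
    # Augmented matrix [Phi | c].
    aug = [[_dot(basis[i], basis[j]) for j in range(k)] + [_dot(target, basis[i])]
           for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: duplicate or degenerate positions")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            f = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= f * aug[col][c]
    # Back substitution.
    beta = [0.0] * k
    for i in reversed(range(k)):
        s = aug[i][k] - sum(aug[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = s / aug[i][i]
    return beta
```

When the shifted impulse responses of different pulses overlap, the jointly solved amplitudes generally differ from those found one pulse at a time. The cubic cost of the solve, repeated every frame, is the computational burden described above.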