The present invention concerns digital speech signal processing techniques.
Many representations of speech signals take account of the harmonic content of such signals resulting from the manner in which they are produced. In most cases, this is reflected in the determination of a pitch frequency of the speech signal.
Digital processing of speech signals has recently expanded greatly in varied domains: speech coding for transmission and storage, speech recognition, noise reduction, echo cancellation, etc. Such processing very frequently uses an estimate of the pitch frequency and particular operations related to the estimated frequency.
Many methods have been developed for estimating the pitch frequency. One routinely used method is based on linear prediction, which evaluates a prediction delay that is inversely proportional to the pitch frequency. The delay can be expressed as an integer or fractional number of digital signal sample times. Other methods directly detect breaks in the signal which can be attributed to glottal closures of the speaker, the time intervals between such breaks being inversely proportional to the pitch frequency.
If the digital speech signal is transformed into the frequency domain, as by a discrete Fourier transform, it is necessary to consider a discrete spectrum of the speech signal. The discrete frequencies considered are of the form (a/N)×Fe, where Fe is the sampling frequency, N is the number of samples of the blocks used in the discrete Fourier transform and a is an integer from 0 to N/2−1. These frequencies do not necessarily include the estimated pitch frequency and/or its harmonics. This causes inaccuracy in operations relating to the estimated pitch, which can cause distortion of the processed signal, affecting its harmonic character.
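The mismatch between the discrete frequencies and the pitch can be illustrated with a minimal Python sketch. The values Fe = 8000 Hz, N = 256 and a pitch of 110 Hz are assumptions chosen for illustration only, and the function names are hypothetical.

```python
def dft_bin_frequencies(fe, n):
    """Discrete frequencies (a/N)*Fe for a = 0 .. N/2 - 1."""
    return [a * fe / n for a in range(n // 2)]


def nearest_bin_error(f0, fe, n):
    """Return the discrete frequency closest to f0 and its distance from f0."""
    bins = dft_bin_frequencies(fe, n)
    nearest = min(bins, key=lambda f: abs(f - f0))
    return nearest, abs(nearest - f0)


# Illustrative values only (not taken from the description).
fe, n, f0 = 8000.0, 256, 110.0
nearest, err = nearest_bin_error(f0, fe, n)
print(nearest, err)  # 125.0 15.0: no discrete frequency coincides with the pitch
```

With a spacing of Fe/N = 31.25 Hz between discrete frequencies, the closest one lies 15 Hz away from the assumed pitch, and the harmonics of the pitch are similarly misaligned.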
A principal object of the present invention is to propose a method of conditioning the speech signal which makes it less sensitive to the above drawbacks.
The invention therefore proposes a method of conditioning a digital speech signal processed by successive frames, wherein harmonic analysis of the speech signal is performed to estimate a pitch frequency of the speech signal over each frame in which it features vocal activity. After estimating the pitch frequency of the speech signal over one frame, the speech signal of the frame is conditioned by oversampling it at an oversampling frequency which is a multiple of the estimated pitch frequency.
In processing the speech signal, this enables the frequencies closest to the estimated pitch to be favoured over other frequencies. The harmonic character of the speech signal is therefore preserved as far as possible. To compute spectral components of the speech signal, the conditioned signal is distributed between blocks of N samples which are transformed into the frequency domain and the ratio between the oversampling frequency and the estimated pitch frequency is chosen as a factor of the number N.
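As a minimal sketch of this choice, the Python code below selects a ratio p that is a factor of N and resamples a frame at the frequency p times the estimated pitch. The rule used to select p (the smallest factor of N for which the new frequency is at least Fe) and the use of linear interpolation are assumptions made for illustration; the function names are hypothetical.

```python
def oversampling_ratio(f0, fe, n):
    """Smallest factor p of n such that p * f0 >= fe (assumed selection rule)."""
    for p in range(1, n + 1):
        if n % p == 0 and p * f0 >= fe:
            return p
    raise ValueError("pitch too low for this block length")


def resample_linear(x, fe, fe_new):
    """Resample x from fe to fe_new by linear interpolation (simple stand-in)."""
    n_out = int((len(x) - 1) * fe_new / fe) + 1
    out = []
    for k in range(n_out):
        t = k * fe / fe_new            # position in original sample units
        i = min(int(t), len(x) - 2)
        frac = t - i
        out.append((1 - frac) * x[i] + frac * x[i + 1])
    return out


# Illustrative values only.
fe, n, f0 = 8000.0, 256, 110.0
p = oversampling_ratio(f0, fe, n)      # 128
fe_new = p * f0                        # 14080.0 Hz
pitch_bin = n // p                     # 2: (2/256) * 14080 = 110 Hz exactly
conditioned = resample_linear([float(i) for i in range(160)], fe, fe_new)
```

Because p divides N, the pitch coincides with the discrete frequency of index N/p and its harmonics coincide with integer multiples of that index, as far as the pitch estimate is accurate.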
The foregoing technique can be refined by estimating the pitch frequency of the speech signal over a frame in the following manner:
estimating time intervals between two consecutive breaks of the signal which can be attributed to glottal closures of the speaker occurring during the frame, the estimated pitch frequency being inversely proportional to said time intervals;
interpolating the speech signal in said time intervals, so that the conditioned signal resulting from such interpolation has a constant time interval between two consecutive breaks.
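The two steps above can be sketched in Python as follows. The sketch assumes that the positions of the breaks (in samples) have already been detected, and it uses linear interpolation to map each interval onto a fixed number p of samples; both the function name and the choice of interpolator are hypothetical.

```python
def equalize_pitch_periods(x, closures, p):
    """Resample each interval between consecutive breaks to exactly p samples.

    x        -- samples of the frame
    closures -- increasing sample indices of the detected breaks (assumed given)
    p        -- number of output samples per interval
    """
    out = []
    for start, end in zip(closures, closures[1:]):
        length = end - start           # local pitch period, in input samples
        for k in range(p):
            t = start + k * length / p
            i = int(t)
            frac = t - i
            nxt = min(i + 1, len(x) - 1)
            out.append((1 - frac) * x[i] + frac * x[nxt])
    return out


# Illustrative frame whose pitch period drifts from 70 to 75 samples.
frame = [float(i % 7) for i in range(220)]
breaks = [0, 70, 145, 215]
conditioned = equalize_pitch_periods(frame, breaks, 128)
```

In the output, the breaks fall at indices 0, 128 and 256, so the interval between two consecutive breaks is constant even though the input intervals were 70, 75 and 70 samples.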
This approach artificially constructs a signal frame over which the speech signal features breaks at constant intervals. Any variations of the pitch over the duration of a frame are therefore taken into account.
In a further improvement, after each conditioned signal frame has been processed, the number of signal samples retained from the output of such processing is equal to an integer multiple of the ratio between the sampling frequency and the estimated pitch frequency. This avoids the distortion problems caused by phase discontinuities between frames, which are generally not totally corrected by conventional overlap-add techniques.
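The arithmetic can be illustrated with a short Python sketch. It assumes that the ratio between the sampling frequency and the estimated pitch frequency is an integer number of samples per pitch period, and it keeps the largest whole number of periods that fits in the processed frame; the function name and the numerical values are hypothetical.

```python
def samples_to_retain(frame_len, samples_per_period):
    """Largest integer multiple of samples_per_period not exceeding frame_len."""
    return (frame_len // samples_per_period) * samples_per_period


# Illustrative values: 280 processed samples, 128 samples per pitch period.
kept = samples_to_retain(280, 128)
print(kept)  # 256, i.e. exactly two pitch periods
```

Since each retained segment then spans a whole number of pitch periods, consecutive segments join at the same phase of the pitch cycle.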
Using the oversampling technique to condition the signal yields a good measurement of the degree of voicing of the speech signal over the frame, based on the entropy of the autocorrelation of the spectral components computed on the basis of the conditioned signal. The greater the disturbance of the spectrum, i.e. the more voiced the signal is, the lower the entropy values. Conditioning the speech signal accentuates the irregularity of the spectrum and therefore the entropy variations, with the result that the entropy constitutes a sensitive measurement.
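A minimal Python sketch of such an entropy measurement is given below. The exact normalisation is not specified in this passage, so the sketch assumes that the autocorrelation of the spectral components is normalised to sum to one before the entropy is computed; the function name and the test spectra are hypothetical.

```python
import math


def autocorrelation_entropy(spectrum):
    """Entropy of the normalised autocorrelation of spectral magnitudes.

    Assumption: the autocorrelation over frequency lags is divided by its sum
    so that it can be treated as a probability distribution.
    """
    n = len(spectrum)
    acf = [sum(spectrum[f] * spectrum[f + k] for f in range(n - k))
           for k in range(n)]
    total = sum(acf)
    if total <= 0:
        return 0.0
    probs = [a / total for a in acf]
    return -sum(q * math.log(q) for q in probs if q > 0)


# Illustrative spectra: regular harmonic peaks versus a flat spectrum.
harmonic = [1.0 if f % 8 == 0 else 0.01 for f in range(64)]
flat = [1.0] * 64
h_voiced = autocorrelation_entropy(harmonic)
h_unvoiced = autocorrelation_entropy(flat)
```

For the harmonic spectrum the autocorrelation concentrates on lags that are multiples of the peak spacing, so its entropy is lower than that of the flat spectrum, in line with the behaviour described above.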
In the remainder of this description, the conditioning method according to the invention is illustrated in a system for suppressing noise in a speech signal. Clearly the method can find applications in many other types of digital speech processing: coding, recognition, echo cancellation, etc.