In ASR systems, when recognition is performed locally for voice control of "hands-free" telephones, computers, data terminals, or the like, known techniques seek to reduce disturbances introduced by additive noise. They include, in particular, filtering by spectrum subtraction, antenna filtering, Markov model state filtering, or in-line addition of room noise to reference models.
Markov state filtering consists in applying a spectrum subtraction filter (a Wiener filter) knowing the Markov model of the speech and the most probable state in which the system is to be found at an instant t. The clean signal model is given by the state of the Markov model, and the noise model is estimated from the silences preceding the word from which noise is to be removed.
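By way of illustration only, the filtering described above can be sketched as follows. The sketch assumes that each Markov state supplies a clean-speech power spectrum, that the noise power spectrum is the average over the silence frames preceding the word, and that the filter is the classical Wiener gain S/(S+N) applied bin by bin to a magnitude spectrum. The function names and the list-based data layout are hypothetical and are not taken from any cited work.

```python
def estimate_noise_power(silence_frames):
    """Average the power spectra of the leading silence frames, bin by bin."""
    n_frames = len(silence_frames)
    n_bins = len(silence_frames[0])
    return [sum(frame[k] for frame in silence_frames) / n_frames
            for k in range(n_bins)]


def wiener_gain(state_clean_power, noise_power):
    """Per-bin Wiener gain H = S / (S + N).

    S is the clean-speech power given by the most probable Markov state
    (an assumed representation); N is the estimated noise power.
    """
    gains = []
    for s, n in zip(state_clean_power, noise_power):
        total = s + n
        gains.append(s / total if total > 0.0 else 0.0)
    return gains


def filter_frame(noisy_magnitude, state_clean_power, noise_power):
    """Attenuate each bin of a noisy magnitude spectrum by the Wiener gain."""
    gains = wiener_gain(state_clean_power, noise_power)
    return [h * x for h, x in zip(gains, noisy_magnitude)]
```

In such a sketch, the gain approaches 1 in bins where the state model predicts speech power well above the noise, and approaches 0 in bins dominated by noise.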
For centralized recognition, the purpose of known techniques is to reduce the effects of telephone lines by subtracting the DC component from cepstrum vectors as estimated over a sufficiently broad horizon. For a digital telephone signal subdivided into windows, the notion of "horizon" designates a given integer number of successive windows. For a more detailed description of that type of approach, reference may be made to the article by C. Mokbel, J. Monne, and D. Jouvet, entitled "On-line adaptation of a speech recognizer to variations in telephone line conditions", Eurospeech, pp. 1247-1250, Berlin 1993. For a horizon that is broad enough, it is observed that the mean of the cepstrum vectors represents the effects of telephone lines, with this observation being particularly true when changes of channel characteristics take place slowly.
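A minimal sketch of such a subtraction is given below, purely as an illustration. It assumes that each window is represented by a list of cepstral coefficients and that the horizon is causal, i.e. that it covers the current window and the windows immediately preceding it. The function name and that choice of horizon are assumptions and do not reproduce the method of the cited article.

```python
def subtract_cepstral_mean(cepstra, horizon):
    """Subtract from each cepstrum vector the mean over the last `horizon` windows.

    cepstra: list of windows, each a list of cepstral coefficients.
    horizon: integer number of successive windows used for the mean
             (assumed causal: the current window and those before it).
    """
    if horizon < 1:
        raise ValueError("horizon must be a positive integer")
    dim = len(cepstra[0]) if cepstra else 0
    running_sum = [0.0] * dim
    equalized = []
    for t, frame in enumerate(cepstra):
        # Add the newest window to the running sum.
        for k in range(dim):
            running_sum[k] += frame[k]
        # Drop the window that has just left the horizon.
        if t >= horizon:
            oldest = cepstra[t - horizon]
            for k in range(dim):
                running_sum[k] -= oldest[k]
        count = min(t + 1, horizon)
        equalized.append([frame[k] - running_sum[k] / count
                          for k in range(dim)])
    return equalized
```

Because a constant convolutive channel appears as a constant additive offset in the cepstral domain, adding the same offset to every window leaves the output of this sketch unchanged.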
In general, a system for removing noise or for equalization is based on knowing the characteristics of the clean signal and the characteristics of the noise or the disturbances. Unfortunately, the problem becomes much more complicated if the model of the clean signal or its parameters are unknown.
For example, if it is assumed that a segment of clean speech is the output from an autoregressive system whose parameters are unknown, an "estimate-maximize" (EM) type method can be used for removing noise so as to obtain an estimate of the parameters of the autoregressive model and so as to filter out disturbances (see for example the article by G. Celeux and J. Diebolt, entitled "Une version de type recuit simulé de l'algorithme EM" [A simulated annealing type version of the EM algorithm], Rapports de Recherche No. 1123, Programme 5, INRIA, November 1989).
It is also possible to use blind equalization, which relies on the statistics specific to the digital signal to determine the criterion for adapting the coefficients of the equalizer. In particular, document FR-A-2 722 631 describes an adaptive filter method and system using blind equalization of a digital telephone signal and the application thereof to telephone transmission and/or to ASR. The method described in that document is based entirely on general statistics relating to the speech signal and on the assumption that the telephone channel has a convolutive effect that is almost constant.
Such approaches give satisfactory results if simple assumptions can be made about the clean signal, i.e. if it can be assumed to be autoregressive and/or Gaussian and/or stationary, but that is not always possible.
Other recent studies seek to use statistical vocabulary models in order to reduce the disturbances and/or variability of the speech signal, thereby enabling recognition to be more robust.
All of the above work suffers from the drawback of being incapable of providing an in-line application in a manner that is synchronous with the sound frame. The methods proposed wait until the end of the signal that is to be recognized, and then perform iterations for estimating biases before identifying the signal after noise removal or equalization. Further, the estimators of the bias to be subtracted depend directly, or indirectly in an "estimate-maximize" method, on the best path in the Markov model, where a path or alignment in the Markov sense is an association between a run of sound frames and a run of states (or transitions) to which the probability densities of the model correspond. This dependency risks biasing the approach if the initial observation is highly disturbed, since such disturbances can give rise to false alignments.
The approach proposed by the present invention differs fundamentally from the approaches described previously: it is more general, and it remedies the above-mentioned drawbacks in that it is synchronous with the sound frame.