This invention was supported in part by a grant from the National Science Foundation (IRI-8720403) and in part by SRI International of Menlo Park, Calif.
This invention relates to speech recognition in the presence of noise, and more particularly to a method for preprocessing speech for use in connection with a speech recognition system.
Speech recognition systems are very sensitive to differences between a training condition which is free of noise and operating conditions in the presence of noise. In particular, speech recognition systems can be trained to recognize specific speech patterns in the absence of noise and are therefore trained on high-quality speech. However, such systems degrade drastically in noisy environments.
Several methods for handling this problem are known. Among them are methods of supplementing the acoustic preprocessing of a speech recognizer with a statistical estimator. A statistical estimator as used herein is intended to provide to a speech recognizer input values or signals which can be assumed to be clean speech information.
The task of designing a statistical estimator for speech recognition is that of defining an optimality criterion that will match the recognizer and of deriving an algorithm to compute the estimator based on this criterion. Defining the optimality criterion is easier for speech recognition than it is for speech enhancement for human listeners, since the signal processing technique is known in the former but not in the latter. For a recognition system which is based on a distance metric, whether for template matching or vector quantization, it is reasonable to assume that the optimality criterion is to minimize the average distortion as measured by that distance metric. Achieving this criterion is frequently computationally infeasible.
With discrete Fourier transform (DFT) filter-bank based systems, the distance measure typically used is a weighted Euclidean distance on the cosine transform of the logarithm of the output energy of the filters, often referred to as the "liftered cepstral distance." (The cepstrum in a filter-bank system is defined as a transform of the filter energies.) Achieving the optimality criterion with this distance metric is computationally difficult in the presence of additive noise. Published estimation algorithms which have been applied to filter-bank based systems are the minimum mean square error (MMSE) algorithm and the spectral subtraction algorithm, applied to either DFT coefficients or filter-bank output energies. (See the references to Porter et al. and Van Compernolle 1 and 2 discussed below.) A basic difference between the multiple-dimensional cepstral distance optimality criterion and the single frequency channel MMSE distance criterion is that the cepstral distance implies a joint estimation of a feature vector, whereas the MMSE distance implies an independent estimation of scalar quantities. Because the speech spectral energies at different frequencies are in fact correlated, an independent estimate of individual frequency channels results in suboptimal estimation.
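By way of illustration only, the liftered cepstral distance described above may be sketched as follows. The sketch assumes a DCT-II form of the cosine transform and omits the zeroth coefficient; the function names and the lifter weights are hypothetical and are not taken from any of the systems referred to herein.

```python
import math


def cepstrum(filter_energies, num_coeffs):
    """Cosine transform of the log filter-bank output energies.

    Assumption: DCT-II basis, coefficients 1..num_coeffs (zeroth omitted).
    """
    n = len(filter_energies)
    log_e = [math.log(e) for e in filter_energies]
    return [
        sum(log_e[k] * math.cos(math.pi * i * (k + 0.5) / n) for k in range(n))
        for i in range(1, num_coeffs + 1)
    ]


def liftered_cepstral_distance(energies_a, energies_b, weights):
    """Weighted Euclidean distance between two cepstral vectors.

    `weights` are illustrative lifter weights, one per cepstral coefficient.
    """
    num_coeffs = len(weights)
    cep_a = cepstrum(energies_a, num_coeffs)
    cep_b = cepstrum(energies_b, num_coeffs)
    return math.sqrt(
        sum(w * (a - b) ** 2 for w, a, b in zip(weights, cep_a, cep_b))
    )


# Example with four filter channels and three cepstral coefficients.
frame_a = [1.0, 2.0, 4.0, 8.0]
frame_b = [2.0, 2.0, 3.0, 9.0]
lifter = [1.0, 4.0, 9.0]  # arbitrary weights chosen for illustration
distance = liftered_cepstral_distance(frame_a, frame_b, lifter)
```

In this sketch, omitting the zeroth coefficient makes the distance insensitive to a uniform gain applied to all filter energies. Each cepstral coefficient depends on every filter channel, which is why minimizing this distance calls for a joint estimate of the feature vector rather than independent estimates of the individual channels.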
This art presumes a basic familiarity with statistics and Markov processes, as well as familiarity with the state of the art in speech recognition systems using hidden Markov models. By way of example of the state of the art, reference is made to the following patents and publications, which have come to the attention of the inventors in connection with the present invention. Not all of these references may be deemed to be relevant prior art.
______________________________________
Inventor          U.S. Pat. No.      Issue Date
______________________________________
Bahl et al.       4,817,156          03/28/89
Levinson et al.   4,587,670          05/06/86
Juang et al.      4,783,804          11/08/88
Bahl et al.       4,741,036          04/26/88
______________________________________
                  Foreign Pat. No.   Pub. Date
______________________________________
Sedgwick et al.   EP 240,330         10/07/87
______________________________________