Conventional noise suppression techniques for speech recognition may be roughly classified into the following two types.
(a) The noise component is subtracted from an input signal using a signal processing technique.
(b) An acoustic model and a noise model are synthesized on a decoder to create a noise adapted acoustic model.
In the present specification, noise designates any signal other than the target speech signal. In addition to background noise, which is considered relatively stationary, it includes, for example, unexpectedly occurring noise, reverberation, echo, and the speech of speakers other than the target speaker.
According to Patent Document 1, the techniques (a) and (b) are classified as processing by the front end and processing by the decoder, respectively.
A method widely used as the signal processing technique (a) is a “spectrum subtraction method (abbreviated as SS method)”.
FIG. 10 is a diagram showing a typical configuration of a system for implementing this SS method. Referring to FIG. 10, the system includes an input signal acquisition unit 1 for acquiring an input signal (spectrum X), a unit 2 for calculating a noise mean spectrum (N), and a unit 3c for subtracting the noise mean spectrum from the input signal to calculate an estimate speech (provisional estimate speech S′).
The system of this configuration has the following advantages.
The amount of computation is small.
The system may readily be used in combination with other techniques, such as a technique of updating the noise mean spectrum.
However, if the noise mean spectrum is simply subtracted from the input signal, the subtraction generates residual noise (musical noise), owing to the variance components of the noise or to the phase difference between the speech and the noise. Such residual noise may give rise to recognition errors.
Thus, in the SS method, it is necessary to carry out flooring, a process that buries the information in the valleys of the speech. If the flooring level is raised, the residual noise generated in the subtraction process may be suppressed; however, the recognition performance may be degraded because the information in the valleys of the speech has been buried.
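The subtraction and flooring described above may be sketched as follows. This is only an illustrative sketch and not the configuration of FIG. 10 itself: the function name, the over-subtraction factor `alpha`, the flooring coefficient `beta`, and the choice of a floor proportional to the input spectrum are all assumptions introduced here, and the spectra are made-up values.

```python
def spectral_subtraction(x, noise_mean, alpha=1.0, beta=0.1):
    """Subtract the noise mean spectrum N from the input spectrum X, bin by bin.

    alpha (over-subtraction factor) and beta (flooring coefficient) are
    hypothetical tuning parameters, not values taken from the text.
    """
    if len(x) != len(noise_mean):
        raise ValueError("spectra must have the same number of bins")
    estimate = []
    for x_k, n_k in zip(x, noise_mean):
        s_k = x_k - alpha * n_k      # provisional estimate speech S'
        floor_k = beta * x_k         # assumed flooring level, tied to the input
        estimate.append(max(s_k, floor_k))
    return estimate


# One frame of a made-up 4-bin power spectrum.
X = [10.0, 2.0, 0.5, 6.0]
N = [1.0, 1.5, 1.0, 1.0]
S = spectral_subtraction(X, N)
```

In the third bin the subtraction result is negative, so the floor replaces it. A larger `beta` hides more of the residual noise but also overwrites more of the low-energy bins, which is the trade-off noted above.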
Patent Document 1, Non-Patent Document 2 and Non-Patent Document 6 disclose a technique of calculating a noise reducing filter using a smoothed a priori SNR (estimate speech divided by the noise mean spectrum).
Referring to FIG. 11, this system includes, in addition to the configuration shown in FIG. 10, a unit 6 for calculating a noise reducing filter and a unit 7 for calculating the estimate speech. The system of FIG. 11 uses smoothing to reduce the residual noise, which is a problem inherent in the above SS method.
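A filter driven by a smoothed a priori SNR may be sketched as follows. The sketch assumes a simple recursive smoothing over frames and a Wiener-type gain applied directly to the power spectrum; the gain functions actually used in the cited documents may differ. The function name and the parameters `smoothing` and `snr_floor` are hypothetical, and the input frames are made-up values.

```python
def wiener_filter_frames(frames, noise_mean, smoothing=0.98, snr_floor=1e-3):
    """Apply a Wiener-type gain driven by a recursively smoothed a priori SNR.

    frames: list of power spectra X (one list of bins per frame).
    noise_mean: noise mean spectrum N (one value per bin, all > 0).
    smoothing and snr_floor are hypothetical tuning parameters.
    """
    prev_speech = [0.0] * len(noise_mean)   # estimate speech of the previous frame
    output = []
    for x in frames:
        estimate = []
        for k, (x_k, n_k) in enumerate(zip(x, noise_mean)):
            provisional = max(x_k - n_k, 0.0)        # S' as in the SS method
            # Smoothed a priori SNR: blend the previous estimate with S'/N.
            xi = (smoothing * prev_speech[k] / n_k
                  + (1.0 - smoothing) * provisional / n_k)
            xi = max(xi, snr_floor)
            gain = xi / (1.0 + xi)                   # noise reducing filter
            estimate.append(gain * x_k)              # estimate speech S
        prev_speech = estimate
        output.append(estimate)
    return output


# One bin, three frames: noise only, then two frames containing speech.
frames = [[1.0], [11.0], [11.0]]
noise = [1.0]
smoothed = wiener_filter_frames(frames, noise, smoothing=0.98)
unsmoothed = wiener_filter_frames(frames, noise, smoothing=0.0)
```

With `smoothing=0.98` the gain in the first speech frame is much lower than in the second, whereas with `smoothing=0.0` the filter reacts immediately. The sketch thus shows how the amount of smoothing governs how quickly the filter follows a change in the input.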
If smoothing is carried out thoroughly, the residual noise in the subtraction may be suppressed; however, problems persist, such as dropout of the beginning portion of the speech and difficulty in detecting the terminal portion of the speech.
That is, the signal processing technique suffers from the following problems:
Processing such as flooring or smoothing, which leads to dropout of the information of the original speech, has to be carried out.
If the information dropout is to be kept to a minimum while the residual noise generated in the subtraction process is suppressed, it is necessary to carry out parameter tuning, depending on the sort of the noise and on the SNR.
It is therefore difficult to make universal use of the signal processing technique.
Turning to the technique of (b) for adapting the acoustic model to the noise, there is widely known the “Parallel Model Combination (PMC) Method” disclosed in Non-Patent Document 3.
This technique uses a unit for formulating a noise model, an acoustic model HMM, learned in advance in a noise-free environment, a unit for transforming the noise model to a linear spectrum, and a unit for transforming the acoustic model HMM to a linear spectrum. The technique also uses a unit for adding the noise model, transformed into the linear spectrum, and the acoustic model HMM, also transformed into the linear spectrum, to formulate a noise adapted acoustic model HMM, and a unit for transforming the noise adapted model so formulated to a cepstrum.
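The chain of transformations described above may be sketched for a single mean vector as follows. This is a simplified, means-only sketch (often called a log-add approximation) and not the full method of Non-Patent Document 3, which also handles the variances. It assumes an untruncated cepstrum, so that the DCT matrix is square and invertible; all function names, the `gain` parameter and the numerical values are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix (n x n); its transpose is its inverse."""
    rows = []
    for i in range(n):
        scale = math.sqrt(1.0 / n) if i == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * i * (2 * j + 1) / (2 * n))
                     for j in range(n)])
    return rows


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def cepstrum_to_linear(cep, dct):
    log_spec = matvec(transpose(dct), cep)    # inverse DCT: cepstrum -> log spectrum
    return [math.exp(v) for v in log_spec]    # log spectrum -> linear spectrum


def linear_to_cepstrum(lin, dct):
    return matvec(dct, [math.log(v) for v in lin])


def combine_means(speech_cep, noise_cep, gain=1.0):
    """Add speech and noise means in the linear spectrum, then return to cepstrum."""
    dct = dct_matrix(len(speech_cep))
    speech_lin = cepstrum_to_linear(speech_cep, dct)
    noise_lin = cepstrum_to_linear(noise_cep, dct)
    noisy_lin = [s + gain * n for s, n in zip(speech_lin, noise_lin)]
    return linear_to_cepstrum(noisy_lin, dct)


dct = dct_matrix(3)
speech_cep = matvec(dct, [2.0, 1.0, 0.0])   # made-up clean log spectrum
noise_cep = matvec(dct, [0.0, 0.0, 0.0])    # made-up noise log spectrum
noisy_cep = combine_means(speech_cep, noise_cep)
```

In an actual acoustic model, a combination of this kind would have to be repeated for every Gaussian distribution of every HMM state each time the noise model changes.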
The system of this configuration has the following advantages.
That is, since the acoustic model HMM has been adapted to the noise, recognition may be achieved without dependency on the sort of the noise or on the SNR.
However, there persist the following problems.
The computation for formulating the noise adapted acoustic model HMM is extremely costly.
It is not easy to use the technique in combination with other techniques, such as the technique for updating the noise mean spectrum.
As a method for adapting to the noise not the acoustic model but a reference pattern GMM (Gaussian Mixture Model) of the speech, the “method for speech signal estimation by GMM” has been proposed in Non-Patent Document 4.
Referring to FIG. 12, this technique uses an input signal acquisition unit 1 for acquiring an input signal X, a unit 2 for calculating the noise mean spectrum, and a reference pattern 4 of the speech, learned in advance in a noise-free environment. The technique also uses a noise adapted pattern formulating unit 9 for formulating a noise adapted pattern 10, and a unit 11 for calculating an expected value of the amount of movement of the mean vectors between the noise adapted pattern and the reference pattern. The technique also uses a calculation unit 7a for calculating the estimate speech S.
The system, configured as described above, has the following merit.
That is, the system is able to perform speech recognition with high stability because the operation of subtracting the noise component, which was the source of the problems in the above-described signal processing technique, is replaced by the operation of finding the expected value of the variance G between the reference pattern and the noise adapted pattern.
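The estimation described above may be sketched as follows. The sketch assumes log-spectral features, a diagonal-covariance GMM, and variances that are reused unchanged for the noise adapted pattern; these simplifications, the function names and all numerical values are assumptions introduced here and are not taken from Non-Patent Document 4.

```python
import math


def adapt_means(clean_means, noise_log):
    """Noise adapted pattern: add the noise to each clean mean in the linear domain."""
    return [[math.log(math.exp(m) + math.exp(n)) for m, n in zip(mean, noise_log)]
            for mean in clean_means]


def posteriors(x, weights, means, variances):
    """P(k | x) for a diagonal-covariance GMM, computed in the log domain."""
    log_p = []
    for w, mean, var in zip(weights, means, variances):
        ll = math.log(w)
        for x_d, m_d, v_d in zip(x, mean, var):
            ll += -0.5 * (math.log(2.0 * math.pi * v_d) + (x_d - m_d) ** 2 / v_d)
        log_p.append(ll)
    top = max(log_p)
    unnorm = [math.exp(v - top) for v in log_p]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def estimate_speech(x, noise_log, weights, clean_means, variances):
    """Subtract the expected movement of the mean vectors from the input."""
    noisy_means = adapt_means(clean_means, noise_log)
    post = posteriors(x, weights, noisy_means, variances)
    estimate = []
    for d, x_d in enumerate(x):
        shift = sum(p * (noisy[d] - clean[d])
                    for p, noisy, clean in zip(post, noisy_means, clean_means))
        estimate.append(x_d - shift)
    return estimate


# Made-up two-component reference pattern over two log-spectral bins.
weights = [0.5, 0.5]
clean_means = [[5.0, 5.0], [1.0, 1.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
noise_log = [0.0, 0.0]
x = [5.0, 5.0]
s_hat = estimate_speech(x, noise_log, weights, clean_means, variances)
```

Because the correction is a weighted average of the mean movements over all components, it always stays within the range of those movements, unlike a direct subtraction of the noise mean spectrum.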
Similarly to the PMC method, the system having the above configuration suffers from the following problems.
The computation for formulating the noise adapted pattern is extremely costly.
It is not easy to use the system in combination with other techniques, such as the technique of updating the noise mean spectrum.
[Patent Document 1] JP Patent Kohyo Publication No. JP-P2004-520616A
[Non-Patent Document 1] Hiroshi Matsumoto, “Speech Recognition Techniques for Noisy Environments”, Information Science Technological Forum FIT2003, Sep. 10, 2003
[Non-Patent Document 2] Y. Ephraim and D. Malah, “Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator”, IEEE Trans. on ASSP-32, No. 6, pp. 1109-1121, December 1984
[Non-Patent Document 3] M. J. F. Gales and S. J. Young, “Robust Continuous Speech Recognition Using Parallel Model Combination”, IEEE Trans. SAP-4, No. 5, pp. 352-359, September 1996
[Non-Patent Document 4] J. C. Segura, A. de la Torre, M. C. Benitez and A. M. Peinado, “Model-Based Compensation of the Additive Noise for Continuous Speech Recognition Experiments Using AURORA II Database and Tasks”, EuroSpeech '01, Vol. 1, pp. 221-224, 2001
[Non-Patent Document 5] Rainer Martin, “Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics”, IEEE Trans. on Speech and Audio Processing, Vol. 9, No. 5, July 2001
[Non-Patent Document 6] ETSI ES 202 050 V1.1.1, “Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Advanced front-end feature extraction algorithm; Compression algorithms”, 2002
[Non-Patent Document 7] Guorong Xuan, Wei Zhang and Peiqi Chai, “EM Algorithms of Gaussian Mixture Model and Hidden Markov Model”, IEEE International Conference on Image Processing ICIP 2001, vol. 1, pp. 145-148, October 2001