Because noise markedly degrades the performance of a speech recognition system, a method for providing noise resistance is needed when the system is actually operated. Noise degrades performance because the input signal used in actual operation does not match the voice data used when the acoustic model was trained. Methods for providing noise resistance in speech recognition by reducing this mismatch fall roughly into two classes. One brings the distribution produced by the input signal closer to the acoustic model by suppressing or removing the noise components included in the input signal. Hereafter, this method is referred to as a noise suppression method. The other brings the acoustic model closer to the distribution produced by the input signal by adapting the acoustic model to the same noise environment as that of the input signal. Hereafter, this method is referred to as an acoustic model adaptation method.
A noise suppression device described in Patent Document 1 comprises a spectrum transformation means, an S/N estimation means, a suppression coefficient data table, a suppression amount estimation means and a noise suppression means. The noise suppression device operates as follows. The spectrum transformation means transforms an input voice signal including noise from the time domain to the frequency domain. On the basis of the output of the spectrum transformation means, the S/N estimation means estimates the S/N ratio (signal-to-noise ratio) of the input voice signal. The suppression coefficient data table stores S/N ratio values, frequency components and predetermined values of a suppression coefficient α in association with one another. From the suppression coefficient data table, the suppression amount estimation means extracts the value of the suppression coefficient α corresponding to the S/N ratio estimated by the S/N estimation means. Then, on the basis of the extracted value of the suppression coefficient α, the noise suppression means suppresses the noise component included in the output of the spectrum transformation means.
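The table-driven flow described above can be illustrated with a minimal sketch. It assumes the signal has already been transformed into a per-bin power spectrum and that a per-bin noise power estimate is available. The table values, the two-band frequency split, the function names and the subtraction rule with a spectral floor are all illustrative assumptions; Patent Document 1 as summarized here does not specify them.

```python
import math

# Hypothetical table rows: (minimum S/N in dB, (alpha for low band, alpha for high band)).
# The values are illustrative only and are not taken from Patent Document 1.
SUPPRESSION_TABLE = [
    (20.0, (1.0, 1.0)),
    (10.0, (1.5, 1.2)),
    (0.0, (2.0, 1.5)),
    (-math.inf, (4.0, 3.0)),
]


def estimate_snr_db(power, noise_power):
    """Estimate the S/N ratio in dB from per-bin signal and noise power."""
    total = sum(power)
    noise = max(sum(noise_power), 1e-12)
    clean = max(total - noise, 1e-12)  # crude estimate of the speech power
    return 10.0 * math.log10(clean / noise)


def lookup_alphas(snr_db, table=SUPPRESSION_TABLE):
    """Return the per-band suppression coefficients for the estimated S/N."""
    for min_snr, alphas in table:
        if snr_db >= min_snr:
            return alphas
    return table[-1][1]


def suppress_noise(power, noise_power, floor=0.01):
    """Subtract alpha-scaled noise power from each bin (assumed suppression rule)."""
    alphas = lookup_alphas(estimate_snr_db(power, noise_power))
    half = len(power) // 2  # bins below `half` form the low band
    out = []
    for k, (p, n) in enumerate(zip(power, noise_power)):
        alpha = alphas[0] if k < half else alphas[1]
        out.append(max(p - alpha * n, floor * p))  # floor avoids negative power
    return out
```

For example, `suppress_noise([10.0] * 4, [1.0] * 4)` estimates an S/N ratio of about 9.5 dB, selects the row for 0 dB and above, and returns `[8.0, 8.0, 8.5, 8.5]`.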
A speech recognition device described in Patent Document 2 applies strong noise suppression, with a large suppression amount, to an input voice signal and detects a voice interval and a noise interval from the strongly suppressed signal. The speech recognition device also applies weak noise suppression, with a small suppression amount, to the input signal and generates a noise model from the part of the weakly suppressed signal specified by the above-mentioned noise interval. The speech recognition device synthesizes this noise model with a clean voice model. Using the synthesized model, the speech recognition device recognizes the voice in the part of the weakly suppressed signal specified by the above-mentioned voice interval.
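The two-pass structure of Patent Document 2 can be sketched as follows. The sketch works on one power value per frame, detects voice by a simple energy threshold, and reduces the noise model to a mean residual power. These simplifications, the function names and all numeric defaults are assumptions made for illustration; model synthesis and recognition are omitted.

```python
def suppress(frame_powers, noise_power, alpha):
    """Subtract alpha-scaled noise power from each frame, clipping at zero."""
    return [max(p - alpha * noise_power, 0.0) for p in frame_powers]


def two_pass(frame_powers, noise_power,
             strong_alpha=3.0, weak_alpha=1.0, threshold=1.0):
    """Detect intervals on a strongly suppressed signal, then model and
    recognize on a weakly suppressed one (hypothetical parameter values)."""
    # Pass 1: a large suppression amount makes voice frames stand out.
    strong = suppress(frame_powers, noise_power, strong_alpha)
    is_voice = [p > threshold for p in strong]

    # Pass 2: a small suppression amount distorts the speech less.
    weak = suppress(frame_powers, noise_power, weak_alpha)
    noise_frames = [p for p, v in zip(weak, is_voice) if not v]
    voice_frames = [p for p, v in zip(weak, is_voice) if v]

    # Stand-in for the noise model: mean residual power over noise frames.
    noise_mean = sum(noise_frames) / len(noise_frames) if noise_frames else 0.0
    return is_voice, voice_frames, noise_mean
```

The point of the design is that interval detection and recognition place different demands on the signal: detection tolerates speech distortion but not residual noise, whereas recognition is the reverse, so each pass uses its own suppression amount.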
A speech recognition device described in Patent Document 3 suppresses an echo signal included in an input signal on the basis of a signal supplied to a speaker and further suppresses ambient background noise in the input signal. Then, on the basis of the noise-suppressed signal, the speech recognition device determines a voice interval and a noise interval. On the basis of the signal determined to be a noise interval, the speech recognition device learns a noise model and, by synthesizing the noise model with a clean voice model, generates a noise-superposed voice model. On the basis of the above-mentioned signal determined to be a voice interval and the noise-superposed voice model, the speech recognition device recognizes the voice.
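The summary above does not state how the noise model and the clean voice model are synthesized. One common choice, shown here purely as an assumption, is the log-add approximation used in parallel model combination: mean vectors in the log-spectral domain are converted to the linear power domain, added, and converted back. The function name is hypothetical, and variances are ignored in this sketch.

```python
import math


def synthesize_log_means(clean_log_means, noise_log_means):
    """Combine clean-speech and noise means given as log power per channel.

    Log-add approximation: speech and noise are assumed additive in the
    linear power domain, so the combined mean is log(exp(c) + exp(n)).
    """
    return [
        math.log(math.exp(c) + math.exp(n))
        for c, n in zip(clean_log_means, noise_log_means)
    ]
```

With this rule, a channel in which the noise is much weaker than the speech keeps a mean close to the clean value, while a channel dominated by noise moves toward the noise mean.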
A speech recognition device described in Patent Document 4 stores waveform signal data on a plurality of types of ambient noise for training. From the waveform signal data on the ambient noise, the speech recognition device generates a Gaussian mixture model having a plurality of mixture components in a single state so as to maximize the output likelihood. Then, from a predetermined noise-free Hidden Markov Model (HMM) and the above-mentioned Gaussian mixture model, the speech recognition device generates an acoustic model. This acoustic model satisfies the following conditions. First, for every combination of the individual states, the acoustic model includes a Gaussian mixture distribution for each state, represented as a linear combination of the individual Gaussian distributions weighted by respective predetermined weighting coefficients. Second, the acoustic model is generated on the basis of a Hidden Markov Model in which the mixture weights of the above-mentioned Gaussian mixture model are adapted by the use of environmental voice data at the time of speech recognition.
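Adapting only the mixture weights to environmental data, while the Gaussian components stay fixed, can be sketched with a weights-only expectation-maximization update. The one-dimensional Gaussians, the function names and the iteration count are assumptions for illustration; the actual adaptation procedure of Patent Document 4 is not given in this summary.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def adapt_weights(data, means, variances, weights, iterations=10):
    """Re-estimate mixture weights by EM, keeping means and variances fixed."""
    for _ in range(iterations):
        counts = [0.0] * len(weights)
        for x in data:
            likes = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(likes)
            if total == 0.0:
                continue  # sample is not explained by any component
            for k, like in enumerate(likes):
                counts[k] += like / total  # posterior responsibility
        norm = sum(counts)
        if norm == 0.0:
            break
        weights = [c / norm for c in counts]
    return weights
```

If each component is taken to represent one type of ambient noise, the adapted weights indicate how strongly each noise type is present in the current environment, and only a few parameters need to be estimated at recognition time.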
Patent Document 5 and Non-patent Document 1, which are used in the section “EXEMPLARY EMBODIMENT OF THE INVENTION”, are also listed below.
[Patent Document 1] Japanese Patent Application Laid-Open No. 2000-330597
[Patent Document 2] Japanese Patent Application Laid-Open No. 2005-321539
[Patent Document 3] Japanese Patent Application Laid-Open No. 2006-3617
[Patent Document 4] Japanese Patent Application Laid-Open No. 2003-177781
[Patent Document 5] Japanese Patent Publication No. 4282227
[Non-patent Document 1] Hiroshi Matsumoto, “Speech Recognition Techniques for Noisy Environments”, The Second Forum on Information Technology (FIT2003), pp. 1-4, September 2003.