With the development of computer technology, the accuracy of speech recognition has been improving rapidly. However, in a car navigation system, a television conference system, a digital signage system, or the like equipped with speech recognition technology, an “out-of-context” error, in which noise is erroneously detected as speech, occurs in a noisy environment. A technique that suppresses the out-of-context error in an environment with much noise is therefore desired.
For example, as a technique for performing highly noise-resistant speech detection independent of the number of phonemes in an audio signal, there is an example that uses an acoustic feature quantity of an input signal. This technique compares the acoustic feature quantity extracted from the input signal with a previously stored acoustic feature quantity of a noise signal, and determines the input signal to be noise if the two feature quantities are close to each other.
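The comparison described above can be illustrated with a minimal sketch. The feature quantity (log energy and zero-crossing rate), the Euclidean distance measure, the threshold value, and the function names `acoustic_feature` and `is_noise` are all assumptions made for illustration and are not taken from the cited publications.

```python
import math

def acoustic_feature(frame):
    """Illustrative feature quantity (assumed): (log energy, zero-crossing rate)."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return (math.log(energy + 1e-12), crossings / (len(frame) - 1))

def is_noise(frame, stored_noise_feature, threshold=0.5):
    """Determine the frame to be noise if its feature quantity is close
    (assumed: Euclidean distance below a threshold) to the stored one."""
    feature = acoustic_feature(frame)
    return math.dist(feature, stored_noise_feature) < threshold

# A previously stored noise feature, taken here from a synthetic noise frame.
noise_frame = [0.01 * (-1) ** n for n in range(160)]
stored = acoustic_feature(noise_frame)

# A louder, slowly varying frame stands in for speech.
tone_frame = [math.sin(2 * math.pi * 5 * n / 160) for n in range(160)]

print(is_noise(noise_frame, stored))
print(is_noise(tone_frame, stored))
```

In this sketch the decision depends entirely on the chosen feature quantity and threshold, both of which would have to be tuned to the actual noise environment.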
According to another technique, sound signals in frame units of sound data are converted into a spectrum, a spectrum envelope is calculated from the spectrum, and a peak detected in the spectrum from which the spectrum envelope has been removed is suppressed. With the removal of the spectrum envelope, a sharp peak with a narrow bandwidth in non-stationary noise, such as electronic sound or siren sound, is detected and suppressed even in an environment in which stationary noise having a gentle peak with a wide bandwidth, such as engine sound or air conditioner sound, is generated. Further, there is an example of determining the arrival direction of sound from audio signals obtained by a plurality of microphones, on the basis of the correlation between the signals from the microphones, and suppressing sounds other than the sound arriving from the direction of a speaking person. Furthermore, there is an example of calculating a noise reduction coefficient on the basis of an audio signal, and reducing noise in the audio signal on the basis of the noise reduction coefficient and the original audio signal. The above-described related-art techniques are disclosed in, for example, Japanese Laid-open Patent Publication Nos. 10-97269, 2008-76676, 2010-124370, and 2007-183306, and Matsuo Naoshi et al., “Speech Input Interface with Microphone Array,” FUJITSU, Vol. 49, No. 1, pages 80 to 84, January 1998.
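The envelope-removal technique mentioned first in the paragraph above can be sketched as follows. This is an illustration under stated assumptions only: the envelope is estimated here with a sliding median of the log spectrum, and the window width, the threshold, and the function names `spectrum_envelope` and `suppress_narrow_peaks` are hypothetical. The cited publications may calculate the envelope and suppress the peak differently.

```python
import math
from statistics import median

def spectrum_envelope(log_spec, half_width=4):
    """Assumed envelope estimate: sliding median of the log spectrum."""
    n = len(log_spec)
    return [
        median(log_spec[max(0, k - half_width):min(n, k + half_width + 1)])
        for k in range(n)
    ]

def suppress_narrow_peaks(magnitudes, threshold=1.0, half_width=4):
    """Remove the envelope, detect bins that stand out sharply in the
    residual, and pull those bins back down to the envelope level.
    Returns (processed magnitudes, indices of suppressed bins)."""
    log_spec = [math.log(m + 1e-12) for m in magnitudes]
    envelope = spectrum_envelope(log_spec, half_width)
    out, peaks = [], []
    for k, (value, env) in enumerate(zip(log_spec, envelope)):
        if value - env > threshold:      # sharp, narrow-band peak
            peaks.append(k)
            out.append(math.exp(env))    # suppress to the envelope level
        else:
            out.append(magnitudes[k])    # leave broad, gentle components
    return out, peaks

# Gently sloping spectrum (stationary noise) with one sharp peak at bin 20.
spectrum = [2.0 - 0.03 * k for k in range(32)]
spectrum[20] *= 50.0
processed, peak_bins = suppress_narrow_peaks(spectrum)
print(peak_bins, processed[20])
```

Because the gentle slope is absorbed into the envelope, only the narrow-band component exceeds the threshold in the residual, which is the property the paragraph attributes to this technique.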