The present invention relates to a speech recognition system and, in particular, to a method for canceling noise by using a microphone array.
In recent years, as the performance of speech recognition programs has improved, speech recognition has come into use in many fields. However, to realize highly accurate speech recognition without requiring the speaker to wear a headset microphone or the like, i.e., in an environment where the microphone is at a distance from the speaker, cancellation of background noise becomes an important issue. Noise cancellation using a microphone array has been considered one of the most effective means.
FIG. 18 schematically shows a configuration of a conventional speech recognition system using a microphone array.
Referring to FIG. 18, the speech recognition system using the microphone array is provided with a voice input part 181, a sound source localization part 182, a noise suppression part 183, and a speech recognition part 184.
The voice input part 181 is a microphone array constituted of a plurality of microphones.
The sound source localization part 182 estimates a sound source direction (location) based on the input to the voice input part 181. The most commonly employed approach plots the output power of a delay and sum microphone array on the vertical axis against the direction in which the directional characteristics are set on the horizontal axis, and takes the maximum peak of this per-angle power distribution as the direction of arrival of the sound. To obtain a sharper peak, a virtual power called MUSIC power may be plotted on the vertical axis instead. When there are three or more microphones, not only the sound source direction but also the distance can be estimated.
The noise suppression part 183 suppresses noise in the input sound based on the sound source direction (location) estimated by the sound source localization part 182, thereby emphasizing the voice. In many cases, one of the following methods is used for suppressing noise.
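By way of illustration only, the peak search over the per-angle power distribution might be sketched as follows. The sketch assumes a linear array, a far-field source, narrowband phasors at a few frequencies, and a sound speed of 343 m/s; all function names and parameter values are hypothetical and are not taken from the conventional system of FIG. 18.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s (assumed)


def mic_phasors(mic_positions, source_deg, freq_hz):
    """Phasor observed at each microphone for a far-field plane wave."""
    s = math.sin(math.radians(source_deg))
    return [cmath.exp(-2j * math.pi * freq_hz * x * s / SPEED_OF_SOUND)
            for x in mic_positions]


def steered_power(phasors, mic_positions, steer_deg, freq_hz):
    """Output power of the delay and sum array steered to steer_deg."""
    s = math.sin(math.radians(steer_deg))
    total = sum(p * cmath.exp(2j * math.pi * freq_hz * x * s / SPEED_OF_SOUND)
                for p, x in zip(phasors, mic_positions))
    return abs(total) ** 2


def estimate_direction(mic_positions, observations, scan_degs):
    """Return the scan angle at which the summed steered power peaks.

    observations maps a frequency in Hz to the phasors seen at the microphones.
    """
    def total_power(deg):
        return sum(steered_power(ph, mic_positions, deg, f)
                   for f, ph in observations.items())
    return max(scan_degs, key=total_power)


# Hypothetical example: four microphones 5 cm apart, source at 30 degrees.
mics = [0.0, 0.05, 0.10, 0.15]
obs = {f: mic_phasors(mics, 30.0, f) for f in (500.0, 1000.0, 2000.0, 3000.0)}
estimated = estimate_direction(mics, obs, range(-90, 91))
```

The scan grid here is one degree wide, so the estimate is only as fine as that grid.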
[Delay and Sum]
In this method, the inputs from the individual microphones of the microphone array are delayed by respective delay amounts and then summed, so that only voices from a target direction are set in-phase and reinforced. The delay amounts determine the direction in which the directional characteristics are set. A voice from any other direction is relatively weakened because of the phase shift.
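A minimal time-domain sketch of this operation is given below, assuming integer-sample delays and equal-length channels; the function names and the sample values are hypothetical.

```python
def delay_and_sum(channels, delays):
    """Delay channel m by delays[m] samples, then average across channels.

    Samples shifted in from before the start of a channel are treated as zero.
    """
    length = len(channels[0])
    output = []
    for n in range(length):
        total = 0.0
        for samples, delay in zip(channels, delays):
            k = n - delay
            if 0 <= k < length:
                total += samples[k]
        output.append(total / len(channels))
    return output


def energy(signal):
    """Sum of squared samples."""
    return sum(v * v for v in signal)


# Hypothetical example: the target reaches microphone 1 two samples later
# than microphone 0, so microphone 0 is delayed by two samples to align them.
ch0 = [1.0, -2.0, 3.0, 0.0, 0.0, 0.0, 0.0]
ch1 = [0.0, 0.0, 1.0, -2.0, 3.0, 0.0, 0.0]
aligned = delay_and_sum([ch0, ch1], [2, 0])    # steered at the target
unaligned = delay_and_sum([ch0, ch1], [0, 0])  # steered elsewhere
```

With the matching delays the target adds in-phase and keeps its full amplitude, whereas with mismatched delays the output energy drops.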
[Griffiths-Jim Method]
This is a method for subtracting “a signal in which a noise component is the main component” from the output of the delay and sum. When there are two microphones, that signal is generated as follows. First, of the pair of signals set in-phase with respect to the target sound source, the phase of one is inverted and the result is added to the other, whereby the target voice component is canceled. Then, in a noise section, an adaptive filter is designed so as to minimize the noise.
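The two-microphone case described above might be sketched as follows. The sketch assumes that the two channels have already been set in-phase for the target, and it uses a normalized LMS update as the adaptive filter; that choice, the tap count, the step size, and all names are assumptions made for illustration.

```python
import math


def griffiths_jim(x1, x2, adapt, num_taps=4, step=0.5, eps=1e-8):
    """Two-microphone Griffiths-Jim style canceller (channels pre-aligned).

    adapt holds one flag per sample; the filter is updated only where the
    flag is True (a noise section) and is frozen elsewhere.
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps      # most recent noise-reference samples
    output = []
    for a, b, in_noise_section in zip(x1, x2, adapt):
        main = 0.5 * (a + b)        # delay and sum output
        reference = a - b           # inverted and added: target cancels
        history = [reference] + history[:-1]
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        out = main - noise_estimate
        if in_noise_section:        # normalized LMS step (assumed update rule)
            norm = eps + sum(h * h for h in history)
            weights = [w + step * out * h / norm
                       for w, h in zip(weights, history)]
        output.append(out)
    return output


def demo_residual(total=3000, split=2000, omega=0.9):
    """Adapt on noise only, then measure the worst error once the target starts."""
    noise1 = [math.sin(omega * n) for n in range(total)]
    noise2 = [math.sin(omega * (n - 1)) for n in range(total)]  # one sample later
    target = [0.0 if n < split else math.sin(0.05 * (n - split))
              for n in range(total)]
    x1 = [t + v for t, v in zip(target, noise1)]
    x2 = [t + v for t, v in zip(target, noise2)]
    adapt = [n < split for n in range(total)]
    out = griffiths_jim(x1, x2, adapt)
    return max(abs(o - t) for o, t in zip(out[split:], target[split:]))
```

In this hypothetical demonstration the noise is a single sinusoid, which a short filter can match exactly; broadband noise would generally leave a larger residual.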
[Method Using Delay and Sum in Combination with 2-Channel Spectral Subtraction]
This is a method for subtracting the output of a sub-beam former, which outputs mainly a noise component, from the output of a main-beam former, which outputs mainly the voice from the target sound source (spectral subtraction) (e.g., see Nonpatent Documents 1 and 2).
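As a rough sketch, the subtraction can be carried out per frequency bin on the power spectra of the two beam former outputs. The subtraction weight and the spectral floor below are assumptions commonly used with spectral subtraction, not values taken from Nonpatent Documents 1 and 2, and the function name is hypothetical.

```python
def spectral_subtract(main_power, sub_power, weight=1.0, floor=0.1):
    """Subtract the sub-beam (noise) power from the main-beam power per bin.

    A bin that would fall below floor * main power is clamped to that floor,
    so that the result never becomes negative (assumed flooring rule).
    """
    return [max(pm - weight * ps, floor * pm)
            for pm, ps in zip(main_power, sub_power)]


# Hypothetical power spectra for three frequency bins.
main_beam = [4.0, 1.0, 9.0]
sub_beam = [1.0, 2.0, 0.0]
cleaned = spectral_subtract(main_beam, sub_beam)
```

In the second bin the noise estimate exceeds the main-beam power, so the floor applies instead of a negative value.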
[Minimum Variance Method]
This is a method for designing a filter so as to form a null in the directional characteristics in the direction of a directional noise source (e.g., see Nonpatent Document 3).
The speech recognition part 184 carries out speech recognition by generating voice features from the signal in which the noise component has been canceled as far as possible by the noise suppression part 183, and by matching the time series of the voice features against patterns based on a feature dictionary, allowing for expansion and contraction along the time axis.
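For reference, the minimum variance criterion is commonly written as a constrained minimization of the output power. The notation below is assumed here for illustration and is not taken from Nonpatent Document 3: w is the filter weight vector, R is the covariance matrix of the microphone inputs, and a(θ0) is the steering vector for the target direction θ0.

```latex
\min_{\mathbf{w}} \; \mathbf{w}^{H}\mathbf{R}\,\mathbf{w}
\quad \text{subject to} \quad
\mathbf{w}^{H}\mathbf{a}(\theta_0) = 1,
\qquad
\mathbf{w}_{\mathrm{MV}}
  = \frac{\mathbf{R}^{-1}\mathbf{a}(\theta_0)}
         {\mathbf{a}(\theta_0)^{H}\mathbf{R}^{-1}\mathbf{a}(\theta_0)}
```

Because the gain toward the target direction is fixed at unity, minimizing the output power pushes the response toward directional noise sources down, which is what forms the null.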
[Nonpatent Document 1]
Nunoda, Nagata, and Abe: “Voice recognition under unsteady noise using two-channel voice detection”, technical research report 2001-25 by Institute of Electronics, Information and Communication Engineers
[Nonpatent Document 2]
Mizumachi and Akagi: “Noise cancellation method by spectral subtraction using microphone pair”, treatise A Vol. J82-A No. 4, pp. 503-512, 1999 by Institute of Electronics, Information and Communication Engineers
[Nonpatent Document 3]
Asano, Hayami, Yamada, and Nakamura: “Application of voice emphasis method using subspace method to voice recognition”, technical research report EA97-17 by Institute of Electronics, Information and Communication Engineers
[Nonpatent Document 4]
Nagata and Abe: “Studies on speaker-tracking 2-channel microphone array”, treatise A Vol. J82-A No. 4, pp. 503-512 by Institute of Electronics, Information and Communication Engineers
As described above, in speech recognition technology, cancellation of background noise is an important task for realizing highly accurate speech recognition in an environment where the microphone is at a distance from the speaker. Estimating the sound source direction with a microphone array and canceling noise accordingly is considered one of the most effective means.
However, enhancing the noise suppression performance of a microphone array generally requires a large number of microphones, which in turn necessitates special hardware for simultaneous multichannel input. On the other hand, if the microphone array is constituted of a small number of microphones (e.g., 2-channel stereo input), the beam of the directional characteristics of the microphone array spreads broadly and cannot be focused sufficiently on the target sound source. Consequently, a large proportion of the surrounding noise leaks in.
Thus, in order to enhance the performance of speech recognition, some processing, such as estimating and subtracting the noise component that leaks in, is necessary. However, the above-described noise suppression methods (delay and sum, minimum variance method, and the like) have no function for estimating and actively subtracting the mixed-in noise component.
The method using the delay and sum in combination with 2-channel spectral subtraction does estimate the noise component for cancellation, and can therefore suppress the background noise to a certain extent. However, since the noise is estimated at “a point,” the accuracy of the estimation has not always been high.
In addition, a small-scale microphone array suffers from an aliasing problem (especially conspicuous with 2-channel stereo input), in which the estimation accuracy of the noise component is reduced at specific frequencies corresponding to the noise source direction.
As measures to suppress the effects of such aliasing, a method of narrowing the spacing between the microphones and a method of arranging the microphones at an inclination are conceivable (e.g., see Nonpatent Document 4).
However, if the microphone spacing is narrowed, the directional characteristics in the low-frequency range may deteriorate, and the accuracy of speaker direction identification may be reduced. Consequently, in a beam former such as 2-channel spectral subtraction, the microphone spacing cannot be narrowed beyond a given level, and there is a limit to the capability of suppressing the effects of aliasing.
In the method of arranging the microphones at an inclination, a sensitivity difference is provided between the two microphones for sound waves arriving from an oblique direction, so that such a sound wave differs in gain balance from a sound wave arriving from the front. However, because the sensitivity difference of ordinary microphones is small, this method also has a limit to its capability of suppressing the effects of aliasing.
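One common reading of this aliasing, given here only as an illustrative sketch, is that at certain frequencies the inter-microphone phase difference of the noise differs from that of the target by a whole number of cycles, so that a beam with a null toward the target also cancels the noise. The code below assumes two microphones, far-field plane waves, and a sound speed of 343 m/s; the function names and the example values are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s (assumed)


def aliasing_frequencies(spacing_m, noise_deg, target_deg=0.0, max_freq_hz=8000.0):
    """Frequencies at which the noise direction aliases onto the target direction."""
    delta = abs(math.sin(math.radians(noise_deg))
                - math.sin(math.radians(target_deg)))
    if delta == 0.0:
        return []                   # noise and target share the same direction
    fundamental = SPEED_OF_SOUND / (spacing_m * delta)
    freqs = []
    k = 1
    while k * fundamental <= max_freq_hz:
        freqs.append(k * fundamental)
        k += 1
    return freqs


def noise_reference_gain(freq_hz, spacing_m, arrival_deg, target_deg=0.0):
    """Normalized gain of a target-nulling difference beam toward arrival_deg.

    0 means the arriving wave is cancelled completely, 1 is the maximum gain.
    """
    delta = (math.sin(math.radians(arrival_deg))
             - math.sin(math.radians(target_deg)))
    return abs(math.sin(math.pi * freq_hz * spacing_m * delta / SPEED_OF_SOUND))


# Hypothetical example: 30 cm spacing, noise from 30 degrees, target in front.
wide = aliasing_frequencies(0.3, 30.0)     # about 2287, 4573 and 6860 Hz
narrow = aliasing_frequencies(0.05, 30.0)  # no aliasing below 8 kHz
```

Under these assumptions the first aliasing frequency is inversely proportional to the microphone spacing, so a narrower spacing moves it out of the band of interest.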