The present invention relates generally to signal processing, and more specifically to a method and apparatus for noise suppression in a small array microphone system for use with a speech recognition engine.
In recent years, speech control, speech input and voice activation applications have become increasingly popular in many areas, such as hands-free communication systems, remote controllers, automobile navigation systems and telephone server services. However, current speech recognition technology does not work well in real-world environments, where noise and interference degrade the performance of the speech recognition engine. To address this problem, conventional art uses front-end noise suppression processing to enhance the speech signal before inputting it into a speech recognition system. Because one-microphone solutions cannot effectively deal with noise, particularly non-stationary noise such as other voices and music, array microphones are used in the conventional art to improve the performance of speech recognition systems in adverse environments. Array microphones utilize not only temporal and spectral information, but also spatial information, to suppress noise and interference, thereby producing cleaner enhanced speech and providing more accurate voice activity detection (VAD) for a speech recognition engine.
FIG. 1 shows a diagram of a conventional array microphone system 100 for a speech recognition application. System 100 includes multiple (N) microphones 112a through 112n, which are placed at different positions. The spacing between microphones 112 is required to be at least a minimum distance of D for proper operation. A preferred value for D is half of the wavelength of the band of interest for the signal. Microphones 112a through 112n receive desired speech activity, local ambient noise, and unwanted interference. The N received signals from microphones 112a through 112n are amplified by N amplifiers (AMP) 114a through 114n, respectively. The N amplified signals are then digitized by N analog-to-digital converters (A/Ds or ADCs) 116a through 116n to provide N digitized signals s1(n) through sN(n).
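By way of illustration only, the half-wavelength relationship for the minimum spacing D may be sketched as follows. The speed of sound (343 m/s, roughly dry air at 20 degrees C), the example frequencies, and the function name `half_wavelength_spacing` are assumptions introduced for this sketch and are not taken from system 100.

```python
# Illustrative sketch: spacing D equal to half the wavelength at a chosen frequency.
# The speed of sound and the example frequencies are assumed values.

SPEED_OF_SOUND_M_PER_S = 343.0  # assumed: dry air at about 20 degrees C


def half_wavelength_spacing(frequency_hz, speed=SPEED_OF_SOUND_M_PER_S):
    """Return half the acoustic wavelength, in meters, at frequency_hz."""
    if frequency_hz <= 0:
        raise ValueError("frequency_hz must be positive")
    wavelength_m = speed / frequency_hz
    return wavelength_m / 2.0


# Hypothetical frequencies within a speech band, chosen only for illustration.
for f_hz in (500.0, 1000.0, 4000.0):
    d_cm = half_wavelength_spacing(f_hz) * 100.0
    print(f"{f_hz:.0f} Hz: D is about {d_cm:.1f} cm")
```

Under these assumptions the spacing grows as the frequency of interest falls, which suggests why a minimum distance of D can be difficult to accommodate in a small device.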
The N received signals, provided by N microphones 112a through 112n placed at different positions, carry information about the differences in the microphone positions. The N digitized signals s1(n) through sN(n) are provided to a beamformer 118 and used to obtain a single-channel enhanced speech signal for VAD. The enhanced single-channel VAD signal is supplied to both the adaptive noise suppression filter 120 and the speech recognition engine 122. The adaptive noise suppression filter 120 processes the multi-channel signals s1(n) through sN(n) to reduce the noise component, while boosting the signal-to-noise ratio (SNR) of the desired speech component. The beamforming performed by beamformer 118 is used to suppress noise and interference outside of the beam and to enhance the desired speech within the beam. Beamformer 118 may be a fixed beamformer (e.g., a delay-and-sum beamformer) or an adaptive beamformer (e.g., an adaptive sidelobe cancellation beamformer). These various types of beamformers are well known in the art.
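As a non-limiting illustration of the fixed (delay-and-sum) case, the following minimal sketch aligns each channel by an integer sample delay and averages the aligned channels. The function name `delay_and_sum`, the integer-delay simplification, and the toy three-channel pulse data are assumptions made for this sketch; it is not the implementation of beamformer 118.

```python
# Illustrative sketch of a delay-and-sum beamformer with integer sample delays.
# Names and data are hypothetical; practical designs often use fractional delays.


def delay_and_sum(channels, delays):
    """Average the channels after advancing each one by its delay in samples.

    channels: list of equal-length sample lists, one per microphone.
    delays:   per-channel integer lag (in samples) relative to the reference.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay is required per channel")
    n_samples = len(channels[0])
    output = [0.0] * n_samples
    for signal, delay in zip(channels, delays):
        for n in range(n_samples):
            k = n + delay  # sample index that lines up with output time n
            if 0 <= k < n_samples:
                output[n] += signal[k]
    return [value / len(channels) for value in output]


# Toy data: one pulse reaching three microphones one sample apart.
s1 = [0, 0, 1, 0, 0, 0]
s2 = [0, 0, 0, 1, 0, 0]
s3 = [0, 0, 0, 0, 1, 0]

aligned = delay_and_sum([s1, s2, s3], [0, 1, 2])    # steered toward the source
unsteered = delay_and_sum([s1, s2, s3], [0, 0, 0])  # no steering applied

print("aligned:  ", aligned)
print("unsteered:", unsteered)
```

With the steering delays matched to the arrival times, the three pulses add coherently into a single full-amplitude pulse, whereas the unsteered average smears the pulse over three samples at one third of the amplitude. This is the sense in which signals within the beam are enhanced relative to signals arriving from other directions.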
The conventional array microphone system 100 for a speech recognition engine is associated with several limitations that curtail its use and/or effectiveness: (1) it does not provide VAD information for in-beam and out-of-beam signals, (2) it requires a minimum distance of D for the spacing between microphones, (3) it does not have a noise suppression control unit to control noise suppression for different situations and based on noise source positions, and (4) it is only marginally effective for diffused noise.
Thus, techniques that can more effectively cancel noise for speech recognition systems are highly desirable.