In many important applications (e.g., operation of mobile phones or other devices to execute voice commands uttered by a headset user), it is useful to be able to reliably detect the presence or absence of vocal utterances (“own voice” content) of a headset user in the presence of background noise (e.g., to cause a speech recognition engine to start working only if and when the user's own voice is detected). In many important applications, it is also desirable to perform noise reduction on captured own voice content to reduce background noise captured with the own voice content, for example, to improve SNR (signal to noise ratio) and quality of a headset user's own voice signal. For example, such noise reduction may be employed to improve the performance of a speech recognition system in processing captured own voice content or to improve the quality of captured (and typically also transmitted) speech content.
Increasingly, mobile devices such as smart phones, laptops and the like are employing speech recognition engines. Similarly, traditional electronic devices such as household appliances, television remotes, and even automobile control interfaces are employing speech recognition engines. Further, the so-called “Internet of Things” (IoT) promises to create an opportunity to employ speech recognition engines in just about all traditional electronic devices as well as various wired/wireless sensor arrays. As such, there is a need to be able to reliably detect the presence/absence of the user's own voice among background noise, so that a speech recognition engine is employed only if the user's own voice is detected. It is also desirable to suppress background sounds in a speech recognition engine to improve the signal-to-noise ratio (SNR) and the quality of an own voice signal, so as to improve the performance of a speech recognition system or the quality of the captured/transmitted speech.
Some conventional own voice extraction headsets use near field microphone array techniques and microphones on the outside of a headset (for example, on the outside of an earplug) to perform noise cancellation. However, this requires a microphone to be placed near the user's mouth (e.g., a boom microphone). This makes the headset design bulky and prone to physical damage.
Some other conventional methods and systems use beamforming techniques, where multiple microphones on the outside of a headset form a beam pattern pointing towards the mouth of the user. However, due to the limited space on a headset (e.g., headphones), only a small microphone array is allowed, and this limits the directivity of the beam pattern and thus the performance of the noise rejection.
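By way of illustration only, the effect of array size on directivity can be sketched numerically. The following is a minimal sketch, not a description of any particular conventional system: it assumes an idealized uniform linear delay-and-sum array, far-field plane waves (a simplification, since the mouth of a headset user is close to the array), and a speed of sound of 343 m/s. The function names, microphone spacings, and test frequency are hypothetical values chosen for the example.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def delay_and_sum_response(num_mics, spacing_m, freq_hz, arrival_deg, steer_deg=0.0):
    """Normalized magnitude response (0..1) of a uniform linear
    delay-and-sum array to a far-field plane wave.

    Angles are measured from the array axis; steer_deg=0 means the
    beam is steered along the axis (endfire), e.g. towards the mouth.
    """
    wavenumber = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND
    # Residual phase per unit distance after the steering delays are applied.
    delta = math.cos(math.radians(arrival_deg)) - math.cos(math.radians(steer_deg))
    total = sum(
        cmath.exp(1j * wavenumber * n * spacing_m * delta)
        for n in range(num_mics)
    )
    return abs(total) / num_mics


def rejection_db(num_mics, spacing_m, freq_hz, arrival_deg):
    """Attenuation in dB (negative values mean attenuation) of a wave
    arriving from arrival_deg, relative to the steered direction."""
    response = delay_and_sum_response(num_mics, spacing_m, freq_hz, arrival_deg)
    return 20.0 * math.log10(max(response, 1e-12))  # guard against log(0)


# Hypothetical comparison at 500 Hz for noise arriving from behind (180 degrees):
# a headset-sized pair (2 cm apart) versus a much larger pair (20 cm apart).
small_array = rejection_db(num_mics=2, spacing_m=0.02, freq_hz=500.0, arrival_deg=180.0)
large_array = rejection_db(num_mics=2, spacing_m=0.20, freq_hz=500.0, arrival_deg=180.0)
print(f"2 cm spacing:  {small_array:.2f} dB")
print(f"20 cm spacing: {large_array:.2f} dB")
```

Under these assumptions, the 2 cm pair attenuates the rear-arriving noise by well under 1 dB at 500 Hz, whereas the 20 cm pair attenuates it by more than 10 dB, which is consistent with the observation that a small array yields limited directivity at speech frequencies.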
Other conventional methods and systems employ a headset microphone array to capture own voice content, but process the output signals of the array in a conventional manner subject to limitations and disadvantages. For example, U.S. Pat. No. 7,773,759, the content of which is incorporated herein by reference in its entirety, describes such a method and system which employs two microphones on a headset to capture own voice content. The method described in this reference employs an internal microphone (in a chamber formed at least in part by the user's ear) and an external microphone to capture the own voice content, and employs the output of the external microphone (indicative of ambient noise as well as own voice content) to compensate for high frequency loss in the own voice content captured by the internal microphone. However, this technique undesirably requires a large gain boost to compensate for the loss at high frequencies of the own voice content captured by the internal microphone, causing significant noise amplification. Also, the technique undesirably requires performance of noise reduction on the external mic signal before it is applied to perform equalization on the internal mic signal, since the external mic signal itself is noisy. Further, the simple, suppression-based noise reduction employed is only suitable for reducing stationary background noise (which varies slowly or not at all in comparison with the own voice signal), and not other noise (e.g., noise due to a competing talker).
Accordingly, there is a need for methods and systems to improve the processing of outputs from multiple microphones disposed in a headset (e.g., headphones) to improve own voice extraction (in the presence of ambient noise) as well as to perform own voice detection.