1. Field of the Invention
The invention relates to a method for digital speech signal enhancement using signal processing algorithms and acoustic models for target speakers. The invention further relates to speech enhancement using microphone array signal processing and speaker recognition.
2. Description of the Prior Arts
Speech/voice plays an important role in the interaction between human and human, and human and machine. However, the omnipresent environmental noise and interferences may significantly degrade the quality of captured speech signal by a microphone. Some applications, e.g. the automatic speech recognition (ASR) and speaker verification, are especially vulnerable to such environmental noise and interferences. A hearing impaired human also suffers from the degradation of speech quality. Although a person with normal hearing can tolerate considerable noise and interferences in the captured speech signal, listener fatigue easily arises with exposure to low signal to noise ratio (SNR) speech.
It is not uncommon to find more than one microphones on many devices, e.g. a smartphone, a tablet, or a laptop computer. An array of microphone can be used to boost the speech quality by means of beamforming, blind source separation (BSS), independent component analysis (ICA), and many other proper signal processing algorithms. However, there may be several speech sources in the acoustic environment where the microphone array is deployed, and these signal processing algorithms themselves cannot decide which source signal should be kept and which one should be suppressed along with the noise and interferences. Conventionally, a linear array is used, and sound wave of a desired source is assumed to impinge on the array either from the central direction, or from either end of the array, hence correspondingly, a broadside beamforming or an endfire beamforming is used to enhance the desired speech signal. Such a conventional way, at least to some extent, limits the utility of a microphone array. An alternative choice is to extract a speech signal from the audio mixtures recorded by microphone array that best matches a predefined speaker model or speaker profile. This solution is most attractive when the target speaker is predictable or known in advance. For example, the most likely target speaker of a personal device like a smartphone might be the device owner. Once a speaker profile for a device owner is created, the device can always focus on its owner's voice, and treats other voices as interferences, except when it is explicitly set not to behave in this way.