Speech signal processing is an important issue in present-day communication systems, for example, hands-free telephony, speech recognition, and control by speech dialog systems. When audio signals that may or may not comprise speech in a given time frame are to be processed, the detection of speech is an essential step in the overall signal processing.
In the art of multi-channel speech signal processing, the determination of the signal coherence of two or more signals detected by spaced-apart microphones is commonly used for speech detection. Although speech is a highly time-varying phenomenon, the transfer functions that couple the speech input to the microphone channels are temporally constant, so that the spatial coherence of sound, in particular of a speech signal, detected by microphones located at different positions can, in principle, be determined. In the case of multiple microphones, signal coherence can be determined for each pair of microphones and mapped to a numerical range, for example, from 0 (no coherence) to 1 (maximum coherence). While diffuse background noise exhibits almost no coherence, a speech signal generated by a speaker usually exhibits a coherence close to 1.
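The mapping of pairwise coherence to the range from 0 to 1 can be illustrated by a minimal sketch. It assumes that coherence is computed as a magnitude-squared normalized cross-correlation of complex subband values of two microphones over a number of frames, which is one common definition and not necessarily the one used in any particular system. The function names, the synthetic signals, and the constant transfer-function ratio `h` are hypothetical and chosen for illustration only.

```python
import cmath
import random


def coherence(x1, x2):
    """Magnitude-squared coherence of two complex subband sequences.

    Returns a value between 0 (no coherence) and 1 (maximum coherence).
    """
    cross = sum(a * b.conjugate() for a, b in zip(x1, x2))
    p1 = sum(abs(a) ** 2 for a in x1)
    p2 = sum(abs(b) ** 2 for b in x2)
    if p1 == 0.0 or p2 == 0.0:
        return 0.0
    return abs(cross) ** 2 / (p1 * p2)


def complex_noise(rng, n):
    """n samples of complex Gaussian noise (stand-in for one subband)."""
    return [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]


rng = random.Random(0)
frames = 2000

# Directional source: both microphones see the same signal, related by a
# constant (hypothetical) transfer-function ratio h.
source = complex_noise(rng, frames)
h = 0.7 * cmath.exp(1j * 0.9)
mic1_speech = source
mic2_speech = [h * s for s in source]

# Diffuse noise, idealized here as independent noise at each microphone.
mic1_noise = complex_noise(rng, frames)
mic2_noise = complex_noise(rng, frames)

coh_speech = coherence(mic1_speech, mic2_speech)
coh_noise = coherence(mic1_noise, mic2_noise)
print(f"coherent source: {coh_speech:.3f}, diffuse noise: {coh_noise:.3f}")
```

In this idealized setting the coherent source yields a value of 1 up to rounding, while the independent noise yields a value close to 0. Real diffuse noise and reverberant speech fall between these extremes.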
However, in reverberant environments in which a plurality of sound reflections are present, e.g., in a vehicle cabin, reliable estimation of signal coherence still poses a demanding problem. Due to the acoustic reflections, the transfer functions describing the sound transfer from the mouth of a speaker to the microphones show a large number of nulls, in the vicinity of which the phases of the transfer functions may change discontinuously. A consistent phase relation of the microphone input signals is, however, crucial for the determination of signal coherence. If a null is present within a frequency band, for which a relatively coarse spectral resolution of some 30 to 50 Hz is usually employed, the phase within that band may assume very different values.
Thus, in reality, the phase relation of the wanted signal portions of the microphone signals largely depends on the spectra of the input signals, which is in marked contrast to the technical approach of estimating signal coherence by determining normalized signal correlations independently of the corresponding signal spectra. The usually employed coarse spectral resolution of some 30 to 50 Hz per frequency band therefore often causes relatively small coherence values even if speech is present in the audio signals under consideration. Speech detection then fails, since background noise, e.g., driving noise in an automobile, gives rise to some finite “background coherence” that is comparable to the small coherence values caused by the poor spectral resolution.
In the art, some temporal smoothing of the power of the detected signals by means of constant smoothing parameters is performed in an attempt to improve the reliability of speech detection based on signal coherence. However, conventional smoothing suppresses fast temporal changes of the estimated coherence and thus results in unacceptably long reaction times during speech onsets and offsets, or in misdetection of speech during actual speech pauses.
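The effect of a constant smoothing parameter on reaction time can be illustrated by a minimal sketch. It assumes first-order recursive smoothing, which is a common choice but is not specified in the text above. The smoothing constants 0.95 and 0.5, the detection threshold 0.9, and the idealized step-shaped power trajectory are hypothetical values chosen for illustration only.

```python
def smooth(values, alpha):
    """First-order recursive smoothing with a constant parameter alpha:

    y[n] = alpha * y[n-1] + (1 - alpha) * x[n], with y[-1] = 0.
    """
    out = []
    state = 0.0
    for v in values:
        state = alpha * state + (1.0 - alpha) * v
        out.append(state)
    return out


def frames_to_reach(smoothed, onset, threshold):
    """Number of frames after the onset until the threshold is reached."""
    for i in range(onset, len(smoothed)):
        if smoothed[i] >= threshold:
            return i - onset
    return None


# Idealized power trajectory: silence, then a speech onset at frame 50.
onset = 50
power = [0.0] * onset + [1.0] * 100

slow = smooth(power, alpha=0.95)  # strong smoothing (assumed value)
fast = smooth(power, alpha=0.5)   # weak smoothing (assumed value)

delay_slow = frames_to_reach(slow, onset, threshold=0.9)
delay_fast = frames_to_reach(fast, onset, threshold=0.9)
print(f"frames until detection: strong={delay_slow}, weak={delay_fast}")
```

Strong smoothing reduces the variance of the estimate but delays the response to the onset by many frames, whereas weak smoothing reacts quickly but passes more fluctuation. A constant parameter therefore forces a fixed trade-off between the two.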
Therefore, there is a need for an enhanced estimation of signal coherence, in particular, for the detection of speech in highly time-varying audio signals, that shows fast reaction times and robustness during speech pauses.