Signal detection in continuous or discrete time is a cornerstone problem in signal processing. One particularly well-studied instance in speech and acoustic processing is voice detection, which requires deciding which of two hypotheses is more likely: one assuming the presence of speech and the other assuming the presence of noise. Furthermore, when multiple people are speaking, it is difficult to determine whether the captured audio signal is from a speaker of interest or from other people. Speech coding, speech/signal processing in noisy conditions, and speech recognition are important applications in which a good voice/signal detection algorithm can substantially increase the performance of the respective system.
Traditionally, voice detection approaches have used energy criteria, such as short-time signal-to-noise ratio (SNR) estimation based on long-term noise estimation, as described for instance in “[22] Srinivasan K, Gersho A (1993) Voice activity detection for cellular networks. In: IEEE Speech Coding Workshop, pp 85-86”; have applied a likelihood ratio test that exploits a statistical model of the signal, as described in “Cho Y, Al-Naimi K, Kondoz A (2001) Improved voice activity detection based on a smoothed statistical likelihood ratio. In: International Conference on Acoustics, Speech and Signal Processing, IEEE, Los Alamitos, Calif., vol 2, pp 737-740”; or have attempted to extract robust features (e.g., the presence of a pitch as described in “[9] ETSI (1999) Digital cellular telecommunication system (phase 2+); voice activity detector VAD for adaptive multi rate (AMR) speech traffic channels; general description. Tech. Rep. V.7.0.0, ETSI”, the formant shape as described in “[15] Hoyt J D, Wechsler H (1994) Detection of human speech in structured noise. In: International Conference on Acoustics, Speech and Signal Processing, IEEE, vol 2, pp 237-240”, or the cepstrum as described in “[13] Haigh J, Mason J (1993) Robust voice activity detection using cepstral features. In: IEEE Region 10 Conference TENCON, IEEE, vol 3, pp 321-324”) and compare them to a speech model. Diffuse, non-stationary noise with a time-varying spectral coherence, together with a superposition of spatially localized but simultaneous sources, makes this problem extremely challenging when using a single sensor (microphone).
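For illustration only, the energy-criterion family of approaches can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any cited reference: the function names, the frame length, the 6 dB threshold, the smoothing factor, and the assumption that the first few frames contain only noise are all hypothetical choices made for the example.

```python
import math
import random


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    n_frames = len(samples) // frame_len
    return [
        sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def energy_vad(samples, frame_len=160, snr_threshold_db=6.0,
               noise_init_frames=5, noise_smoothing=0.95):
    """Flag a frame as speech when its short-time SNR exceeds a threshold.

    The noise floor is a long-term recursive average that is updated only
    on frames classified as noise. All parameter values are illustrative.
    """
    energies = frame_energies(samples, frame_len)
    if not energies:
        return []
    # Assumption: the first few frames contain noise only.
    init = energies[:noise_init_frames]
    noise = max(sum(init) / len(init), 1e-12)
    decisions = []
    for e in energies:
        snr_db = 10.0 * math.log10(max(e, 1e-12) / noise)
        is_speech = snr_db > snr_threshold_db
        if not is_speech:
            # Long-term noise estimate: slow recursive averaging.
            noise = noise_smoothing * noise + (1.0 - noise_smoothing) * e
            noise = max(noise, 1e-12)
        decisions.append(is_speech)
    return decisions


def demo_signal(seed=0):
    """Synthetic signal: noise, then a tone burst in noise, then noise."""
    rng = random.Random(seed)
    fs, frame_len = 8000, 160
    samples = []
    for n in range(30 * frame_len):
        x = rng.gauss(0.0, 0.01)
        if 10 * frame_len <= n < 20 * frame_len:
            # A 200 Hz tone stands in for voiced speech in this toy example.
            x += 0.5 * math.sin(2.0 * math.pi * 200.0 * n / fs)
        samples.append(x)
    return samples


decisions = energy_vad(demo_signal())
```

On this synthetic signal the detector flags the frames containing the tone burst and rejects the noise-only frames. Such a detector relies on the noise being quasi-stationary and on a single dominant source, which is why the conditions described above (diffuse, non-stationary noise and simultaneous localized sources) are difficult for it.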
Not surprisingly, during the last decade researchers have focused on multi-modality sensing to make this problem tractable. Multiple-channel voice detection algorithms take advantage of the extra information provided by additional sensors. For example, in “[21] Rosca J, Balan R, Fan N, Beaugeant C, Gilg V (2002) Multichannel voice detection in adverse environments. In: European Signal Processing Conference”, the mixing model is blindly identified and a signal is estimated with the maximal signal-to-interference ratio (SIR) obtainable through linear filtering. Although the filtered signal contains large artifacts and is unsuitable for signal estimation, it was shown to be ideal for signal detection. Another example is the WITTY (Who is Talking to You) project from Microsoft, as described in “[24] Zhang Z, Liu Z, Sinclair M, Acero A, Deng L, Huang X, Zheng Y (2004) Multisensory microphones for robust speech detection, enhancement and recognition. In: International Conference on Acoustics, Speech and Signal Processing, IEEE, pp 781-784”, which addresses the voice detection problem by means of integrated heterogeneous sensors (e.g., a combination of a close-talk microphone and a bone conductive microphone).
Going further, multi-modal systems using both microphones and cameras have been studied, as described in “[17] Liu P, Wang Z (2004) Voice activity detection using visual information. In: International Conference on Acoustics, Speech and Signal Processing, Montreal, Canada, vol 1, pp 609-612”.
Accordingly, improved and novel methods and systems are required that perform voice (or signal) detection for the source of interest with the reliability of multi-modal approaches such as WITTY, but without additional sensors such as a bone conducting microphone. Such methods and systems would be beneficial but are believed to be currently unavailable.