Robust, low-power speech/non-speech detection performed on the fly provides important information for further processing of an input audio signal. As the name suggests, speech/non-speech detection categorizes received audio input as speech or non-speech. Applications for such technology include speech detection for always-listening devices, accuracy improvements for audio preprocessing, beamforming, and text-independent speaker identification. For example, text-independent speaker identification (SID) systems have improved accuracy when analysis is based only on real speech signals, with silence and noise segments removed. Furthermore, for text-dependent SID, speech detection may be performed by wake on voice in low-power systems.
Current speech/non-speech detection may use sample-based voice activity detection, which relies on audio signal characteristics such as the short-term energy of the signal and zero-crossing rates. However, such detection systems are not accurate and have high false positive and false negative rates. Other techniques include frequency-based voice activity detection, which provides frequency-domain analysis (e.g., after application of a fast Fourier transform) of energy in certain frequency bands. However, such techniques are similarly limited by low accuracy.
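For illustration only, the conventional features described above can be sketched as follows. This is a minimal sketch of the prior approaches, not of any improved technique; the function names, the frame length, and the threshold values are hypothetical, and a naive discrete Fourier transform stands in for the fast Fourier transform a real system would likely use.

```python
import math


def short_term_energy(frame):
    """Mean squared amplitude of one frame of audio samples."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def band_energy(frame, sample_rate, low_hz, high_hz):
    """Energy of the DFT bins whose frequency lies in [low_hz, high_hz].

    A naive O(n^2) DFT keeps the sketch dependency-free; it is an
    assumption made for illustration, not a recommended implementation.
    """
    n = len(frame)
    total = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if low_hz <= freq <= high_hz:
            re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
            im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
            total += (re * re + im * im) / n
    return total


def is_speech_sample_based(frame, energy_threshold=0.01, zcr_threshold=0.25):
    """Sample-based decision: enough energy and few zero crossings.

    Both thresholds are hypothetical values chosen for this example.
    """
    return (
        short_term_energy(frame) > energy_threshold
        and zero_crossing_rate(frame) < zcr_threshold
    )
```

Fixed thresholds of this kind are one plausible source of the false positives and false negatives noted above: a loud low-frequency noise can pass both tests, while quiet speech can fail the energy test.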
As such, existing techniques do not provide high-quality, low-resource speech/non-speech classification. Such problems may become critical as the desire to implement wake on voice, always-listening devices, and the like becomes more widespread.