Voice-activity detection (VAD) is used to detect a voice signal in a signal that has unknown characteristics. Numerous VAD devices are known in the art. They tend to follow a common paradigm comprising a pre-processing stage, a feature-extraction stage, a thresholds comparison stage, and an output-decision stage.
The pre-processing stage places the input audio signal into a form that better facilitates feature extraction. The feature-extraction stage differs widely from algorithm to algorithm, but commonly-used features include (1) energy, either full-band, multi-band, low-pass, or high-pass, (2) zero crossings, (3) the frequency-domain shape of the signal, (4) periodicity measures, and (5) statistics of the speech and background noise. The thresholds comparison stage then uses the selected features and various thresholds of their values to determine if speech is present in or absent from the input audio signal. This usually involves use of some “hold-over” algorithm, or “on”-time minimum threshold, to ensure that detection of either presence or absence of speech lasts for at least a minimum period of time and does not oscillate on-and-off.
Some known VAD methods require a measurement of the background noise a-priori in order to set the thresholds for later comparisons. These algorithms fail when the acoustics environment changes over time. Hence, these algorithms are not particularly robust. Other known VAD methods are automatic and do not require a-priori measurement of background noise. These tend to work better in changing acoustic environments. However, they can fail when background noise has a large energy and/or the characteristics of the noise are similar to those of speech. (For example, the G.729 VAD algorithm incorrectly generates “speech detected” output when the input audio signal is a keyboard sound.) Hence, these algorithms are not particularly robust either.