An important problem in many areas of speech processing is the determination of active speech periods within a given audio signal. Speech can be characterized as a discontinuous signal, since information is carried only when someone is talking. The regions where voice information exists are referred to as voice-active segments, and the pauses between them are called voice-inactive or silence segments. The task of determining which class an audio segment belongs to is generally approached as a statistical hypothesis testing problem, where a decision is made based on an observation vector, commonly referred to as a feature vector. One or many different features may serve as the input to a decision rule that assigns the audio segment to one of the two given classes. It is effectively a binary decision problem in which performance trade-offs are made between maximizing the detection rate of active speech and minimizing the rate at which inactive segments are falsely detected as speech. But generating an accurate indication of the presence of speech, or lack thereof, is generally difficult, especially when the speech signal is corrupted by background noise or unwanted interference.
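The binary decision and its trade-off can be illustrated with a minimal sketch. It assumes a single scalar feature per segment and a fixed threshold as the decision rule; the function names, the feature values, and the ground-truth labels are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def classify(features: List[float], threshold: float) -> List[bool]:
    """Decision rule: a segment is voice-active if its feature exceeds the threshold."""
    return [f > threshold for f in features]


def detection_and_false_alarm(decisions: List[bool],
                              truth: List[bool]) -> Tuple[float, float]:
    """Return (detection rate on active segments, false detection rate on inactive ones)."""
    active = sum(truth)
    inactive = len(truth) - active
    hits = sum(1 for d, t in zip(decisions, truth) if d and t)
    false_alarms = sum(1 for d, t in zip(decisions, truth) if d and not t)
    detection = hits / active if active else 0.0
    false_alarm = false_alarms / inactive if inactive else 0.0
    return detection, false_alarm


# Hypothetical per-segment feature values; one quiet speech segment (0.18)
# overlaps with the noisiest silence segment (0.2).
features = [0.1, 0.2, 0.9, 1.1, 0.18, 0.15, 1.3, 0.05]
truth = [False, False, True, True, True, False, True, False]

for threshold in (0.12, 0.19, 0.5):
    rates = detection_and_false_alarm(classify(features, threshold), truth)
    print(threshold, rates)
```

Because the two classes overlap in this toy data, lowering the threshold catches every active segment at the cost of more false detections, while raising it removes the false detections but misses the quiet speech segment.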
In the art, an algorithm employed to detect the presence or absence of speech is referred to as a voice activity detector (VAD). Many speech-based applications require VAD capability in order to operate properly. For example, in speech coding the purpose is to encode raw audio such that the overall transferred data rate is reduced. Since information is carried only when someone is talking, knowing when this occurs can greatly aid in data reduction. The more accurate the VAD, the more efficiently a speech coding algorithm can operate. Another example is speech recognition. In this case, a clear indication of active speech periods is critical, and false detection of active speech periods directly degrades the recognition algorithm. VAD is an integral part of many speech processing systems. Other examples include audio conferencing, echo cancellation, VoIP (voice over IP), cellular radio systems (GSM and CDMA based) and hands-free telephony.
Many different techniques have been applied to the art of VAD. It is not uncommon for an algorithm to utilize a feature vector consisting of such features as full-band energy, sub-band energies, zero-crossing rate, cepstral coefficients, LPC (linear predictive coding) distance measures, pitch or spectral shape. Most have adaptive thresholds. Some algorithms require training periods to adapt to the environment or the actual speaker. Noise reduction techniques, such as Wiener filtering or spectral subtraction, are sometimes employed to improve the detection performance. Other less common approaches that utilize HMMs (hidden Markov models), wavelet transforms, and fuzzy logic have been studied and reported in the literature. Some algorithms are more successful than others, depending on the criteria. But in general, none will ever be a perfect solution for all applications because of the variety and varying nature of natural human speech and background noise.
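Two of the simpler features named above, full-band energy and zero-crossing rate, can be computed per frame as in the following minimal sketch. The function names, the 8 kHz sample rate, and the 20 ms frame length are assumptions made for illustration and are not taken from any particular VAD algorithm.

```python
import math
from typing import List


def full_band_energy(frame: List[float]) -> float:
    """Mean squared amplitude of the samples in the frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame: List[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ (frame needs >= 2 samples)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def feature_vector(frame: List[float]) -> List[float]:
    """Two-element feature vector: [full-band energy, zero-crossing rate]."""
    return [full_band_energy(frame), zero_crossing_rate(frame)]


# Assumed framing: 8 kHz sampling, 20 ms frame (160 samples), 200 Hz tone.
SAMPLE_RATE = 8000
FRAME_LEN = 160
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(FRAME_LEN)]

print(feature_vector(tone))
```

A low-frequency voiced sound such as this tone gives a high energy and a low zero-crossing rate, whereas broadband noise typically gives a much higher zero-crossing rate, which is why the two features are often used together.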
Since VAD is an inexact science, like many areas in speech processing, attempts have been made over the years to propose standardized algorithms for communication purposes. The International Telecommunication Union-Telecommunication Standardization Sector (ITU-T) is the governing body for proposed VAD standards. These standardized algorithms are generally proposed to accompany certain communication protocol standards, such as GSM. For further study of VAD algorithms and a useful comparison matrix between different methods, please see “Digital Speech”, A. Kondoz, 2004 John Wiley & Sons, Ltd, pages 357-377.
The disadvantage of current VAD algorithms is that they generally require feedback knowledge of the detector state to determine when to run background noise adaptation. Adaptive thresholds are meant to track the noise and thus must update only when no one is talking. A false detection can cause the algorithm to be stuck on or, in the worst case, stuck off. A reset mechanism is usually included to clear the state after a certain timeout period is exceeded. Another issue is that most algorithms work well only at high SNR (signal-to-noise ratio), and these approaches generally include noise reduction techniques to improve performance. But these methods are not very effective in the presence of non-Gaussian, non-stationary background noise. A further issue is that most techniques with better than average performance require significant processing in order to transform the input audio into the multi-feature vector usually required by the algorithm. This limits the use of many good algorithms to non-real-time applications or to systems that can afford the extra processing burden.
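The feedback dependence described above can be made concrete with a minimal sketch of a generic energy detector with an adaptive threshold. This is not any specific standardized algorithm; the function name, the threshold margin, the smoothing constant, the timeout length, and the choice to reset the noise estimate to the current frame energy are all assumptions made for illustration.

```python
from typing import List


def energy_vad(energies: List[float], margin: float = 3.0,
               alpha: float = 0.9, timeout: int = 50) -> List[bool]:
    """Per-frame speech decisions from frame energies (hypothetical parameters)."""
    noise = energies[0]      # assumption: the first frame contains noise only
    active_run = 0           # consecutive frames currently flagged as speech
    decisions = []
    for e in energies:
        speech = e > margin * noise
        if speech:
            # Feedback: the noise estimate is frozen while speech is detected.
            active_run += 1
            if active_run > timeout:
                # Reset mechanism: assume the "speech" was really a new noise level.
                noise = e
                active_run = 0
        else:
            # Noise adaptation runs only when the detector reports no speech.
            noise = alpha * noise + (1 - alpha) * e
            active_run = 0
        decisions.append(speech)
    return decisions


# Background noise energy steps from 1.0 to 5.0 with no speech present.
step = [1.0] * 10 + [5.0] * 100
decisions = energy_vad(step)
print(sum(decisions), "of", len(decisions), "frames flagged as speech")
```

In this example the rise in background noise is itself classified as speech, which freezes the noise estimate, so the detector stays on until the timeout clears the state. A shorter timeout recovers faster but risks resetting in the middle of genuinely long utterances.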