Voice activity detection (VAD) is a technique for determining a binary or probabilistic indicator of the presence of voice in a signal containing a mixture of voice and noise. The performance of a voice activity detector is usually assessed by the accuracy of its classification or detection. Research in this area is motivated by the use of voice activity detection algorithms to improve the performance of speech recognition, or to control the decision to transmit a signal in systems that benefit from discontinuous transmission. Voice activity detection is also used to control signal processing functions such as noise estimation and echo adaptation, and for specific algorithmic tuning such as the filtering of gain coefficients in noise suppression systems.
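As an illustration of the two output types mentioned above, the following minimal sketch derives a probabilistic indicator and a binary indicator from a single frame. It is not a method described in this document: the feature (frame level in dB), the logistic mapping, and the names `frame_level_db`, `voice_probability`, `voice_flag`, `midpoint_db` and `slope` are all assumptions chosen for illustration. A level-only score cannot tell voice from loud noise, so the sketch shows only the form of the output.

```python
import math

def frame_level_db(frame):
    """Mean-square level of one frame in dB (samples assumed to lie in [-1, 1])."""
    power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(power + 1e-12)  # small offset avoids log10(0)

def voice_probability(frame, midpoint_db=-40.0, slope=0.5):
    """Probabilistic indicator: logistic mapping of frame level to (0, 1).

    midpoint_db and slope are hypothetical tuning values, not taken from the text.
    """
    return 1.0 / (1.0 + math.exp(-slope * (frame_level_db(frame) - midpoint_db)))

def voice_flag(frame, threshold=0.5):
    """Binary indicator: the probabilistic score compared against a threshold."""
    return voice_probability(frame) >= threshold

# Two synthetic 20 ms frames at an assumed 16 kHz sample rate.
n = 320
quiet = [0.001 * math.sin(2 * math.pi * 50 * i / 16000) for i in range(n)]
loud = [0.3 * math.sin(2 * math.pi * 200 * i / 16000) for i in range(n)]

for name, frame in (("quiet", quiet), ("loud", loud)):
    print(name, round(voice_probability(frame), 4), voice_flag(frame))
```

The binary flag is simply the probabilistic score passed through a threshold, which is why the choice of threshold matters so much in applications that need a hard decision.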
The output of voice activity detection may be used directly for subsequent control or as metadata, and it may also be used to control the nature of audio processing algorithms operating on the real-time audio signal.
One particular application of interest for voice activity detection is transmission control. In communication systems where an endpoint may cease transmission, or send a reduced data rate signal, during periods of voice inactivity, the design and performance of the voice activity detector are critical to the perceived quality of the system. Such a detector must ultimately make a binary decision, and it faces a fundamental problem: to achieve low latency it must rely on features observable over a short time frame, and in many such features the characteristics of speech and noise substantially overlap. Hence, such a detector constantly faces a tradeoff between the prevalence of false alarms and the possibility of losing desired speech through incorrect decisions. The opposing requirements of low latency, sensitivity and specificity have no completely optimal solution, or at least create an operational landscape in which the efficiency or optimality of a system depends on the application and the expected input signal.
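The tradeoff described above can be made concrete with a small simulation. The sketch below is illustrative only: it assumes a single hypothetical per-frame feature (a level in dB) whose values for speech and noise are drawn from two overlapping Gaussian distributions with made-up parameters, and it applies the simple rule "feature above threshold means speech". The names `error_rates` and `sweep` are introduced here and do not come from the text.

```python
import random

def error_rates(speech, noise, threshold):
    """Return (false_alarm_rate, miss_rate) for the rule 'feature > threshold => speech'."""
    false_alarms = sum(1 for x in noise if x > threshold)   # noise frames flagged as speech
    misses = sum(1 for x in speech if x <= threshold)       # speech frames flagged as noise
    return false_alarms / len(noise), misses / len(speech)

def sweep(speech, noise, thresholds):
    """Evaluate both error rates at each candidate threshold."""
    return [(t, *error_rates(speech, noise, t)) for t in thresholds]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed feature distributions; they overlap by construction.
noise = [rng.gauss(-50.0, 6.0) for _ in range(2000)]
speech = [rng.gauss(-40.0, 8.0) for _ in range(2000)]

thresholds = [-60.0, -55.0, -50.0, -45.0, -40.0, -35.0, -30.0]
results = sweep(speech, noise, thresholds)

for t, fa, miss in results:
    print(f"threshold {t:6.1f} dB  false alarms {fa:.3f}  missed speech {miss:.3f}")
```

Raising the threshold can only lower the false alarm rate and raise the miss rate, and because the two distributions overlap, no threshold drives both to zero. Which operating point is acceptable depends on the application, for example whether clipped speech or unnecessary transmission is the more costly error.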