Voice activity detectors have long been used to enhance speech in audio signals and for a variety of other purposes including speech recognition or recognition of a particular speaker's voice.
Conventionally, voice activity detectors have relied upon fuzzy rules or heuristics in conjunction with features such as energy levels and zero-crossing rates to determine whether an audio signal includes speech. In some cases, the thresholds employed by conventional voice activity detectors are dependent upon the signal-to-noise ratio (SNR) of an audio signal, making it difficult to choose appropriate thresholds. In addition, while conventional voice activity detectors work well under conditions where an audio signal has a high SNR, they are less reliable when the SNR of the audio signal is low.
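By way of illustration only, a threshold-based detector of the kind described above might be sketched as follows. The function names and the threshold values are hypothetical and are not taken from any particular conventional detector; the sketch merely shows how fixed thresholds on energy and zero-crossing rate tie the decision to the level of the signal.

```python
import math

def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def heuristic_vad(frame, energy_threshold=0.01, zcr_threshold=0.25):
    """Flag a frame as speech if it is loud and has a low zero-crossing rate.

    Both thresholds are hypothetical values chosen for this illustration.
    """
    return (frame_energy(frame) > energy_threshold
            and zero_crossing_rate(frame) < zcr_threshold)

def tone(amplitude, freq_hz=200.0, sample_rate=8000, n_samples=160):
    """Synthetic stand-in for a voiced frame: a low-frequency sinusoid."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

loud_frame = tone(0.5)    # passes both thresholds
quiet_frame = tone(0.01)  # same waveform at a lower level fails the energy test
print(heuristic_vad(loud_frame), heuristic_vad(quiet_frame))
```

In this sketch the same waveform is accepted at one level and rejected at another, which is one way to see why fixed thresholds are difficult to choose when the signal level or SNR is not known in advance.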
Some voice activity detectors have been improved by the use of machine learning techniques, such as neural networks, which typically combine several mediocre voice activity detection (VAD) features to provide a more accurate voice activity estimate. (The term “neural network,” as used herein, may also refer to other machine learning techniques, such as support vector machines, decision trees, logistic regression, statistical classifiers, etc.) While these improved voice activity detectors work well with the audio signals that are used to train them, they are typically less reliable when applied to audio signals that have been obtained from different environments, that include different types of noise, or that include a different amount of reverberation than the audio signals that were used to train the voice activity detectors.
A technique known as “feature normalization” has been used to improve the robustness with which a voice activity detector evaluates audio signals having a variety of different characteristics. In Mean-Variance Normalization (MVN), for example, each element of the feature vectors is normalized to have a mean of zero and a variance of one. In addition to improving robustness to different data sets, feature normalization implicitly provides information about how the current time frame compares to previous frames. For example, if an unnormalized feature in a given isolated frame of data has a value of 0.1, that value may provide little information about whether the frame corresponds to speech, especially if the SNR is unknown. However, if the feature has been normalized based on the long-term statistics of the recording, it provides additional context about how the frame compares to the overall signal.
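As a minimal sketch of MVN, assuming the feature vectors for a whole recording are available at once, each element may be normalized using statistics computed over all frames. The function name, the batch formulation, and the example values are assumptions made for illustration only.

```python
import math

def mean_variance_normalize(features):
    """Normalize each feature element to zero mean and unit variance.

    features: list of frames, each a list of feature values of equal length.
    Statistics are computed over the whole recording (a batch assumption).
    """
    n = len(features)
    dims = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    variances = [sum((f[d] - means[d]) ** 2 for f in features) / n
                 for d in range(dims)]
    # Guard against constant features, which have zero variance.
    stds = [math.sqrt(v) if v > 0 else 1.0 for v in variances]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)]
            for f in features]

# Hypothetical two-element feature vectors for four frames.
frames = [[0.1, 10.0], [0.1, 12.0], [0.3, 14.0], [0.3, 16.0]]
normalized = mean_variance_normalize(frames)
print(normalized[0])
```

In this example the raw value of 0.1 in the first frame becomes a negative normalized value, indicating that the frame lies below the recording's long-term mean for that feature, which is context the raw value alone does not carry.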
However, traditional feature normalization techniques such as MVN are typically very sensitive to the percentage of an audio signal that corresponds to speech (i.e., the percentage of time that a person is speaking). If the speech data processed at runtime has a significantly different percentage of speech than the data that was used to train the neural network, the mean values of the VAD features will shift correspondingly, and the normalized features may produce misleading results. Accordingly, improvements are sought in voice activity detection and feature normalization.
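The sensitivity of MVN to the percentage of speech can be illustrated with a deliberately simplified sketch. It assumes a single hypothetical feature that takes the value 1.0 in every speech frame and 0.0 in every non-speech frame; real VAD features are not this clean, and the function names are illustrative only.

```python
import statistics

def mvn(values):
    """Normalize a sequence of scalar features to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def normalized_speech_value(speech_fraction, n_frames=100):
    """Normalized value of a speech frame for a given fraction of speech.

    Assumes an idealized feature: 1.0 for speech frames, 0.0 otherwise.
    The fraction must be strictly between 0 and 1 so that both kinds of
    frame are present.
    """
    n_speech = round(speech_fraction * n_frames)
    values = [1.0] * n_speech + [0.0] * (n_frames - n_speech)
    return mvn(values)[0]  # the first frame is a speech frame

# The same raw speech value maps to different normalized values.
print(normalized_speech_value(0.2))  # recording with 20% speech
print(normalized_speech_value(0.8))  # recording with 80% speech
```

Under these assumptions a speech frame normalizes to roughly 2.0 when 20% of the frames are speech and to roughly 0.5 when 80% are speech, so a classifier trained on one proportion would see shifted inputs when presented with the other.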