1. Field of the Invention
The present invention relates generally to speech processing in communications systems. More specifically, systems and methods for adaptive sidetone and adaptive voice activity detect (VAD) threshold for speech processing are disclosed.
2. Description of Related Art
Modern communication systems rely heavily on digital speech processing to operate efficiently. Examples of such communication systems are digital telephony trunks, voice mail, voice annotation, answering machines, digital voice over data links and the like. Such speech processing systems often incorporate a voice activity detect (VAD) function, also referred to as a signal classifier. The VAD determines when the user is speaking and when the user is silent. The output of the VAD, also known as a voicing decision, is binary. The voicing decision may be used to control, for example, when to measure the level of background noise, when to suppress sending speech packets across a wireless medium (silence suppression), when to adapt a speech filter or speech beamformer to the user's speech, or when to adapt a noise filter or noise beamformer to the background noise.
A VAD threshold is used to determine whether speech is present and is a critical parameter for the proper operation of speech processing systems that implement VAD. The VAD threshold may be a single fixed value, used for all levels of noise, that is compared against a running average of short-term integrated energy in the input signal over some integration interval, usually a few milliseconds to hundreds of milliseconds. The VAD threshold may also be adapted to the noise level as measured over a long interval, such as ten to hundreds of seconds. More complex solutions use a vector of VAD thresholds that are compared against the short-term energy in several audio frequency sub-bands; the per-band results are then combined in a weighted sum, where the weights reflect the relative importance of each sub-band.
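The fixed-threshold and sub-band approaches described above can be sketched as follows. This is an illustrative sketch only, not an implementation taken from any particular system: the function names, the exponential running average, the `smoothing` factor, and the `decision_level` cutoff are all assumptions introduced for the example.

```python
def short_term_energy(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def fixed_threshold_vad(samples, frame_len, threshold, smoothing=0.5):
    """Binary voicing decision per frame.

    A running (exponential) average of frame energy is compared
    against one fixed threshold. The exponential form and the
    smoothing value are assumptions made for this sketch.
    """
    decisions = []
    avg = 0.0
    for energy in short_term_energy(samples, frame_len):
        avg = smoothing * avg + (1.0 - smoothing) * energy
        decisions.append(avg > threshold)
    return decisions


def subband_vad(band_energies, band_thresholds, weights, decision_level=0.5):
    """Weighted vote across sub-bands.

    Each band is compared against its own threshold; the weights of the
    bands that exceed their thresholds are summed and normalized.
    decision_level is a hypothetical cutoff on that normalized score.
    """
    score = sum(
        w for e, t, w in zip(band_energies, band_thresholds, weights) if e > t
    )
    return score / sum(weights) > decision_level


# Two silent frames followed by two loud frames, 80 samples per frame.
signal = [0.0] * 160 + [1.0] * 160
voicing = fixed_threshold_vad(signal, frame_len=80, threshold=0.25)
```

In this sketch, the choice of `threshold` is the only parameter separating speech from silence, which is why its value matters so much in the single-threshold case.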
However, one problem with such VAD thresholds is that a fixed value is not optimal for all levels of ambient noise that may surround the speaker, particularly when the noise level is high. Normal speech may include as much as 60% silence on average in a two-way conversation. During the periods of silence, the microphone or other speech input device picks up the environmental or background noise. The noise characteristics and level may vary significantly, for example, from those of a quiet room to those of a noisy street. If the VAD threshold is too low, then the VAD will suffer a high rate of false positive errors in a high ambient noise situation. If the threshold is too high, then the VAD will suffer a high rate of false negative errors when the speaker is in a quiet environment.
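The trade-off described above can be illustrated with a small numerical sketch. The frame energies, labels, and threshold values below are made-up numbers chosen only to show the effect, and `error_rates` is a hypothetical helper, not part of any referenced system.

```python
def error_rates(energies, is_speech, threshold):
    """Return (false_positive_rate, false_negative_rate) for one threshold.

    A false positive is a non-speech frame whose energy exceeds the
    threshold; a false negative is a speech frame whose energy does not.
    """
    false_pos = sum(1 for e, s in zip(energies, is_speech) if not s and e > threshold)
    false_neg = sum(1 for e, s in zip(energies, is_speech) if s and e <= threshold)
    n_noise = sum(1 for s in is_speech if not s)
    n_speech = sum(1 for s in is_speech if s)
    return false_pos / n_noise, false_neg / n_speech


# Alternating noise-only and speech frames (illustrative values).
labels = [False, True, False, True]
quiet_room = [0.001, 0.05, 0.001, 0.05]   # low noise, soft speech
noisy_street = [0.2, 0.5, 0.2, 0.5]       # high noise, raised speech

LOW_THRESHOLD = 0.01    # assumed value tuned for the quiet room
HIGH_THRESHOLD = 0.3    # assumed value tuned for the noisy street

low_in_noise = error_rates(noisy_street, labels, LOW_THRESHOLD)
high_in_quiet = error_rates(quiet_room, labels, HIGH_THRESHOLD)
```

With these assumed numbers, the low threshold flags every noise-only frame on the noisy street as speech, while the high threshold misses every speech frame in the quiet room. Each threshold is error-free only in the environment it was chosen for.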
In addition, in a high noise environment, the speech-to-noise ratio is so low that even if the VAD threshold is set to the optimal point, the VAD algorithm makes enough errors that the threshold adaptation often adapts to the speaker's voice or does not have a chance to adapt to the unvoiced noise. This tends to draw the threshold away from the optimal point, which can further reduce the VAD accuracy.
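The drift described above can be sketched with a simple noise-tracking threshold. This is an assumed, simplified adaptation scheme chosen for illustration: the noise estimate is updated only on frames the VAD itself classifies as non-speech, and the threshold is a fixed multiple of that estimate. The names `margin` and `rate`, and all numeric values, are hypothetical.

```python
def adaptive_threshold_vad(energies, initial_noise, margin=3.0, rate=0.1):
    """Energy VAD whose threshold follows a running noise estimate.

    The noise estimate is updated only when the current frame is
    classified as non-speech, so classification errors feed back
    into the threshold.
    """
    noise = initial_noise
    decisions, thresholds = [], []
    for energy in energies:
        threshold = margin * noise
        voiced = energy > threshold
        if not voiced:
            # Frame treated as noise: pull the estimate toward its energy.
            noise = (1.0 - rate) * noise + rate * energy
        decisions.append(voiced)
        thresholds.append(threshold)
    return decisions, thresholds


# Low speech-to-noise ratio: speech energy (2.0) is only twice the
# noise energy (1.0), which is below the assumed 3x margin.
speech_frames = [2.0] * 50
decisions, thresholds = adaptive_threshold_vad(speech_frames, initial_noise=1.0)
```

In this example every speech frame is misclassified as noise, so the noise estimate adapts to the speaker's voice and the threshold climbs from 3.0 toward 6.0, moving further from a value that could detect the speech.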
Thus, it would be desirable to provide an improved VAD system with lower false positive and false negative rates in high noise environments.