In speech coding systems used for conversational speech, it is common to use discontinuous transmission (DTX) to increase the efficiency of the encoding. The reason is that conversational speech contains many pauses embedded in the speech, e.g. while one person is talking the other one is listening. With DTX, the speech encoder is therefore active only about 50 percent of the time on average, and the rest can be encoded using comfort noise. An example of a codec that has this feature is AMR NB (Adaptive Multi-Rate Narrowband).
For high quality DTX operation, i.e. without degraded speech quality, it is important to detect the periods of speech in the input signal. This is done by the Voice Activity Detector (VAD). FIG. 1 shows an overview block diagram of a generalized VAD 180, which takes the input signal 100, divided into data frames of 5-30 ms depending on the implementation, as input and produces VAD decisions as output 160. That is, a VAD decision 160 is a decision for each frame whether the frame contains speech or noise.
The generic VAD 180 comprises a background estimator 130, which provides subband energy estimates, and a feature extractor 120, which provides the feature subband energy. For each frame, the generic VAD calculates features, and to identify active frames, the feature(s) for the current frame are compared with an estimate of how the feature “looks” for the background signal.
The primary decision, “vad_prim” 150, is made by a primary voice activity detector 140 and is basically just a comparison of the features for the current frame and the background features (estimated from previous input frames), where a difference larger than a threshold causes an active primary decision. The hangover addition block 170 is used to extend the VAD decision from the primary VAD based on past primary decisions to form the final VAD decision, “vad_flag” 160, i.e. older VAD decisions are also taken into account. The reason for using hangover is mainly to reduce/remove the risk of mid speech and backend clipping of speech bursts. However, the hangover can also be used to avoid clipping in music passages. An operation controller 110 may adjust the threshold(s) for the primary detector and the length of the hangover addition according to the characteristics of the input signal.
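The relationship between the primary decision and the hangover addition can be illustrated with a minimal sketch. It is not taken from any particular codec: the function names, the single scalar feature, and the fixed hangover length are assumptions made for illustration only. As noted above, a real VAD may adapt both the threshold and the hangover length to the input signal.

```python
def primary_decision(feature, background_feature, threshold):
    """Return True (active) when the current feature exceeds the
    background estimate by more than the threshold.
    Hypothetical helper: a single scalar feature is assumed."""
    return (feature - background_feature) > threshold


def add_hangover(primary_decisions, hangover_frames=3):
    """Extend each active primary decision by a fixed number of frames.

    primary_decisions: iterable of booleans, one per frame (vad_prim).
    hangover_frames:   assumed fixed hangover length, in frames.
    Returns the list of final decisions (vad_flag).
    """
    final = []
    remaining = 0  # hangover frames still available
    for vad_prim in primary_decisions:
        if vad_prim:
            # Active frame: pass through and re-arm the hangover counter.
            remaining = hangover_frames
            final.append(True)
        elif remaining > 0:
            # Inactive primary decision, but still inside the hangover period.
            remaining -= 1
            final.append(True)
        else:
            final.append(False)
    return final
```

With a hangover of two frames, a single active primary decision followed by four inactive ones yields three active final decisions, which is how the hangover reduces the risk of back-end clipping at the end of a speech burst.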
There are a number of different features that can be used for voice activity detection. One simple approach is to look just at the frame energy and compare it with a threshold to decide whether the frame comprises speech or not. This scheme works reasonably well for conditions where the SNR is good, but not for low SNR cases. In low SNR it is instead required to use other metrics that compare the characteristics of the speech and noise signals. For real-time implementations, an additional requirement on VAD functionality is low computational complexity, and this is reflected in the frequent use of subband SNR VADs in standard codecs, e.g. AMR NB, AMR WB (Adaptive Multi-Rate Wideband) and G.718 (ITU-T recommendation embedded scalable speech and audio codec).
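A frame-energy detector of the kind described above can be sketched as follows. The sample normalization to the range -1 to 1, the function names, and the default threshold of -40 dB are assumptions chosen for illustration, not values from any standard.

```python
import math


def frame_energy_db(frame):
    """Mean energy of one frame in dB.
    Samples are assumed to be floats normalized to [-1, 1]."""
    energy = sum(s * s for s in frame) / len(frame)
    # A small floor avoids log10(0) for an all-zero frame.
    return 10.0 * math.log10(energy + 1e-12)


def energy_vad(frame, threshold_db=-40.0):
    """Declare the frame active when its energy exceeds a fixed threshold.
    The threshold value is an illustrative assumption."""
    return frame_energy_db(frame) > threshold_db
```

Because the threshold is fixed, any background noise louder than the threshold is classified as speech, which is why this scheme degrades at low SNR.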
The subband SNR based VAD combines the SNRs of the different subbands into a metric which is compared to a threshold for the primary decision. That is, the SNR is determined for each subband, and a combined SNR is determined based on those SNRs. The combined SNR may be a sum of the SNRs of the different subbands. There are also known solutions where multiple features with different characteristics are used for the primary decision. However, in both cases there is just one primary decision that is used for adding hangover, which may be adaptive to the input signal conditions, to form the final decision. Many VADs also have an input energy threshold for silence detection, i.e. for input levels that are low enough, the primary decision is forced to the inactive state.
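The combination of subband SNRs and the silence override can be sketched as below. This is a simplified illustration, not the method of AMR NB, AMR WB or G.718: summing linear (non-dB) per-band SNR ratios, the function names, and the default silence level are all assumptions.

```python
def combined_snr(subband_energies, background_energies, eps=1e-12):
    """Sum of per-subband SNRs, computed here as linear energy ratios.
    Summing linear ratios is an assumption; other combinations exist."""
    total = 0.0
    for energy, noise in zip(subband_energies, background_energies):
        # Floor the noise estimate to avoid division by zero.
        total += energy / max(noise, eps)
    return total


def subband_snr_vad(subband_energies, background_energies,
                    snr_threshold, silence_energy=1e-6):
    """Primary decision from the combined subband SNR.

    If the total input energy is below silence_energy (an assumed
    level), the decision is forced to inactive regardless of the SNR.
    """
    if sum(subband_energies) < silence_energy:
        return False
    return combined_snr(subband_energies, background_energies) > snr_threshold
```

The silence check comes first so that a very quiet input with an even quieter background estimate, which would otherwise show a high SNR, is still classified as inactive.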
For VADs based on the subband SNR principle, it has been shown that the introduction of a non-linearity in the subband SNR calculation, called significance thresholds, can improve VAD performance for conditions with non-stationary noise (babble, office). Non-stationary noise can be difficult for all VADs, especially under low SNR conditions, which results in a higher VAD activity compared to the actual speech and a reduced capacity from a system perspective. Of the non-stationary noise types, the most difficult is babble noise, because its characteristics are relatively close to those of the speech signal the VAD is designed to detect. Babble noise is usually characterized both by the SNR relative to the speech level of the foreground speaker and by the number of background talkers. A common definition (as used in subjective evaluations) is that babble should have 40 or more background speakers, the basic motivation being that it should not be possible to follow any of the speakers included in the babble noise (none of the babble speakers shall become intelligible). It should also be noted that babble noise becomes more stationary as the number of talkers increases. When there are only one or a few speakers in the background, they are usually called interfering talkers. A further problematic issue is that babble noise may have spectral variation characteristics very similar to some music pieces that the VAD algorithm shall not suppress.
In the previously mentioned VAD solutions, AMR NB/WB and G.718, there are varying degrees of problems with babble noise, in some cases already at reasonable SNRs (20 dB). The result is that the assumed capacity gain from using DTX cannot be realized. In real mobile phone systems it has also been noted that it may not be enough to require reasonable DTX operation at 15-20 dB SNR. If possible, one would desire reasonable DTX operation down to 5 dB or even 0 dB, depending on the noise type. For low frequency background noise, an SNR gain of 10-15 dB can be achieved for the VAD functionality just by highpass filtering the signal before VAD analysis. For babble noise, however, the gain from highpass filtering the input signal is very low, due to the similarity of babble to speech.
From a quality point of view it is better to use a failsafe VAD, meaning that when in doubt it is better for the VAD to signal speech input and simply allow for a large amount of extra activity. From a system capacity point of view, this may be acceptable as long as only a few of the users are in situations with non-stationary background noise. However, with an increasing number of users in non-stationary environments, the usage of failsafe VAD may cause a significant loss of system capacity. It is therefore becoming important to push the boundary between failsafe and normal VAD operation so that a larger class of non-stationary environments is handled using normal VAD operation.
Although the use of significance thresholds improves VAD performance, it has been noted that it may also cause occasional speech clipping, mainly front-end clipping of low SNR unvoiced sounds.
For existing solutions, when a new problem area is identified, it can be difficult to find a new tuning of an existing VAD that does not change the behavior of the VAD for conditions that already work. That is, while it would be possible to change the tuning to cope with the new problem, it may not be possible to do so without changing the behavior in already known conditions.