In speech coding systems used for conversational speech it is common to use discontinuous transmission (DTX) to increase the efficiency of the encoding (reduce the bit rate). The reason is that conversational speech contains many pauses embedded in the speech, e.g. while one person is talking the other one is listening. With DTX the speech encoder is therefore only active about 50 percent of the time on average, and the rest is encoded using comfort noise. One example of a codec that can be used in DTX mode is the AMR codec, described in reference [1].
For high quality DTX operation, i.e. without degraded speech quality, it is important to detect the periods of speech in the input signal, which is done by the Voice Activity Detector (VAD). With increasing use of rich media it is also important that the VAD detects music signals so that they are not replaced by comfort noise, since this has a negative effect on the end user quality. FIG. 1 shows an overview block diagram of a generalized VAD according to prior art, which takes the input signal (divided into data frames of 10-30 ms, depending on the implementation) as input and produces VAD decisions as output (one decision for each frame).
FIG. 1 illustrates the major functions of a generalized prior art VAD 10, which consists of: a feature extractor 11, a background estimator 12, a primary voice detector 13, a hangover addition block 14, and an operation controller 15. While different VADs use different features and strategies for estimation of the background, the basic operation is still the same.
The primary decision “vad_prim” is made by the primary voice detector 13 and is basically only a comparison of the feature for the current frame (extracted in the feature extractor 11) with the background feature (estimated from previous input frames in the background estimator 12). A difference larger than a threshold causes an active primary decision “vad_prim”. The hangover addition block 14 is used to extend the primary decision based on past primary decisions to form the final decision “vad_flag”. This is mainly done to reduce or remove the risk of mid-speech and back-end clipping of speech bursts. However, it is also used to avoid clipping in music passages, as described in references [1], [2] and [3]. As indicated in FIG. 1, an operation controller 15 may adjust the threshold for the primary detector 13 and the length of the hangover addition according to the characteristics of the input signal.
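The relation between “vad_prim” and “vad_flag” described above can be illustrated with a minimal sketch. It is not taken from any of the referenced codecs: the function names, the scalar feature, and the fixed hangover length are assumptions made for illustration only, and real detectors typically add hangover only after a burst of several active frames.

```python
from typing import Iterable, List


def primary_decision(feature: float, background: float, threshold: float) -> bool:
    """Active when the current feature exceeds the background by more than the threshold."""
    return (feature - background) > threshold


def add_hangover(vad_prim: Iterable[bool], hangover_frames: int) -> List[bool]:
    """Extend each active primary decision by a fixed number of frames (assumed rule)."""
    vad_flag = []
    remaining = 0  # hangover frames still available
    for active in vad_prim:
        if active:
            remaining = hangover_frames  # reload the hangover counter
            vad_flag.append(True)
        elif remaining > 0:
            remaining -= 1  # spend one hangover frame
            vad_flag.append(True)
        else:
            vad_flag.append(False)
    return vad_flag
```

With a hangover of two frames, a single active primary decision keeps “vad_flag” active for the two following frames, which is how short dips inside or at the end of a speech burst avoid being clipped.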
As indicated in FIG. 1, another important functional part of the VAD 10 is the estimation of the background feature in the background estimator 12. This may be done by two basically different principles, either by using the primary decision “vad_prim”, i.e. with decision feed-back; or by using some other characteristics of the input signal, i.e. without decision feed-back. To some degree it is also possible to combine the two principles.
Below is a brief description of different VADs and their related problems.
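The two principles for background estimation can be contrasted with a minimal sketch. Both update rules are assumptions chosen for illustration (exponential smoothing gated by the primary decision, and a sliding-window minimum that ignores the decision); neither is claimed to be the rule used by any particular codec, and the names and default parameters are hypothetical.

```python
from collections import deque


def update_with_feedback(background: float, feature: float,
                         vad_prim: bool, alpha: float = 0.9) -> float:
    """Decision feed-back: adapt the estimate only on frames judged inactive."""
    if vad_prim:
        return background  # freeze the estimate during detected activity
    return alpha * background + (1.0 - alpha) * feature


class MinimumTracker:
    """No decision feed-back: track the minimum feature over the last frames."""

    def __init__(self, window: int = 50):
        self.history = deque(maxlen=window)  # oldest frames drop out automatically

    def update(self, feature: float) -> float:
        self.history.append(feature)
        return min(self.history)
```

The first variant shows why feed-back couples the estimate to the detector: a wrong primary decision changes which frames are used for adaptation. The second variant depends only on the input signal.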
AMR VAD1
The AMR VAD1 is described in TS26.094, reference [1], and variations are described in reference [2].
Summary of basic operation (for more details see reference [1]):
    Feature: Summing of subband SNRs
    Background: Background estimate adaptation based on previous decisions
    Control: Threshold adaptation based on input noise level
    Other: Deadlock recovery analysis for step increases in noise level based on stationarity estimation. High frequency correlation to detect music/complex signals and allow for extended hangover for such signals.
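The combination of a summed subband SNR feature with a noise-dependent threshold can be sketched as follows. This is a simplified illustration and not the procedure of reference [1]: the power-ratio form of the SNR, the linear threshold interpolation, the assumed direction of adaptation (lower threshold at higher noise level), and all constants are assumptions.

```python
from typing import Sequence


def snr_sum(band_powers: Sequence[float], noise_powers: Sequence[float],
            floor: float = 1e-9) -> float:
    """Sum of per-band power ratios between the input and the background estimate."""
    return sum(p / max(n, floor) for p, n in zip(band_powers, noise_powers))


def adapted_threshold(noise_level: float, thr_quiet: float = 10.0,
                      thr_noisy: float = 4.0, noise_lo: float = 1.0,
                      noise_hi: float = 100.0) -> float:
    """Interpolate the threshold between a quiet and a noisy setting (hypothetical values)."""
    if noise_level <= noise_lo:
        return thr_quiet
    if noise_level >= noise_hi:
        return thr_noisy
    frac = (noise_level - noise_lo) / (noise_hi - noise_lo)
    return thr_quiet + frac * (thr_noisy - thr_quiet)


def vad_prim(band_powers: Sequence[float], noise_powers: Sequence[float]) -> bool:
    """Primary decision: summed SNR compared with the noise-adapted threshold."""
    noise_level = sum(noise_powers)
    return snr_sum(band_powers, noise_powers) > adapted_threshold(noise_level)
```

Because the threshold depends on the estimated noise level, and the noise estimate in turn depends on earlier decisions, a change in sensitivity propagates into later frames.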
The major problem with this solution is that some complex backgrounds (e.g. babble, and especially at high input levels) cause a significant amount of excessive activity. The result is a drop in the DTX efficiency gain and in the associated system performance.
The use of decision feedback for background estimation also makes it difficult to change detector sensitivity, since even small changes in the sensitivity will have an effect on background estimation, which may in turn have a significant effect on future activity decisions. While it is the threshold adaptation based on input noise level that causes the level sensitivity, it is desirable to keep the adaptation since it improves performance for detecting speech in low SNR stationary noise.
While the solution also includes a music detector which works in most cases, music segments have been identified which are missed by the detector and therefore cause significant degradation of the subjective quality of the decoded (music) signal, i.e. segments are replaced by comfort noise.
EVRC VAD
The EVRC VAD is described in references [4] and [5] as EVRC RDA.
The main technologies used are:
    Feature: Split band analysis, with the worst case band used for rate selection in a variable rate speech codec.
    Background: Decision based increase with instant drop to input level.
    Control: Adaptive noise hangover addition principle is used to reduce primary detector mistakes. Hong et al. describe noise hangover adaptation in reference [6].
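The background principle "decision based increase with instant drop to input level" can be read as the following minimal sketch. It is one possible interpretation and is not taken from references [4] or [5]: the function name, the multiplicative rise factor, and the cap at the input energy are assumptions.

```python
def update_background(background: float, energy: float,
                      active: bool, rise: float = 1.03) -> float:
    """Slow, decision-gated increase; immediate drop when the input falls below the estimate."""
    if energy < background:
        return energy  # instant drop to the input level
    if not active:
        # Increase only on frames decided as inactive, never above the input level.
        return min(background * rise, energy)
    return background  # hold the estimate while activity is detected
```

The asymmetry means the estimate follows a falling noise floor at once but needs many inactive frames to follow a rising one.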
The existing split band solution, the EVRC VAD, occasionally makes bad decisions, which reduces the reliability of detecting speech, and its frequency resolution is too low, which affects the reliability of detecting music.
Voice Activity Detection by Freeman/Barrett
Freeman, see reference [7], discloses a VAD with independent noise spectrum estimation.
Barrett, see reference [8], discloses a tone detector mechanism that does not mistakenly characterize low frequency car noise as signaling tones.
Existing solutions based on Freeman/Barrett occasionally show too low sensitivity (e.g. for background music).
AMR VAD2
The AMR VAD2 is described in TS26.094, reference [1].
Technology:
    Feature: Summing of FFT based subband SNRs
    Background: Background estimate adaptation based on previous decisions
    Control: Threshold adaptation based on input signal level and adaptive noise hangover.
As this solution is similar to the AMR VAD1, it shares the same type of problems.