In speech coding systems used for conversational speech, it is common to use discontinuous transmission (DTX) to increase the efficiency of the encoding. The reason is that conversational speech contains many pauses embedded in the speech, e.g. while one person is talking the other one is listening. With DTX the speech encoder is therefore only active about 50 percent of the time on average, and the rest can be encoded using comfort noise. Comfort noise is an artificial noise generated on the decoder side that only resembles the characteristics of the noise on the encoder side and therefore requires less bandwidth. Some example codecs that have this feature are AMR NB (Adaptive Multi-Rate Narrowband) and EVRC (Enhanced Variable Rate CODEC). Note that AMR NB uses DTX whereas EVRC uses variable rate (VBR), where a Rate Determination Algorithm (RDA) decides which data rate to use for each frame, based on a VAD (voice activity detection) decision.
For high quality DTX operation, i.e. without degraded speech quality, it is important to detect the periods of speech in the input signal. This is done by the Voice Activity Detector (VAD), which is used for both DTX and RDA. It should be noted that speech is also referred to as voice. FIG. 1 shows an overview block diagram of a generalized VAD 180, which takes the input signal 100, divided into data frames of 5-30 ms depending on the implementation, as input and produces VAD decisions as output 160. That is, a VAD decision 160 is a decision for each frame on whether the frame contains speech or noise.
The generic VAD 180 comprises a background estimator 130, which provides sub-band energy estimates, and a feature extractor 120, which provides the feature sub-band energy. For each frame, the generic VAD 180 calculates features and, to identify active frames, compares the feature(s) for the current frame with an estimate of how the feature “looks” for the background signal.
A primary decision, “vad_prim” 150, is made by a primary voice activity detector 140 and is basically just a comparison of the features for the current frame and the background features estimated from previous input frames, where a difference larger than a threshold causes an active primary decision. A hangover addition 170 is used to extend the primary decision based on past primary decisions to form the final decision, “vad_flag” 160. The reason for using hangover is mainly to reduce or remove the risk of mid-speech and back-end clipping of speech bursts. However, the hangover can also be used to avoid clipping in music passages. An operation controller 110 may adjust the threshold(s) for the primary detector and the length of the hangover according to the characteristics of the input signal.
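The interplay between the primary decision and the hangover can be illustrated with a minimal sketch. It is not taken from any of the codecs mentioned. It assumes a single scalar feature (frame energy), a fixed background estimate, and a simple countdown hangover; the function name `vad_decisions` and its parameters are hypothetical.

```python
from typing import Iterable, List, Tuple


def vad_decisions(
    frame_energies: Iterable[float],
    background_energy: float,
    threshold: float,
    hangover_frames: int,
) -> List[Tuple[int, int]]:
    """Return (vad_prim, vad_flag) for each frame.

    Assumption: the background estimate is held constant here; a real
    detector would update it from previous input frames.
    """
    decisions = []
    hangover_left = 0
    for energy in frame_energies:
        # Primary decision: active if the feature exceeds the background
        # estimate by more than the threshold.
        vad_prim = 1 if energy - background_energy > threshold else 0
        if vad_prim:
            # Re-arm the hangover counter on every active primary decision.
            hangover_left = hangover_frames
            vad_flag = 1
        elif hangover_left > 0:
            # Hangover: keep the final decision active for a few more frames.
            hangover_left -= 1
            vad_flag = 1
        else:
            vad_flag = 0
        decisions.append((vad_prim, vad_flag))
    return decisions
```

With a hangover of two frames, a single active primary decision keeps the final decision active for the two frames that follow it, which is how back-end clipping of a speech burst is avoided in this sketch.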
There are a number of different features that can be used for voice activity detection. The most basic approach is to look just at the frame energy and compare it with a threshold to decide whether the frame is speech or not. This scheme works reasonably well for conditions where the SNR (signal-to-noise ratio) is high, but not for low SNR cases. In low SNR cases, other metrics comparing the characteristics of the speech and noise signals must be used instead. For real-time implementations an additional requirement on VAD functionality is low computational complexity, and this is reflected in the frequent representation of subband SNR VADs in standard codecs, e.g. AMR NB, AMR WB (Adaptive Multi-Rate Wideband), EVRC, and G.718 (ITU-T recommendation embedded scalable speech and audio codec). These example codecs also use threshold adaptation in various forms. In general, background and speech level estimates, which are also used for SNR estimation, can be updated based on decision feedback or on an independent secondary VAD. In either case, VAD=0 is to be interpreted as the input signal being estimated as noise and VAD=1 as the input signal being estimated as speech. Another option for level estimates is to use the minimum and maximum input energy to track the background and speech levels, respectively. To measure the variability of the input noise, it is possible to calculate the variance of prior frames over a sliding time window. Another solution is to monitor the amount of negative input SNR. This is, however, based on the assumption that negative SNR only arises due to variations in the input noise. A sliding time window of prior frames implies that one creates a buffer with variables of interest (frame energy or sub-band energies) for a specified number of prior frames. As new frames arrive, the buffer is updated by removing the oldest values from the buffer and inserting the newest.
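The sliding time window described above can be sketched as a fixed-length buffer of frame energies from which the variance is computed. This is an illustrative sketch only: the class name `SlidingEnergyWindow`, the use of the population variance, and the choice of a single energy value per frame are assumptions, not details of any particular codec.

```python
from collections import deque
from statistics import pvariance


class SlidingEnergyWindow:
    """Buffer of frame energies for a specified number of prior frames."""

    def __init__(self, num_frames: int) -> None:
        # A deque with maxlen drops the oldest value automatically
        # when a new value is appended to a full buffer.
        self._buffer = deque(maxlen=num_frames)

    def push(self, frame_energy: float) -> None:
        """Insert the newest frame energy, evicting the oldest if full."""
        self._buffer.append(frame_energy)

    def variance(self) -> float:
        """Population variance of the buffered energies (0.0 if too few)."""
        if len(self._buffer) < 2:
            return 0.0
        return pvariance(self._buffer)


window = SlidingEnergyWindow(num_frames=3)
for energy in [1.0, 2.0, 3.0, 4.0]:
    window.push(energy)
# The buffer now holds the three most recent energies: 2.0, 3.0, 4.0.
noise_variability = window.variance()
```

The same buffer could hold a vector of sub-band energies per frame instead of a single frame energy; the update rule (remove the oldest, insert the newest) is unchanged.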
Non-stationary noise can be difficult for all VADs, especially under low SNR conditions, which results in a higher VAD activity compared to the actual speech and reduced capacity from a system perspective. That is, frames not comprising speech are identified as comprising speech. Of the non-stationary noise types, the most difficult for VADs to handle is babble noise, the reason being that its characteristics are relatively close to those of the speech signal that the VAD is designed to detect. Babble noise is usually characterized both by the SNR relative to the speech level of the foreground speaker and by the number of background talkers, where a common definition used in subjective evaluations is that babble should have 40 or more background speakers. The basic motivation is that for babble it should not be possible to follow any of the speakers included in the babble noise, implying that none of the babble speakers shall be intelligible. It should also be noted that with an increasing number of talkers, the babble noise becomes more stationary. With only one (or a few) speaker(s) in the background, they are usually called interfering talker(s). A further problematic issue is that babble noise may have spectral variation characteristics very similar to some music pieces that the VAD algorithm shall not suppress.
In the previously mentioned VAD solutions AMR NB/WB, EVRC and G.718, there are varying degrees of problems with babble noise, in some cases already at reasonable SNRs (20 dB). The result is that the assumed capacity gain from using DTX cannot be realized. In real mobile phone systems it has also been noted that it may not be enough to require reasonable DTX/VBR operation at 15-20 dB SNR. If possible, one would desire reasonable DTX/VBR operation down to 5 dB or even 0 dB, depending on the noise type. For low frequency background noise, an SNR gain of 10-15 dB can be achieved for the VAD functionality just by highpass filtering the signal before VAD analysis. For babble noise, however, the gain from highpass filtering the input signal is very low due to the similarity of babble to speech.
For VADs based on the subband SNR principle, where the input signal is divided into a plurality of sub-bands and the SNR is determined for each band, it has been shown that the introduction of a non-linearity in the subband SNR calculation, called significance thresholds, can improve VAD performance for conditions with non-stationary noise such as babble noise and office background noise.
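One possible form of such a non-linearity is sketched below. The text above does not specify how the significance threshold is applied, so the rule used here (a per-band SNR below the threshold is replaced by a floor value before the bands are summed) is an assumption made for illustration, as are the function name `summed_subband_snr` and all of its parameters.

```python
from typing import Sequence


def summed_subband_snr(
    band_energies: Sequence[float],
    background_energies: Sequence[float],
    significance_threshold: float,
    floor_value: float = 0.0,
) -> float:
    """Sum per-band SNRs, ignoring bands that are not 'significant'.

    Assumption: a band whose SNR is below the significance threshold
    contributes only floor_value to the sum.
    """
    if len(band_energies) != len(background_energies):
        raise ValueError("band and background energies must have equal length")
    total = 0.0
    for energy, noise in zip(band_energies, background_energies):
        # Guard against division by zero for an empty background estimate.
        snr = energy / max(noise, 1e-12)
        if snr < significance_threshold:
            snr = floor_value
        total += snr
    return total
```

In this sketch, bands whose energy barely exceeds the background estimate no longer accumulate into the summed SNR, so many small fluctuations spread over several bands are less likely to add up to an active primary decision.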
It has also been noted that G.718 shows problems with tracking the background noise for some types of input noise, including babble type noise. This causes problems with the VAD, as accurate background estimates are essential for any type of VAD that compares the current input with an estimated background.
From a quality point of view it is better to use a failsafe VAD, meaning that when in doubt it is better for the VAD to signal speech input than noise input, thereby allowing for a large amount of extra activity. This may, from a system capacity point of view, be acceptable as long as only a few of the users are in situations with non-stationary background noise. However, with an increasing number of users in non-stationary environments, the usage of a failsafe VAD may cause a significant loss of system capacity. It is therefore becoming important to push the boundary between failsafe and normal VAD operation so that a larger class of non-stationary environments is handled using normal VAD operation.
Although the use of significance thresholds improves VAD performance, it has been noted that it may also cause occasional speech clipping, mainly front-end clipping of low SNR unvoiced sounds.
As was shown above, it is already common to use some form of threshold adaptation. From prior art there are examples where
VAD_thr = f(N_tot),
VAD_thr = f(N_tot, E_sp), or
VAD_thr = f(SNR, N_v),
where VAD_thr is the VAD threshold, N_tot is the estimated noise energy, E_sp is the estimated speech energy, SNR is the estimated signal-to-noise ratio, and N_v is the estimated noise variation based on negative SNR.