In communication systems utilizing discontinuous transmission (DTX), it is important to balance transmission efficiency against perceived quality. In such systems an activity detector is used to indicate active signals, e.g. speech or music, which are to be actively coded, and segments with background signals, which can be replaced with comfort noise generated at the receiver side. If the activity detector is too aggressive in detecting non-activity, it will introduce clipping in the active signal, which is perceived as subjective quality degradation when the clipped active segment is replaced with comfort noise. Conversely, the efficiency of the DTX is reduced if the activity detector is not aggressive enough and classifies background noise segments as active, so that the background noise is actively encoded instead of the system entering a DTX mode with comfort noise. In most cases the clipping problem is considered worse.
FIG. 1 shows an overview block diagram of a generalized sound activity detector (SAD) or voice activity detector (VAD), which takes an audio signal as input and produces an activity decision as output. The input signal is divided into data frames, i.e. audio signal segments of e.g. 5-30 ms, depending on the implementation, and one activity decision per frame is produced as output.
A primary decision, “prim”, is made by the primary detector illustrated in FIG. 1. The primary decision is basically just a comparison of the features of a current frame with background features, which are estimated from previous input frames. A difference between the features of the current frame and the background features which is larger than a threshold causes an active primary decision. The hangover addition block is used to extend the primary decision based on past primary decisions to form the final decision, “flag”. The reason for using hangover is mainly to reduce or remove the risk of mid and backend clipping of bursts of activity. As indicated in the figure, an operation controller may adjust the threshold(s) for the primary detector and the length of the hangover addition according to the characteristics of the input signal. The background estimator block is used for estimating the background noise in the input signal. The background noise may also be referred to as “the background” or “the background feature” herein.
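The primary decision and hangover addition described above can be illustrated with a minimal sketch. The function names, the choice of summed positive per-band excess as the difference measure, and the fixed hangover length are assumptions made for illustration only; they are not taken from any particular codec.

```python
from typing import List, Sequence


def primary_decision(frame_feat: Sequence[float],
                     bg_feat: Sequence[float],
                     threshold: float) -> bool:
    """Return True (active) when the frame exceeds the background by more than threshold.

    Assumption: the "difference" is the sum of positive per-band excesses.
    """
    diff = sum(max(f - b, 0.0) for f, b in zip(frame_feat, bg_feat))
    return diff > threshold


def add_hangover(prim: Sequence[bool], hangover_frames: int) -> List[bool]:
    """Extend each active primary decision by up to hangover_frames inactive frames."""
    flags = []
    remaining = 0  # hangover frames still available after the last active frame
    for p in prim:
        if p:
            remaining = hangover_frames
            flags.append(True)
        elif remaining > 0:
            remaining -= 1
            flags.append(True)
        else:
            flags.append(False)
    return flags
```

With a hangover of two frames, a short pause inside a burst of activity is kept active, while a longer pause eventually falls back to inactive.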
Estimation of the background feature can be done according to two fundamentally different principles: either by using the primary decision, i.e. with decision or decision metric feedback, which is indicated by the dash-dotted line in FIG. 1, or by using some other characteristics of the input signal, i.e. without decision feedback. It is also possible to use combinations of the two strategies.
An example of a codec using decision feedback for background estimation is AMR-NB (Adaptive Multi-Rate Narrowband) and examples of codecs where decision feedback is not used are EVRC (Enhanced Variable Rate CODEC) and G.718.
There are a number of different signal features or characteristics that can be used, but one common feature utilized in VADs is the frequency characteristics of the input signal. A commonly used frequency characteristic is the sub-band frame energy, due to its low complexity and reliable operation in low SNR. It is therefore assumed that the input signal is split into different frequency sub-bands and the background level is estimated for each of the sub-bands. In this way, one of the background noise features is the vector with the energy values for each sub-band. These are values that characterize the background noise in the input signal in the frequency domain.
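As an illustration of the sub-band frame energy feature, the following sketch computes one energy value per sub-band for a single frame. It assumes uniform-width bands and uses a naive DFT for self-containment; the function name and band layout are hypothetical, and real codecs typically use a filter bank or FFT with non-uniform bands.

```python
import cmath
from typing import List, Sequence


def subband_energies(frame: Sequence[float], num_bands: int) -> List[float]:
    """Return the energy in each of num_bands uniform sub-bands of one frame.

    Assumption: the lower half of the spectrum is split into equal-width bands.
    """
    n = len(frame)
    half = n // 2
    if num_bands <= 0 or half % num_bands != 0:
        raise ValueError("half the frame length must be divisible by num_bands")

    # Power spectrum for bins 0 .. half-1 (naive O(n^2) DFT, fine for a sketch).
    power = []
    for k in range(half):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, x in enumerate(frame))
        power.append(abs(acc) ** 2)

    band_size = half // num_bands
    return [sum(power[b * band_size:(b + 1) * band_size])
            for b in range(num_bands)]
```

The returned list corresponds to the energy vector referred to above; a background estimator would maintain one estimate per entry.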
To achieve tracking of the background noise, the actual background noise estimate update can be made in at least three different ways. One way is to use an autoregressive (AR) process per frequency bin to handle the update. Examples of such codecs are AMR-NB and G.718. Basically, for this type of update, the step size of the update is proportional to the observed difference between the current input and the current background estimate. Another way is to use multiplicative scaling of a current estimate with the restriction that the estimate can never be bigger than the current input or smaller than a minimum value. This means that the estimate is increased each frame until it is higher than the current input. In that situation the current input is used as the estimate. EVRC is an example of a codec using this technique for updating the background estimate for the VAD function. Note that EVRC uses different background estimates for VAD and noise suppression. It should be noted that a VAD may be used in other contexts than DTX. For example, in variable rate codecs, such as EVRC, the VAD may be used as part of a rate determining function.
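The first two update principles can be sketched for a single sub-band as follows. The function names and the parameters alpha, factor and floor are hypothetical and chosen for illustration; the actual constants and conditions used in AMR-NB, G.718 and EVRC are defined in their respective specifications and are not reproduced here.

```python
def ar_update(bg: float, current: float, alpha: float) -> float:
    """AR-type update: the step is proportional to (current - bg).

    alpha (assumed 0 < alpha <= 1) controls how fast the estimate tracks the input.
    """
    return bg + alpha * (current - bg)


def multiplicative_update(bg: float, current: float,
                          factor: float, floor: float) -> float:
    """Multiplicative update with the estimate limited by input and a minimum.

    The estimate grows by factor (assumed > 1) each frame, is capped at the
    current input, and is never allowed below floor.
    """
    scaled = bg * factor
    capped = min(scaled, current)  # never bigger than the current input
    return max(capped, floor)      # never smaller than the minimum value
```

In the multiplicative variant the estimate creeps upward during loud segments and snaps down to the input as soon as the input drops below it, whereas the AR variant moves smoothly in both directions.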
A third way is to use a so-called minimum technique where the estimate is the minimum value during a sliding time window of prior frames. This basically gives a minimum estimate which is scaled, using a compensation factor, to get an approximate average estimate for stationary noise.
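A minimal sketch of the minimum technique for one sub-band is given below. The class name, window length and compensation factor are assumptions for illustration; suitable values depend on the frame length and the noise statistics.

```python
from collections import deque


class MinimumTracker:
    """Track the minimum energy over a sliding window of recent frames."""

    def __init__(self, window_frames: int, compensation: float) -> None:
        # deque with maxlen drops the oldest frame automatically.
        self.history = deque(maxlen=window_frames)
        self.compensation = compensation

    def update(self, energy: float) -> float:
        """Add one frame energy and return the compensated minimum estimate."""
        self.history.append(energy)
        return self.compensation * min(self.history)
```

Because the minimum of a fluctuating stationary noise lies below its mean, the compensation factor (assumed greater than 1) scales the result toward an approximate average level.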
In high SNR cases, where the signal level of the active signal is much higher than that of the background signal, it may be quite easy to decide whether an input audio signal is active or non-active. However, separating active and non-active signals in low SNR cases, and in particular when the background is non-stationary or even similar to the active signal in its characteristics, is very difficult.
The performance of the VAD depends on the ability of the background noise estimator to track the characteristics of the background, in particular when it comes to non-stationary backgrounds. With better tracking it is possible to make the VAD more efficient without increasing the risk of speech clipping.
While correlation is an important feature that is used to detect speech, mainly the voiced part of the speech, there are also noise signals that show high correlation. In these cases the correlated noise will prevent update of the background noise estimates. The result is high activity, as both speech and background noise are coded as active content. While for high SNRs (approximately >20 dB) it would be possible to reduce the problem using energy-based pause detection, this is not reliable for the SNR range from 20 dB down to 10 dB or possibly 5 dB. It is in this range that the solution described herein makes a difference.