Background noise estimates are used as a characterization of the background noise and are of use in applications such as noise suppression, Voice Activity Detectors and SNR (Signal-to-Noise Ratio) estimates.
Among the more important properties of a background noise estimate are that it should be able to track changes in the input noise characteristics and that it should be able to handle step changes, such as sudden changes in the noise characteristics and/or level, while still avoiding the use of non-noise segments to update the estimate.
In speech coding systems used for conversational speech it is common to use discontinuous transmission (DTX) to increase the efficiency of the encoding. It is also possible to use variable bit rate (VBR) encoding to reduce the bit rate. The reason is that conversational speech contains a large number of pauses embedded in the speech, e.g. while one person is talking the other one is listening. So with DTX the speech encoder is only active about 50 percent of the time on average, and the rest is encoded using comfort noise. One example of a codec that uses DTX is AMR (Adaptive Multi-Rate) Narrowband. For high quality DTX operation, i.e. without degraded speech quality, it is important to detect the periods of speech in the input signal. This is done by the Voice Activity Detector (VAD). The DTX logic uses the VAD results to decide how and when to switch between speech and comfort noise.
FIG. 1 shows an overview block diagram of a generalized VAD 180, which takes as input the input signal 100, divided into data frames of 5-30 ms depending on the implementation, and produces VAD decisions as output 160. That is, a VAD decision 160 is a decision for each frame on whether the frame contains speech or noise, and is also referred to as “vad_flag”.
The generic VAD 180 comprises a feature extractor 120 which extracts the main feature used for VAD decisions from the input signal. One such example is subband energy, used as a frequency representation of each frame of the input signal. For the decision making, a background estimator 130 provides subband energy estimates of the background signal (estimated over earlier input frames). An operation controller 110 collects characteristics of the input signal, such as long term noise level, long term speech level (for long term SNR calculation) and long term noise level variation, which are input signals to a primary voice detector.
A preliminary decision, “vad_prim” 150, is made by a primary voice activity detector 140 and is basically just a comparison of the features for the current frame and background features (estimated from previous input frames), where a difference larger than a threshold causes an active primary decision. A hangover addition block 170 is used to extend the primary decision based on past primary decisions to form the final decision, “vad_flag” 160. The reason for using hangover is mainly to reduce/remove the risk of mid speech and backend clipping of speech bursts. However, the hangover can also be used to avoid clipping in music passages. The operation controller 110 may adjust the threshold(s) for the primary voice activity detector 140 and the length of the hangover addition 170 according to the characteristics of the input signal.
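The primary decision and the hangover addition described above can be illustrated with a minimal sketch. The function names, the summed per-subband difference used as the decision metric, and the fixed hangover length are assumptions made for illustration only; an actual VAD may use a different metric and an adaptive hangover.

```python
def primary_decision(frame_feats, bg_feats, threshold):
    """Hypothetical vad_prim: active when the summed difference between
    the current subband energies and the background estimate exceeds
    a threshold."""
    diff = sum(f - b for f, b in zip(frame_feats, bg_feats))
    return diff > threshold


def add_hangover(prim_decisions, hangover_frames):
    """Hypothetical hangover addition: keep the final decision active for
    `hangover_frames` frames after the last active primary decision."""
    final = []
    remaining = 0
    for prim in prim_decisions:
        if prim:
            remaining = hangover_frames  # restart the hangover counter
            final.append(True)
        elif remaining > 0:
            remaining -= 1               # inactive frame covered by hangover
            final.append(True)
        else:
            final.append(False)
    return final
```

With a hangover of two frames, the primary sequence active, inactive, inactive, inactive yields a final decision that stays active for the first three frames and becomes inactive on the fourth, which is how the hangover reduces back-end clipping of a speech burst.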
The background estimation can be done according to two basically different principles: either by using the primary decision, i.e. with decision (or decision metric) feedback, indicated by the dash-dotted line in FIG. 1, or by using some other characteristics of the input signal, i.e. without decision feedback. It is also possible to use combinations of the two strategies.
There are a number of different features that can be used but one feature utilized in VADs is the frequency characteristics of the input signal. Calculating the energy in frequency subbands for the input signal is one popular way of representing the input frequency characteristics. In this way one of the background noise features is the vector with the energy values for each subband. These are values that characterize the background noise in the input signal in the frequency domain.
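As an illustration of the subband energy feature, the following minimal sketch computes a power spectrum for one frame and sums it over equal-width subbands. The function name, the naive DFT, the omission of a window function and the equal-width band layout are all assumptions chosen for brevity; practical implementations typically use an FFT or a filter bank and often non-uniform bands.

```python
import cmath
import math


def subband_energies(frame, n_bands):
    """Return a vector with one energy value per subband for one frame.

    Assumes len(frame) // 2 is divisible by n_bands; any leftover bins
    would simply be ignored by this sketch.
    """
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        # Naive DFT of bin k (O(n^2) overall, acceptable for a sketch).
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        power.append(abs(acc) ** 2)
    width = half // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]
```

For a 32-sample frame containing a single sinusoid at DFT bin 5 and four subbands of four bins each, essentially all of the energy falls in the second subband (bins 4 to 7).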
To achieve tracking, the actual noise estimate update can be made in at least three different ways. The first way is to use an AR-process (Autoregressive process) per frequency bin to handle the update. For this type of update, the step size of the update is proportional to the observed difference between the current input and the current background estimate. The second way is to use multiplicative scaling of the current estimate, with the restriction that the estimate is never bigger than the current input or smaller than a minimum value. This means that the estimate is increased for each frame until it is higher than the current input. In that situation the current input is used as the estimate. The third way is to use a minimum technique, where the estimate is the minimum value during a sliding time window of prior frames. This gives a minimum estimate which is scaled, using a compensation factor, to get an approximate average estimate for stationary noise. A sliding time window of prior frames implies that one creates a buffer with the variables of interest (frame energy or subband energies) for a specified number of prior frames. As new frames arrive, the buffer is updated by removing the oldest values from the buffer and inserting the newest.
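The three update strategies can be sketched for a single subband as follows. All names and parameters (`alpha`, `gain`, `floor`, `window`, `compensation`) are hypothetical, and the choice to let the minimum value take precedence when the current input falls below it is an assumption of this sketch.

```python
from collections import deque


def ar_update(bg, cur, alpha):
    """First-order AR update: the step is proportional to (cur - bg)."""
    return bg + alpha * (cur - bg)


def multiplicative_update(bg, cur, gain, floor):
    """Scale the estimate up by `gain` each frame, but never above the
    current input and never below `floor`."""
    return max(floor, min(bg * gain, cur))


class MinimumTracker:
    """Minimum over a sliding window of prior frames, scaled by a
    compensation factor to approximate an average noise level."""

    def __init__(self, window, compensation):
        # deque with maxlen drops the oldest value when a new one arrives.
        self.buf = deque(maxlen=window)
        self.compensation = compensation

    def update(self, cur):
        self.buf.append(cur)
        return self.compensation * min(self.buf)
```

For example, `ar_update(1.0, 3.0, 0.5)` moves the estimate halfway towards the input, while `multiplicative_update(4.0, 5.0, 1.5, 0.1)` is capped at the current input of 5.0 because the scaled estimate of 6.0 would exceed it.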
While the minimum estimation technique has low complexity, the resulting estimate may not be accurate enough for varying background noise. The reason is that a long sliding time window may at times result in an estimate that is too low, while a short sliding time window may result in an estimate that is too large. With the sliding time window it is also not clear how the background estimator will work for music type input.
Using the multiplicative scaling of the current estimate, with the restriction that the estimate cannot be bigger than the current value, shows better tracking than the pure minimum estimation technique, but there is still a problem in tracking quick increases in a varying background. The tracking works until the increase rate exceeds the rate permitted by the multiplicative scaling.
Using AR-processes for background update has the potential to be efficient at tracking the background noise level. However, a decision error where the updating of the background estimate is made with non-noise data can result in a poor estimate of the background. Especially for VAD solutions relying on decision feedback, an inaccurate background estimate can lead to even more decision errors.
So, to avoid updating the background estimate with non-noise data, there are usually many restrictions on when to update the background estimate, at least upwards. While the many restrictions will reduce the risk of using non-noise data for the update, they will at the same time reduce the ability of the estimator to track varying background noise, especially in the case of non-stationary background noise. By allowing the estimates to always be updated downwards, the effect of some decision errors can be reduced. A drawback of always updating downwards is that for non-stationary noise it will in the end lead to estimates that are too low. The reason is similar to that for the minimum estimation, except that in this case there is no length defined for the sliding time window.
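The drawback of always allowing downward updates can be seen in a minimal sketch. The function name, the AR-style step and the `allow_up` flag (standing in for whatever restrictions a given estimator applies to upward updates) are assumptions for illustration.

```python
def restricted_update(bg, cur, alpha, allow_up):
    """Downward updates are always applied; upward updates only when
    the (hypothetical) restrictions permit them via `allow_up`."""
    if cur < bg or allow_up:
        return bg + alpha * (cur - bg)
    return bg  # upward update blocked: estimate is left unchanged


def run(inputs, bg, alpha, allow_up):
    """Apply the restricted update over a sequence of frame energies."""
    for cur in inputs:
        bg = restricted_update(bg, cur, alpha, allow_up)
    return bg
```

If the input alternates between 1.0 and 3.0 (mean 2.0) and upward updates are blocked throughout, an estimate starting at 2.0 with `alpha = 0.5` only ever moves down, reaching 1.125 after six frames. It keeps approaching the minimum of the input rather than its mean, which mirrors the behavior of a minimum estimator with an unbounded window.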
There is also the possibility of ending up in a background noise update deadlock. That is, the background logic has ended up in a state where it is not allowed to change the background noise estimate even though the input currently is noise-only input. This can happen if there is a sudden change in the noise characteristics or noise level, so that the input is no longer recognized as noise. For this reason there is usually a recovery algorithm. While this usually works for stationary noise, it may not always work for babble noise (which by nature is relatively close to speech in its characteristics).
While energy-based pause detectors can work well in good SNR conditions, they have limited functionality in low SNR conditions.