In speech coding systems used for conversational speech, it is common to use discontinuous transmission (DTX) to increase the efficiency of the encoding. The reason is that conversational speech contains many pauses embedded in the speech, e.g., while one person is talking the other one is listening. With DTX, the speech encoder is therefore active only about 50 percent of the time on average, and the rest can be encoded using comfort noise. Example codecs with such features are the Adaptive Multi-Rate Narrow Band (AMR NB) codec and the Enhanced Variable Rate Codec (EVRC). AMR NB uses DTX, whereas EVRC uses variable bit rate (VBR), where a Rate Determination Algorithm (RDA) decides which data rate to use for each frame, based on a voice activity detector (VAD) decision. In DTX operation, the active speech frames are coded using the codec, while frames between active regions are replaced with comfort noise. Comfort noise parameters are estimated in the encoder and sent to the decoder using a reduced frame rate and a lower bit rate than the one used for the active speech.
For high quality DTX operation, i.e. without degraded speech quality, it is important to detect the periods of speech in the input signal. This is typically done by the Voice Activity Detector (VAD), which is used for both DTX and RDA. FIG. 1 shows an overview block diagram of an example of a generalized VAD 100, which takes the input signal 111, typically divided into data frames of 5-30 ms depending on the implementation, as input and produces VAD decisions as output, typically one decision for each frame. That is, a VAD decision is a decision for each frame whether the frame contains speech or noise.
The preliminary decision, vad_prim 113, is in this example made by the primary voice detector 101 and is basically just a comparison of the features for the current frame and the background features (typically estimated from previous input frames), where a difference larger than a threshold causes an active primary decision. In other examples, the preliminary decision can be achieved in other ways, some of which are briefly discussed further below. The details of the internal operation of the primary voice detector are not of crucial importance for the present disclosure, and any primary voice detector producing a preliminary decision will be useful in the present context. The hangover addition block 102 is in the present example used to extend the primary decision based on past primary decisions to form the final decision, vad_flag 115. The reason for using hangover is mainly to reduce or remove the risk of mid-speech and back-end clipping of speech bursts. However, the hangover can also be used to avoid clipping in music passages.
It is also possible to add additional hangover for the purpose of DTX. In FIG. 1 this has been illustrated by the optional output vad_flag_dtx 117. It should be noted that it is not uncommon that there is just one output vad_flag but that the hangover logic uses other settings when the output is to be used for DTX. In this description, the two final decision outputs vad_flag 115 and vad_flag_dtx 117 will be separated in most embodiments, in order to simplify the description. However, solutions based on alternative hangover settings and one single output are also applicable.
There are two main reasons for using different final decision outputs or hangover settings depending on whether the VAD decision is used for DTX or not. First, from a speech quality point of view, there are higher requirements on the VAD when it is used for DTX. Therefore it is desirable to make sure that the speech has ended before switching to comfort noise. The second reason is that the additional hangover can be used for estimation of the characteristics of the background noise. For example, in AMR NB the first comfort noise estimate is done in the decoder based on the specific DTX hangover used.
As mentioned before, there are a number of different features that can be used for voice activity detection. One possible feature is to look just at the frame energy and compare this with a threshold to decide if the frame contains speech or not. This scheme works reasonably well for conditions where the Signal-to-Noise Ratio (SNR) is good, but not for low SNR cases. In low SNR conditions, other metrics are preferably used, e.g., comparing the characteristics of the speech and the noise signals. For real-time implementations, an additional requirement on VAD functionality is low computational complexity, which is reflected in the frequent use of sub-band SNR VADs in standard codecs. The sub-band VAD typically combines the SNRs of the different sub-bands into a common metric, which is compared to a threshold for the primary decision.
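The two kinds of primary decision described above can be sketched as follows. This is a minimal illustration only; the function names, the dB energy measure, the summation of linear sub-band SNRs and all threshold values are assumptions made for the example and are not taken from any particular codec.

```python
import math

def frame_energy_db(frame):
    """Mean energy of one frame in dB (a small floor avoids log of zero)."""
    energy = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(energy + 1e-12)

def energy_vad(frame, threshold_db):
    """Primary decision from frame energy alone (assumed fixed threshold)."""
    return frame_energy_db(frame) > threshold_db

def subband_snr_vad(band_energies, noise_energies, threshold):
    """Primary decision from a common metric built from sub-band SNRs.

    The metric here is simply the sum of the linear per-band SNRs, which is
    one assumed way of combining the sub-bands into a single value.
    """
    snr_sum = sum(e / max(n, 1e-12)
                  for e, n in zip(band_energies, noise_energies))
    return snr_sum > threshold

# Hypothetical usage: a loud frame and a quiet frame, each of 160 samples.
loud = [0.5] * 160
quiet = [0.001] * 160
print(energy_vad(loud, -20.0), energy_vad(quiet, -20.0))
```

In the sub-band variant, the noise energies would come from a background estimate, so the decision depends on the energy relative to the noise rather than on the absolute level.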
The VAD 100 comprises a feature extractor 106 providing the feature sub-band energy, and a background estimator 105, which provides sub-band energy estimates. For each frame, the VAD 100 calculates features. To identify active frames, the feature(s) for the current frame are compared with an estimate of how the feature “looks” for the background signal.
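As an illustration of how a background estimate of the sub-band energies might be maintained, the sketch below uses exponential smoothing that adapts only on frames judged inactive. The update rule, the smoothing factor alpha and the function name are assumptions for the example; practical background estimators are typically more elaborate.

```python
def update_background(background, band_energies, frame_is_active, alpha=0.9):
    """Return an updated list of background sub-band energy estimates.

    Assumed rule: leave the estimate unchanged on active frames, and move it
    towards the current sub-band energies on inactive frames.
    """
    if frame_is_active:
        return list(background)
    return [alpha * b + (1.0 - alpha) * e
            for b, e in zip(background, band_energies)]

# Hypothetical usage with two sub-bands.
background = [1.0, 2.0]
background = update_background(background, [3.0, 4.0], frame_is_active=False,
                               alpha=0.5)
print(background)
```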
The hangover addition block 102 is used to extend the VAD decision from the primary VAD based on past primary decisions to form the final VAD decision, “vad_flag”, i.e. older VAD decisions are also taken into account. As mentioned before, the reason for using hangover is mainly to reduce or remove the risk of mid-speech and back-end clipping of speech bursts. However, the hangover can also be used to avoid clipping in music passages. An operation controller 107 may adjust the threshold(s) for the primary detector and the length of the hangover addition according to the characteristics of the input signal.
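The basic effect of hangover addition can be sketched as a counter that keeps the final decision active for a number of frames after the last active primary decision. The function name and the fixed hangover length are assumptions for the example; in practice the length may be adapted by an operation controller.

```python
def add_hangover(vad_prim, hangover_frames):
    """Extend each run of active primary decisions by up to hangover_frames."""
    vad_flag = []
    remaining = 0  # hangover frames still available
    for active in vad_prim:
        if active:
            remaining = hangover_frames  # re-arm the hangover
            vad_flag.append(True)
        elif remaining > 0:
            remaining -= 1               # spend one hangover frame
            vad_flag.append(True)
        else:
            vad_flag.append(False)
    return vad_flag

# Hypothetical primary decisions, one per frame.
prim = [True, True, False, False, False, False, True, False]
print(add_hangover(prim, hangover_frames=2))
```

With a hangover of two frames, the two inactive frames following each active run are still flagged as active in the final decision, which protects trailing low-energy speech from being clipped.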
There are also known solutions where multiple features with different characteristics are used for the primary decision. For VADs based on the sub-band SNR principle, it has been shown that the introduction of a non-linearity in the sub-band SNR calculation, sometimes referred to as significance thresholds, can improve VAD performance for conditions with non-stationary noise, e.g., babble or office noise. However, in these cases there is typically one primary decision that is used for adding hangover, which may be adaptive to the input signal conditions, to form the final decision. Also, many VADs have an input energy threshold for silence detection, i.e., for low enough input levels the primary decision is forced to the inactive state.
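One possible form of such a non-linearity is sketched below: sub-band SNR values below a significance threshold are replaced by a floor value before the sub-bands are combined. The exact form of the non-linearity, the names sig_thresh and floor, and the numerical values are assumptions for the example.

```python
def combined_snr(band_energies, noise_energies, sig_thresh=None, floor=0.0):
    """Sum of per-band SNRs, optionally with a significance threshold.

    If sig_thresh is given, any band whose SNR is below it contributes only
    the floor value (an assumed form of the non-linearity).
    """
    total = 0.0
    for e, n in zip(band_energies, noise_energies):
        snr = e / max(n, 1e-12)
        if sig_thresh is not None and snr < sig_thresh:
            snr = floor
        total += snr
    return total

noise = [1.0, 1.0, 1.0, 1.0]
fluctuation = [1.5, 1.5, 1.5, 1.5]   # small increase in every band
speech_like = [8.0, 1.0, 1.0, 1.0]   # strong increase in one band

print(combined_snr(fluctuation, noise), combined_snr(fluctuation, noise, 2.0))
print(combined_snr(speech_like, noise), combined_snr(speech_like, noise, 2.0))
```

With a decision threshold of, say, 5, the small fluctuation spread over all bands would trigger the linear metric but not the thresholded one, while the speech-like frame exceeds the threshold in both cases.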
One example where significance thresholds were used to create a dual VAD solution is described in the published International patent application WO2008/143569 A1. In this case, the dual VADs were used to improve background noise update and music detection. However, only an aggressive primary VAD was used for the final vad_flag decision.
In WO2008/143569 A1, a metric based on a low-pass filtered short term activity was used for detecting the existence of music. This low-pass filtered metric provides a slowly varying quantity, suitable for finding more or less continuous types of sound, typical for e.g. music. An additional vad_music decision may then be provided to the hangover addition, making it possible to treat music sound in a particular manner.
There are several different ways to generate multiple primary VAD decisions. The most basic would be to use the same features as the original VAD but achieve a second primary decision using a second threshold. Another option is to switch VAD according to estimated SNR conditions, e.g., by using energy for high SNR conditions and switching to sub-band SNR operation for medium and low SNR conditions.
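Both options can be sketched briefly. The function names, the scalar feature, the threshold values and the 30 dB switching point are hypothetical and serve only to illustrate the principle.

```python
def dual_primary_decisions(feature, background, thr_low, thr_high):
    """Two primary decisions from the same feature using two thresholds.

    The lower threshold gives a more sensitive decision, and the higher
    threshold gives a more aggressive (less often active) decision.
    """
    diff = feature - background
    return diff > thr_low, diff > thr_high

def select_vad_mode(estimated_snr_db, high_snr_db=30.0):
    """Choose the detector type from the estimated SNR (assumed limit)."""
    return "energy" if estimated_snr_db >= high_snr_db else "subband_snr"

# Hypothetical usage.
print(dual_primary_decisions(10.0, 4.0, thr_low=3.0, thr_high=8.0))
print(select_vad_mode(35.0), select_vad_mode(10.0))
```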
In the published International patent application WO2011/049516 A1, a voice activity detector and a method therefor are disclosed. The voice activity detector is configured to detect voice activity in a received input signal. The VAD comprises combination logic configured to receive a signal from a primary voice detector of the VAD indicative of a primary VAD decision. The combination logic further receives at least one signal from an external VAD indicative of a voice activity decision of that external VAD. A processor combines the voice activity decisions indicated in the received signals to generate a modified primary VAD decision. The modified primary VAD decision is sent to a hangover addition unit.
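The combination of a primary decision with decisions from external VADs can be illustrated as below. The description above does not state which combination rule is used, so the logical OR and AND modes shown here, as well as the function name, are assumptions for the example only.

```python
def combine_decisions(vad_prim, external_decisions, mode="or"):
    """Combine a primary decision with external VAD decisions.

    mode="or":  active if any detector is active (assumed rule).
    mode="and": active only if all detectors are active (assumed rule).
    """
    decisions = [vad_prim, *external_decisions]
    if mode == "or":
        return any(decisions)
    if mode == "and":
        return all(decisions)
    raise ValueError("mode must be 'or' or 'and'")

# Hypothetical usage: the result would be passed on to hangover addition.
print(combine_decisions(False, [True, False], mode="or"))
print(combine_decisions(True, [True, False], mode="and"))
```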
One problem with hangover is to decide when and how much to use. From a speech quality point of view, addition of hangover is basically positive. However, it is not desirable to add too much hangover since any additional hangover will reduce the efficiency of the DTX solution. As it is not desirable to add hangover to every short burst of activity, there is usually a requirement of having a minimum number of active frames from the primary detector vad_prim before considering the addition of some hangover to create the final decision vad_flag. However, to avoid clipping in the speech it is desirable to keep this required number of active frames as low as possible.
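The requirement of a minimum number of active primary frames before hangover is considered can be sketched as follows. The function and parameter names are hypothetical, and the logic is a simplified illustration rather than the hangover logic of any specific codec.

```python
def add_hangover_after_burst(vad_prim, min_active_frames, hangover_frames):
    """Add hangover only after a burst of at least min_active_frames."""
    vad_flag = []
    burst_len = 0   # consecutive active primary frames so far
    remaining = 0   # hangover frames still available
    for active in vad_prim:
        if active:
            burst_len += 1
            if burst_len >= min_active_frames:
                remaining = hangover_frames  # burst long enough: arm hangover
            vad_flag.append(True)
        else:
            burst_len = 0
            if remaining > 0:
                remaining -= 1
                vad_flag.append(True)
            else:
                vad_flag.append(False)
    return vad_flag

# Hypothetical input: a one-frame burst followed by a three-frame burst.
prim = [True, False, False, False, True, True, True, False, False, False]
print(add_hangover_after_burst(prim, min_active_frames=3, hangover_frames=2))
```

In this example the isolated one-frame burst receives no hangover, while the three-frame burst is extended by two frames. Lowering min_active_frames reduces the risk of clipping but lets more short bursts trigger hangover.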
For non-stationary noise, a low number of required active frames might allow the noise itself to cause VAD events long enough to trigger the addition of hangover. In order to avoid excessive activity, such a solution therefore usually does not allow for long hangovers.
Another problem with requiring a number of active frames before adding hangover arises when a highly efficient VAD detects the short pauses within an utterance. In this case, an utterance has been detected correctly, but the speaker makes a slight pause before continuing. The VAD detects the pause and then once more requires a new period of active primary frames before any hangover at all is added. This can cause annoying artifacts with back-end clipping of trailing speech segments, such as utterances ending with unvoiced plosives.