Current audio or speech coding standards such as 3GPP AMR (3GPP TS 26.071) and AMR-WB (3GPP TS 26.171), as well as various ITU-T speech coding standards (e.g. ITU-T Recommendation G.729, ITU-T Recommendation G.718), include a discontinuous transmission (DTX) scheme that suspends speech transmission during speech inactivity and instead transmits Silence Insertion Descriptor (SID) frames at a significantly lower bit rate and frame transmission rate than those used for encoded active speech. The purpose of DTX is to increase transmission efficiency, which in turn reduces the cost of speech communication and/or increases the number of simultaneous telephony connections possible in a given communication system.
Current state-of-the-art communication systems with DTX transmit regular speech coding frames during active speech segments. During inactive segments, e.g. speech pauses, these systems instead transmit SID frames, from which the receiver generates so-called comfort noise as a substitute for the inactivity signal. To achieve the best possible DTX efficiency, speech coding frames should be transmitted only during active speech and not during inactive segments such as speech pauses.
To distinguish between speech and inactivity, a voice activity detector (VAD) is used at the encoding, or sending, side. During frames corresponding to active speech segments, a VAD flag is raised. In practice, and especially for speech in background noise, this concept suffers from VAD classification errors. That is, periods of inactivity are classified as periods of active speech, and/or vice versa. One of the main problems of VADs is the detection of speech end points, i.e. the precise point in time at which the signal changes from active speech to inactivity. The main reason for this problem is that many speech offsets decay slowly before the speech really stops, so that the ending of a talk spurt may well be masked by background noise. As a consequence, such speech offsets may be classified as inactivity, in which case the corresponding signal frames are not encoded, transmitted and reconstructed as active speech but rather as a silence signal for which comfort noise frames are generated. Speech offsets (ends of speech periods) may then be perceived as clipped, which significantly reduces the quality and even the intelligibility of the reconstructed speech. In other words, this may lead to a bad user experience.
Current state-of-the-art codecs such as AMR and AMR-WB address this problem by simply delaying the start of DTX operation with comfort noise synthesis until a number of frames after the VAD-detected offset. This is done by DTX control logic at the encoder, which adds a time period during which the input signal is encoded as active speech even though the VAD flag indicates inactivity. This period is called the hangover period, and in the case of AMR and AMR-WB it is 7 frames long.
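The hangover control logic described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the control logic of the AMR or AMR-WB specifications: the function name `classify_frames` and the labels are hypothetical, and the sketch re-arms the hangover after every active frame without any condition on the length of the talk spurt.

```python
from typing import Iterable, List

HANGOVER_FRAMES = 7  # hangover length given in the text for AMR and AMR-WB


def classify_frames(vad_flags: Iterable[bool],
                    hangover: int = HANGOVER_FRAMES) -> List[str]:
    """Label each frame as 'SPEECH', 'HANGOVER' or 'DTX' (hypothetical labels)."""
    labels = []
    remaining = 0  # hangover frames still to be encoded as active speech
    for active in vad_flags:
        if active:
            labels.append("SPEECH")
            remaining = hangover      # assumption: re-arm on every active frame
        elif remaining > 0:
            labels.append("HANGOVER")  # VAD says inactive, still coded as speech
            remaining -= 1
        else:
            labels.append("DTX")       # SID frames / comfort noise operation
    return labels


# Three active frames followed by ten inactive ones.
labels = classify_frames([True] * 3 + [False] * 10)
```

In this example the first seven inactive frames are still encoded as speech, and DTX operation starts only with the eighth inactive frame.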
The hangover period is used not only to avoid speech back-end (or offset) clipping, but also for SID frame parameter analysis. In the case of AMR and AMR-WB, the first SID frame parameters after a (sufficiently long) talk spurt are not transmitted, but are instead computed by the decoder from the speech frame parameters received and stored during the hangover period (3GPP TS 26.092; 3GPP TS 26.192). Basing the SID frame parameter calculation on the speech frame parameters received during the hangover period saves transmission resources that would otherwise have been spent on SID frame transmission, and minimizes the effect of potential transmission errors on the first SID frame parameters.
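As a rough illustration of how a decoder might derive the first SID parameters locally, the sketch below stores one parameter per received speech frame and averages the most recent values. The class `HangoverHistory`, the choice of a single log-energy value per frame, and the plain arithmetic mean are all assumptions made for illustration; the actual parameter set and computation are defined in the cited specifications.

```python
from collections import deque
from statistics import fmean


class HangoverHistory:
    """Hypothetical decoder-side buffer of recent speech frame parameters."""

    def __init__(self, length: int = 7):
        # Keep only the most recent `length` frames (7 = hangover length in the text).
        self._log_energies = deque(maxlen=length)

    def store(self, log_energy: float) -> None:
        """Store the parameter decoded from one received speech frame."""
        self._log_energies.append(log_energy)

    def first_sid_log_energy(self) -> float:
        """Estimate the first comfort noise level without a transmitted SID frame."""
        if not self._log_energies:
            raise ValueError("no speech frames stored yet")
        return fmean(self._log_energies)  # assumption: simple mean over the buffer


history = HangoverHistory()
for value in range(1, 11):       # ten received frames with log energies 1..10
    history.store(float(value))
estimate = history.first_sid_log_energy()  # mean of the last seven values, 4..10
```

Because both ends see the same decoded speech frames, an estimate of this kind needs no additional bits on the channel, which is the saving the paragraph above refers to.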
The main problem with the hangover period in the described state-of-the-art solutions is that it compromises the efficiency of the DTX scheme. The hangover frames are encoded as active speech even though they are likely to be inactivity frames. If the speech comprises frequent separate talk spurts with inactivity periods in between, a significant number of frames are encoded at a high bit rate, i.e. as speech frames, rather than as comfort noise frames.
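The efficiency cost can be made concrete with a small back-of-the-envelope calculation. The function name and the example figures below are hypothetical, and the sketch assumes that every talk spurt is followed by a pause at least as long as the hangover period, so that each talk spurt adds a full hangover.

```python
def speech_coded_fraction(active_frames: int, talk_spurts: int,
                          total_frames: int, hangover: int = 7) -> float:
    """Fraction of all frames encoded as speech, including hangover frames.

    Assumes each talk spurt is followed by a pause of at least `hangover` frames.
    """
    hangover_frames = talk_spurts * hangover
    return (active_frames + hangover_frames) / total_frames


# Hypothetical conversation: 1000 frames, 400 of them active, in 20 talk spurts.
without_hangover = speech_coded_fraction(400, 20, 1000, hangover=0)
with_hangover = speech_coded_fraction(400, 20, 1000, hangover=7)
```

With these illustrative numbers, the share of frames encoded at the speech bit rate rises from 40 % to 54 %, and the increase grows with the number of separate talk spurts.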
A related problem arises if the hangover period is shortened in order to improve the efficiency of the DTX scheme. The shorter the hangover period, the more likely it is that it does not properly represent the inactivity noise signal. This may then lead to audible degradations of the comfort noise synthesis immediately at the end of talk spurts.
In AMR and AMR-WB, the encoder and the decoder keep track of the DTX hangover frames using a state machine that must be kept synchronized between the encoder and the decoder.