It is known to divide audio signals into temporal segments, time slots, frames or the like, and to encode the frames for transmission. The audio frames may be encoded in an encoder at a transmitter site, transmitted via a network, and decoded in a decoder at a receiver site for presentation to a user. The audio signals to be transmitted may comprise segments that carry relevant information and thus should be encoded and transmitted, such as, for example, speech, voice, music, DTMF tones, or other sounds, as well as segments that are considered irrelevant, i.e. background noise, silence, background voices, or other noise, and thus should not be encoded and transmitted. Typically, information tones (such as DTMF tones) and music signals are content that should be classified as relevant, i.e. active (to be transmitted). Background noise, on the other hand, is mostly classified as not relevant, i.e. non-active (not to be transmitted).
To this end, methods are already known which try to distinguish the relevant segments within the audio signal from the segments which are considered irrelevant.
One example of such an encoding method is the voice activity detection (VAD) algorithm, which is one of the major components affecting the overall system capacity. The VAD algorithm classifies each input frame either as active voice/speech (to be transmitted) or as non-active voice/speech (not to be transmitted).
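As an illustration only, the per-frame classification described above can be sketched with a simple energy threshold. This is not the AMR or AMR-WB VAD algorithm; the function names, the threshold of -40 dB, and the sample values are all assumptions chosen for the example.

```python
import math

def frame_energy_db(frame):
    """Mean power of one frame in dB relative to a full-scale amplitude of 1.0."""
    power = sum(s * s for s in frame) / len(frame)
    # Small offset avoids log10(0) for an all-zero frame.
    return 10.0 * math.log10(power + 1e-12)

def classify_frames(frames, threshold_db=-40.0):
    """Return True (active, to be transmitted) or False (non-active) per frame.

    The fixed threshold is a hypothetical value; practical VAD algorithms
    adapt to the background noise level and use additional features.
    """
    return [frame_energy_db(f) > threshold_db for f in frames]

# Hypothetical frames: two loud ones around a quiet one.
loud = [0.5, -0.5, 0.5, -0.5]
quiet = [0.001, -0.001, 0.001, -0.001]
decisions = classify_frames([loud, quiet, loud])
```

Here the loud frames fall above the assumed threshold and are classified as active, while the quiet frame is classified as non-active.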
During periods when the transmitter has active speech to transmit, the VAD algorithm provides information about speech activity, and the encoder encodes the corresponding segments with an encoding algorithm in order to reduce transmission bandwidth.
During periods when the transmitter has no active speech to transmit, the normal transmission of speech frames may be switched off. Instead, the encoder may generate during these periods a set of comfort noise parameters describing the background noise that is present at the transmitter. These comfort noise parameters may be sent to the receiver, usually at a reduced bit-rate and/or at a reduced transmission interval compared to the speech frames. The receiver uses the comfort noise (CN) parameters to synthesize an artificial, noise-like signal having characteristics close to those of the background noise signal present at the transmitter.
This alternation of speech and non-speech periods is called Discontinuous Transmission (DTX).
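The alternation between speech frames and comfort noise updates can be sketched as a simple scheduling loop. This is a minimal sketch under stated assumptions: the labels "SPEECH", "CN_UPDATE" and "NO_TX", the function name, and the update interval are hypothetical and are not taken from any codec specification.

```python
def dtx_schedule(vad_flags, cn_interval=8):
    """Map per-frame VAD decisions to a transmission plan.

    Active frames are sent as speech. During a run of non-active frames,
    comfort noise parameters are sent for the first frame and then for every
    cn_interval-th frame; nothing is sent for the frames in between.
    cn_interval is an assumed value for illustration.
    """
    schedule = []
    inactive_run = 0  # number of consecutive non-active frames seen so far
    for active in vad_flags:
        if active:
            inactive_run = 0
            schedule.append("SPEECH")
        else:
            if inactive_run % cn_interval == 0:
                schedule.append("CN_UPDATE")
            else:
                schedule.append("NO_TX")
            inactive_run += 1
    return schedule

flags = [True, True, False, False, False, True]
plan = dtx_schedule(flags, cn_interval=2)
```

With these assumed inputs, the three non-active frames are mapped to a comfort noise update, a frame without transmission, and a second comfort noise update, after which speech transmission resumes.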
Current VAD algorithms are considered relatively conservative regarding the voice activity detection. This results in a relatively high voice activity factor (VAF), i.e. the percentage of input segments classified as active speech. The AMR and AMR-WB VAD algorithms provide relatively low VAF values in normal operating conditions.
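For illustration, the voice activity factor as defined above, i.e. the percentage of input segments classified as active speech, can be computed from a sequence of per-frame VAD decisions. The function name and the sample decisions are assumptions made for this example.

```python
def voice_activity_factor(vad_flags):
    """Percentage of frames classified as active speech (0.0 for no frames)."""
    if not vad_flags:
        return 0.0
    active = sum(1 for flag in vad_flags if flag)
    return 100.0 * active / len(vad_flags)

# Hypothetical decisions: three of four frames classified as active.
vaf = voice_activity_factor([True, False, True, True])
```

A more conservative VAD, which classifies more frames as active, raises this value and with it the share of frames that are transmitted as speech.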
However, reliable detection of speech is a complicated task, especially in challenging background noise conditions (e.g. babble noise at a low Signal-to-Noise Ratio (SNR) or an interfering talker in the background). The known VAD algorithms may lead to relatively high VAF values in such conditions. While this is not a problem for speech quality, it may be a capacity problem in terms of inefficient usage of radio resources.
However, when employing VAD algorithms which classify fewer segments as active segments, i.e. which result in a lower voice activity factor, the amount of clipping may increase, causing very annoying audible effects for the end-user. In challenging background noise conditions, the clipping typically occurs where the actual speech signal is almost inaudible due to strong background noise. When the codec then switches to CN, even for a short period, in the middle of an active speech region, the end-user will easily hear this as an annoying artifact. Although the CN partly mitigates the switching effect, the change in the signal characteristics when switching from active speech to CN (or vice versa) in noisy conditions is in most cases clearly audible. The reason for this is that CN is only a rough approximation of the real background noise, and therefore its difference from the background noise present in the frames that are received and decoded as active speech is obvious, especially when the highest coding modes of the AMR encoder are used. The clipping of speech and the contrast between the CN and the real background noise can be very annoying to the listener.