For a transmission of voice, speech frames may be encoded at a transmitter, transmitted via a network, and decoded again at a receiver for presentation to a user.
During periods when the transmitter has no active speech to transmit, the normal transmission of speech frames may be switched off. This is referred to as a discontinuous transmission (DTX) mechanism. Discontinuous transmission saves transmission resources when there is no useful information to be transmitted. In a normal conversation, for instance, usually only one of the involved persons is talking at a time, implying that, on average, the signal in one direction contains active speech only during roughly 50% of the time. During these periods, the transmitter may generate a set of comfort noise parameters describing the background noise that is present at the transmitter. These comfort noise parameters may be sent to the receiver. The transmission of comfort noise parameters usually takes place at a reduced bit-rate and/or at a reduced transmission interval compared to the speech frames. The receiver may then use the received comfort noise parameters to synthesize an artificial, noise-like signal having characteristics close to those of the background noise present at the transmitter.
In the Adaptive Multi-Rate (AMR) speech codec and the AMR Wideband (AMR-WB) speech codec, for example, a new speech frame is generated at 20 ms intervals during periods of active speech. Once the end of an active speech period is detected, the discontinuous transmission mechanism keeps the encoder in the active state for seven more frames to form a hangover period. This period is used at a receiving end to prepare a background noise estimate, which is to be used as a basis for the comfort noise generation during the non-speech period. After the hangover period, the transmission is switched to the comfort noise state, during which updated comfort noise parameters are transmitted in silence descriptor (SID) frames at 160 ms intervals. At the beginning of a new session, the transmitter is set to the active state. This implies that at least the first seven frames of a new session are encoded and transmitted as speech, even if the audio signal does not include speech.
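The frame scheduling described above can be illustrated with a simplified sketch. It models only the behavior stated in this paragraph (20 ms frames, a seven-frame hangover, SID updates every 160 ms, and a session that starts in the active state). The function name `dtx_frame_types`, the frame labels, and the per-frame voice-activity flags are assumptions made for illustration; the actual codecs apply further rules that are not modeled here.

```python
from typing import Iterable, List

FRAME_MS = 20           # one frame every 20 ms (from the text)
HANGOVER_FRAMES = 7     # frames kept active after speech ends (from the text)
SID_INTERVAL_MS = 160   # SID update interval (from the text)
SID_INTERVAL_FRAMES = SID_INTERVAL_MS // FRAME_MS  # 8 frames


def dtx_frame_types(vad_flags: Iterable[bool]) -> List[str]:
    """Map per-frame voice-activity flags to 'SPEECH', 'SID' or 'NO_DATA'.

    Hypothetical, simplified model of the DTX behavior described in the text.
    """
    types: List[str] = []
    hangover_left = HANGOVER_FRAMES  # a new session starts in the active state
    frames_until_sid = 0             # 0 means: send a SID frame immediately
    for active in vad_flags:
        if active:
            # Active speech: transmit normally and re-arm the hangover.
            hangover_left = HANGOVER_FRAMES
            frames_until_sid = 0
            types.append("SPEECH")
        elif hangover_left > 0:
            # Hangover: still encoded and transmitted as speech.
            hangover_left -= 1
            frames_until_sid = 0
            types.append("SPEECH")
        elif frames_until_sid == 0:
            # Comfort noise state: send updated parameters in a SID frame.
            types.append("SID")
            frames_until_sid = SID_INTERVAL_FRAMES - 1
        else:
            # Comfort noise state between SID updates: nothing is sent.
            frames_until_sid -= 1
            types.append("NO_DATA")
    return types
```

For an input with no speech at all, this sketch yields seven speech frames followed by one SID frame in every eight frames, which matches the session-start behavior described above.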
Audio signals including speech frames and comfort noise parameters may be transmitted from a transmitter to a receiver for instance via a packet switched network, such as the Internet.
The nature of packet switched communications typically introduces variations to the transmission times of the packets, known as jitter, which is seen by the receiver as packets arriving at irregular intervals. In addition to packet loss conditions, network jitter is a major hurdle especially for conversational speech services that are provided by means of packet switched networks.
More specifically, an audio playback component of an audio receiver operating in real-time requires a constant input to maintain a good sound quality. Even short interruptions should be prevented. Thus, if some packets comprising audio frames arrive only after the audio frames are needed for decoding and further processing, those packets and the included audio frames are considered as lost. The audio decoder will perform error concealment to compensate for the audio signal carried in the lost frames. Extensive error concealment, however, reduces the sound quality as well.
Typically, a jitter buffer is therefore utilized to hide the irregular packet arrival times and to provide a continuous input to the decoder and a subsequent audio playback component. To this end, the jitter buffer stores incoming audio frames for a predetermined amount of time. This time may be specified for instance upon reception of the first packet of a packet stream. A jitter buffer introduces, however, an additional delay component, since the received packets are stored before further processing. This increases the end-to-end delay. A jitter buffer can be characterized by the average buffering delay and the resulting proportion of delayed frames among all received frames.
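The two characterizing quantities can be made concrete with a small sketch. It assumes that one frame is sent every 20 ms, that playback of the first frame starts a fixed buffering delay after its arrival, and that each later frame is due one frame interval after the previous one. The function name `fixed_jitter_buffer_stats` and the choice to average the buffering delay over on-time frames only are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def fixed_jitter_buffer_stats(
    arrival_ms: Sequence[float],
    buffer_delay_ms: float,
    frame_ms: float = 20.0,
) -> Tuple[float, float]:
    """Return (average buffering delay in ms, proportion of late frames).

    arrival_ms[i] is the arrival time of frame i at the receiver.
    Hypothetical helper; the averaging convention is an assumption.
    """
    if not arrival_ms:
        raise ValueError("at least one frame is required")
    # The buffering time is fixed upon reception of the first frame.
    first_playout = arrival_ms[0] + buffer_delay_ms
    waits = []
    late = 0
    for i, arrival in enumerate(arrival_ms):
        playout = first_playout + i * frame_ms  # scheduled decoding time
        if arrival > playout:
            late += 1                           # arrived too late: treated as lost
        else:
            waits.append(playout - arrival)     # time spent in the buffer
    average_wait = sum(waits) / len(waits) if waits else 0.0
    return average_wait, late / len(arrival_ms)
```

With arrival times of 0, 25, 40, 95 and 80 ms, a 20 ms buffering delay gives an average buffering delay of 18.75 ms with one late frame out of five, whereas a 40 ms buffering delay removes the late frame at the cost of a 32 ms average buffering delay.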
A jitter buffer using a fixed delay is inevitably a compromise between a low end-to-end delay and a low number of delayed frames, and finding an optimal tradeoff is not an easy task. Although there can be special environments and applications where the amount of expected jitter can be estimated to remain within predetermined limits, in general the jitter can vary from zero to hundreds of milliseconds, even within the same session. Using a fixed delay that is set to a sufficiently large value to cover the jitter according to an expected worst case scenario would keep the number of delayed frames under control, but at the same time there is a risk of introducing an end-to-end delay that is too long to enable a natural conversation. Therefore, applying fixed buffering is not the optimal choice in most audio transmission applications operating over a packet switched network.
Adaptive jitter buffer management can be used for dynamically controlling the balance between a sufficiently short delay and a sufficiently low number of delayed frames. In this approach, the incoming packet stream is monitored constantly, and the buffering delay is adjusted according to observed changes in the delay behavior of the incoming packet stream. If the transmission delay seems to increase or the jitter is getting worse, the buffering delay is increased to meet the network conditions. In the opposite situation, the buffering delay can be reduced, and hence the overall end-to-end delay is minimized.
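One possible form of such an adjustment is sketched below. The text does not prescribe a particular estimator; this sketch borrows the smoothed interarrival jitter estimate known from RTP (RFC 3550), in which the estimate moves one sixteenth of the way toward each new delay difference, and derives a target buffering delay from it. The class name `AdaptiveDelayEstimator`, the safety factor, and the delay limits are assumptions chosen for illustration only.

```python
from typing import Optional


class AdaptiveDelayEstimator:
    """Hypothetical sketch: derive a target buffering delay from observed jitter."""

    def __init__(
        self,
        min_delay_ms: float = 20.0,    # assumed lower limit
        max_delay_ms: float = 400.0,   # assumed upper limit
        safety_factor: float = 3.0,    # assumed multiple of the jitter estimate
    ) -> None:
        self.min_delay_ms = min_delay_ms
        self.max_delay_ms = max_delay_ms
        self.safety_factor = safety_factor
        self.jitter_ms = 0.0
        self._prev_transit_ms: Optional[float] = None

    def on_packet(self, send_ms: float, arrival_ms: float) -> float:
        """Update the jitter estimate and return the new target delay in ms."""
        # A constant clock offset between sender and receiver cancels out,
        # because only differences of transit times are used.
        transit_ms = arrival_ms - send_ms
        if self._prev_transit_ms is not None:
            diff = abs(transit_ms - self._prev_transit_ms)
            # Smoothed estimate with gain 1/16, as in RFC 3550.
            self.jitter_ms += (diff - self.jitter_ms) / 16.0
        self._prev_transit_ms = transit_ms
        return self.target_delay_ms()

    def target_delay_ms(self) -> float:
        """More jitter gives a longer buffering delay, within the limits."""
        wanted = self.safety_factor * self.jitter_ms
        return min(self.max_delay_ms, max(self.min_delay_ms, wanted))
```

In this sketch the target delay rises while consecutive packets show strongly varying transit times and falls back toward the lower limit once the transit times become steady again. A real implementation would also have to decide when and how the playout schedule is actually shifted, which is outside the scope of this sketch.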
One of the challenges in adaptive jitter buffer management is the reliable estimation of the transmission characteristics.