Encoded frames may be transmitted from a transmitter to a receiver via a packet switched network, such as the Internet.
For a transmission of voice, for example, speech frames may be encoded at a transmitter, transmitted via a packet switched network, and decoded again at a receiver for presentation to a user. During periods when the transmitter has no active speech to transmit, the normal transmission of speech frames may be switched off. During these periods, the transmitter may generate a set of comfort noise parameters describing the background noise that is present at the transmitter. These comfort noise parameters may be sent to the receiver in additional silence descriptor (SID) frames. The receiver may then use the received comfort noise parameters to synthesize an artificial, noise-like signal having characteristics close to those of the background noise present at the transmitter.
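As an illustration only, the switching between speech frames and SID frames may be sketched as follows. The function name `schedule_frames`, the frame labels, and the `sid_interval` parameter are hypothetical; the text above does not specify how often SID frames are sent, so the periodic update is an assumption of this sketch.

```python
def schedule_frames(vad_flags, sid_interval=8):
    """Return the frame type sent for each frame period.

    vad_flags: iterable of booleans, True when active speech is present.
    sid_interval: assumed (hypothetical) number of frame periods between
        SID frames within one silent period.
    """
    sent = []
    silent_run = 0  # frame periods elapsed in the current silent period
    for active in vad_flags:
        if active:
            sent.append("SPEECH")
            silent_run = 0
        else:
            # Send comfort noise parameters at the start of a silent period
            # and then every sid_interval periods; otherwise send nothing.
            sent.append("SID" if silent_run % sid_interval == 0 else "NO_DATA")
            silent_run += 1
    return sent
```

For example, `schedule_frames([True, False, False, False, True], sid_interval=2)` yields one SID frame at the start of the silent period and another two frame periods later.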
The nature of packet switched communications typically introduces variations to the transmission times of the packets, known as jitter, which is seen by the receiver as packets arriving at irregular intervals. In addition to packet loss conditions, network jitter is a major hurdle especially for conversational speech services that are provided by means of packet switched networks.
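One common way to quantify jitter, given here only as an example and not taken from the text above, is the running interarrival jitter estimate defined for RTP in RFC 3550. The sketch below assumes that send and receive timestamps are available in milliseconds; the function name `interarrival_jitter` is hypothetical.

```python
def interarrival_jitter(send_ms, recv_ms):
    """Running jitter estimate in the style of RFC 3550 (section 6.4.1).

    send_ms, recv_ms: send and arrival times of consecutive packets.
    Only time differences are used, so the two clocks need not be synchronized.
    """
    jitter = 0.0
    for i in range(1, len(send_ms)):
        # Difference between the spacing at the receiver and at the sender.
        d = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        # Exponential smoothing with gain 1/16, as in RFC 3550.
        jitter += (abs(d) - jitter) / 16.0
    return jitter
```

Packets that arrive with the same spacing with which they were sent give an estimate of zero; any deviation in spacing raises the estimate.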
More specifically, an audio playback component of an audio receiver operating in real-time requires a constant input to maintain a good sound quality. Even short interruptions should be prevented. Thus, if some packets comprising audio frames arrive only after the audio frames are needed for decoding and further processing, those packets and the included audio frames are treated as lost because they arrived too late. The audio decoder will perform error concealment to compensate for the audio signal carried in the lost frames. Extensive error concealment, however, also reduces the sound quality.
Typically, a jitter buffer is therefore utilized to hide the irregular packet arrival times and to provide a continuous input to the decoder and a subsequent audio playback component. To this end, the jitter buffer stores incoming audio frames for a predetermined amount of time. This time may be specified, for instance, upon reception of the first packet of a packet stream. A jitter buffer introduces, however, an additional delay component, since the received packets are stored before further processing. This increases the end-to-end delay.
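A minimal sketch of such a buffer with a fixed decoding schedule is given below. It assumes that the buffering delay is fixed when the first frame arrives and that one frame is decoded per frame period thereafter; the function name `late_flags` and its parameters are hypothetical.

```python
def late_flags(arrival_ms, frame_ms, buffer_delay_ms):
    """Flag frames that arrive after their scheduled decoding time.

    arrival_ms: arrival time of each frame, indexed by frame number.
    frame_ms: frame duration, which sets the fixed decoding interval.
    buffer_delay_ms: initial buffering delay applied to the first frame.
    """
    first_decode = arrival_ms[0] + buffer_delay_ms
    # Frame n is scheduled for decoding at first_decode + n * frame_ms.
    return [arrival > first_decode + n * frame_ms
            for n, arrival in enumerate(arrival_ms)]
```

With arrival times of 0, 25, 70 and 62 ms and 20 ms frames, a buffering delay of 20 ms leaves the third frame late, whereas a buffering delay of 40 ms makes all four frames available in time at the cost of 20 ms more end-to-end delay.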
A jitter buffer from which frames are extracted at a fixed rate, due to a fixed decoding timing, is inevitably a compromise between a low end-to-end delay and a small number of delayed frames, and finding an optimal tradeoff is not an easy task. Although there can be special environments and applications where the amount of expected jitter can be estimated to remain within predetermined limits, in general the jitter can vary from zero to hundreds of milliseconds, even within the same session. Using a fixed decoding timing with an initial buffering delay that is set to a sufficiently large value to cover the jitter of an expected worst case scenario would keep the number of delayed frames under control, but at the same time there is a risk of introducing an end-to-end delay that is too long to enable a natural conversation. Therefore, applying a fixed buffering scheme is not the optimal choice in most audio transmission applications operating over a packet switched network, for example in Voice over Internet Protocol (VoIP) over the 3GPP IP Multimedia Subsystem (IMS).
Adaptive jitter buffer management can be used for dynamically controlling the balance between a sufficiently short delay and a sufficiently small number of delayed frames. In this approach, the incoming packet stream is monitored constantly, and the buffering delay is adjusted according to observed changes in the delay behavior of the incoming packet stream. If the transmission delay appears to increase or the jitter is getting worse, the buffering delay is increased to meet the network conditions. In the opposite situation, the buffering delay can be reduced, which reduces the overall end-to-end delay.
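The adaptation may be sketched, under simplifying assumptions, as follows. The sketch sets the target buffering delay to the spread of the transit times observed over a sliding window of recent packets, clamped to configured limits. This particular rule, the class name `AdaptiveDelayEstimator`, and all parameter values are assumptions made for illustration and are not prescribed by the text above.

```python
from collections import deque


class AdaptiveDelayEstimator:
    """Track recent transit times and derive a target buffering delay."""

    def __init__(self, window=4, min_delay_ms=20, max_delay_ms=200):
        self.transits = deque(maxlen=window)  # most recent transit times
        self.min_delay_ms = min_delay_ms
        self.max_delay_ms = max_delay_ms

    def update(self, transit_ms):
        """Record one packet's transit time and return the new target delay.

        Only the spread (max - min) is used, so a constant clock offset
        between transmitter and receiver does not affect the result.
        """
        self.transits.append(transit_ms)
        spread = max(self.transits) - min(self.transits)
        return max(self.min_delay_ms, min(self.max_delay_ms, spread))
```

When a single packet with a much longer transit time is observed, the target delay rises at once; it falls back only after that packet has left the window, which reflects the usual preference for increasing the buffering delay quickly and decreasing it cautiously.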
The buffering delay can be modified, for example, by adding error concealment frames between received speech frames or by removing speech frames. Buffering delay modifications using such frame insertion or removal are most beneficial during inactive speech. Alternatively, the jitter buffer management may use time scaling to modify the speech frame duration and hence adapt the buffering delay.
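Frame insertion and removal may be sketched as follows. As a simplification, the sketch modifies the frame sequence only at frames marked as inactive speech; the function name `adjust_delay`, the `CONCEALMENT` marker and the `(label, active)` frame representation are hypothetical.

```python
CONCEALMENT = "CONCEAL"  # placeholder for an error concealment frame


def adjust_delay(frames, delta):
    """Lengthen or shorten a frame sequence by whole frames.

    frames: list of (label, active) tuples in playout order.
    delta: frames to insert (positive) or remove (negative).
    Returns the new sequence and the change actually applied, which may be
    smaller in magnitude than delta if too few inactive frames are available.
    """
    out = []
    remaining = delta
    for label, active in frames:
        if not active and remaining < 0:
            remaining += 1  # drop this inactive frame to reduce the delay
            continue
        out.append((label, active))
        if not active and remaining > 0:
            out.append((CONCEALMENT, False))  # insert to increase the delay
            remaining -= 1
    return out, delta - remaining
```

Each inserted frame increases the buffering delay by one frame duration, and each removed frame decreases it by the same amount.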
One of the challenges in adaptive jitter buffer management is the reliable prediction of the transmission characteristics. Although a jitter buffer adaptation based on the reception statistics of the most recent packets usually gives a reasonable estimate of the short-term network behavior, it is not possible, at least when applying a relatively strict buffering delay requirement, to prevent some frames from arriving after their scheduled decoding time, that is, too late for normal decoding. It might be desirable to adapt to an increasing delay, for example by using time scaling to increase the buffering time before any frames arrive late, but this is not always possible in practice.
In a particularly simple approach, frames arriving after their scheduled decoding time may be discarded and considered as lost frames. In this case, an error concealment operation will replace the missing voice data.
In a more advanced approach, “late frame processing” may be applied. In this approach, a late-arriving frame is used to update the internal state of the decoder, although the speech corresponding to this late-arriving frame has already been replaced by an error concealment operation. Using the late-arriving frame to update the state of the decoder to match the corresponding state of the encoder provides a quality benefit, since the error concealment operation is not able to update the decoder's internal state in a correct manner. Decoding based on a mismatched decoder state typically results in somewhat decreased voice quality also in the correctly received frames that follow the frame replaced by the error concealment operation.
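The difference between discarding a late frame and using it for late frame processing can be illustrated with a toy differential decoder, in which each frame carries an increment and the decoder state is the running sum. The decoder is not a real speech codec, and the sketch assumes that the late frame arrives before the next frame is decoded; the names `ToyDecoder` and `play` are hypothetical.

```python
class ToyDecoder:
    """Toy stateful decoder: the output is the running sum of frame values."""

    def __init__(self):
        self.state = 0

    def decode(self, delta):
        self.state += delta
        return self.state

    def conceal(self):
        # Concealment repeats the last output and cannot update the state
        # correctly, because the frame content is not available.
        return self.state

    def update_state_only(self, delta):
        # Late frame processing: update the state without producing output.
        self.state += delta


def play(deltas, late_index, use_late_frame):
    """Decode a stream in which frame late_index misses its decoding time."""
    decoder = ToyDecoder()
    output = []
    for i, delta in enumerate(deltas):
        if i == late_index:
            output.append(decoder.conceal())
            if use_late_frame:
                decoder.update_state_only(delta)
        else:
            output.append(decoder.decode(delta))
    return output
```

For the increments 1, 2, 3, 4 the error-free output is 1, 3, 6, 10. If the second frame is late and discarded, the output is 1, 1, 4, 8, so the mismatch persists in the frames that follow. If the late frame is used to update the state, the output is 1, 1, 6, 10, and only the concealed frame itself differs.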