In any IP-based (Internet Protocol) communication system there is a need to handle so-called delay jitter: a variation in the timing of packet delivery to the IP endpoints. This variation has several causes, for example varying processing time in routers under varying load, or high load in access types that use shared channels, such as HSPA (High-Speed Packet Access) and WLAN (Wireless Local Area Network). All IP-based systems show this kind of behavior, some more than others.
A speech decoder requires an even flow of packets delivered at regular intervals in order to process and render a speech signal. If this even rate cannot be maintained, an encoded speech frame delivered too soon after the preceding frame might be dropped, and if a speech frame is delivered too late, error concealment is used to render the speech instead. Both cases degrade speech quality.
In VoIP (Voice-over-IP) services, a so-called jitter buffer is used between the packet-receiving entity and the speech decoder to act as a speech frame rate equalizer. If the buffer is sufficiently deep, it absorbs the delay jitter, and encoded speech frames can be delivered to the speech decoder at an even rate.
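The rate-equalizing role of the buffer can be illustrated with a minimal sketch. The class name `JitterBuffer`, the target depth, and the start-up behavior are assumptions for illustration, not any particular product's design: frames arrive unevenly via `push`, while the decoder calls `pull` at a fixed frame rate (e.g. every 20 ms) and falls back to error concealment when `pull` returns nothing.

```python
from collections import deque

class JitterBuffer:
    """Minimal fixed-depth jitter buffer sketch (illustrative only)."""

    def __init__(self, target_depth=4):
        self.frames = deque()
        self.target_depth = target_depth  # frames to accumulate before playout starts
        self.started = False

    def push(self, frame):
        """Store an incoming encoded frame; arrival timing may be uneven."""
        self.frames.append(frame)

    def pull(self):
        """Called by the decoder at a fixed rate (e.g. every 20 ms).

        Returns the next frame in order, or None if the buffer has run
        dry - the decoder would then use error concealment instead.
        """
        if not self.started:
            if len(self.frames) < self.target_depth:
                return None  # still building up the initial buffer depth
            self.started = True
        return self.frames.popleft() if self.frames else None
```

The initial build-up is what trades delay for robustness: the first frame is held back until `target_depth` frames have arrived, so later arrival-time variation up to that depth goes unnoticed by the decoder.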
A drawback of a jitter buffer is that a buffer depth larger than the delay jitter introduces unnecessary delay. Since low conversational delay is a key feature of real-time communication services, this degrades the conversational quality. Hence, jitter buffer adaptation is used to change the depth of the buffer at runtime through a control mechanism. The input to this control mechanism is typically statistics gathered during the session, making it possible to tune the buffer depth to an optimal trade-off between the error concealment operations triggered by transport-link jitter and the conversational delay.
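One common way to turn session statistics into a target depth is to size the buffer to cover a chosen percentile of the observed inter-arrival gaps. The function below is a sketch under that assumption; the function name, the 20 ms frame duration, and the 95th-percentile choice are all illustrative, not taken from the text above.

```python
import math

def target_depth_frames(arrival_times_ms, frame_ms=20.0, percentile=0.95):
    """Derive a jitter-buffer depth (in frames) from packet arrival timestamps.

    The depth is chosen so that the given fraction of observed
    inter-arrival gaps fits inside the buffer, rounded up to whole frames.
    """
    gaps = sorted(b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:]))
    if not gaps:
        return 1  # no statistics yet: fall back to a minimal depth
    # Index of the chosen percentile in the sorted gap list.
    idx = min(int(percentile * len(gaps)), len(gaps) - 1)
    return max(1, math.ceil(gaps[idx] / frame_ms))
```

With perfectly even 20 ms arrivals this yields a depth of one frame; bursty arrivals with occasional 35 ms gaps push the depth to two frames, directly reflecting the trade-off between concealment risk and added delay.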
There are different mechanisms available to adapt the jitter buffer depth. They fall into two categories: frame-based adaptive mechanisms and sample-based adaptive mechanisms.
Frame-based mechanisms operate by inserting full speech frames into, or removing them from, the buffer. If this is done during silence periods (i.e. at the beginning or end of a talk spurt), the impact of the adaptation on media quality is minor. The major drawback occurs when speech activity is high, with few and/or short silence periods; adaptation is then forced to occur during active speech, with severe quality degradation as a result.
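A frame-based adaptation step can be sketched as follows. This is an assumed, simplified model (the function name, the use of a silence flag, and duplicating the last frame as the "inserted" frame are all illustrative): whole frames are added or dropped only while the signal is silent, which is why the quality cost is low there and high during active speech.

```python
def adapt_frame_based(buffer, target_depth, in_silence):
    """Grow or shrink the buffer by whole frames, but only during silence.

    Returns a new frame list; during active speech the buffer is left
    untouched, deferring adaptation to the next silence period.
    """
    if not in_silence:
        return list(buffer)  # adapting mid-speech would cause audible damage
    buffer = list(buffer)
    while len(buffer) < target_depth and buffer:
        buffer.append(buffer[-1])  # insert: repeat a silence frame to deepen
    while len(buffer) > target_depth:
        buffer.pop()               # remove: drop a silence frame to shallow
    return buffer
```

The guard on `in_silence` is the whole point of the frame-based approach, and also its weakness: with continuous speech the condition is rarely true, so the buffer depth cannot follow the jitter.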
Sample-based mechanisms operate by stretching and/or compressing the decoded speech signal in the time domain. Different similarity methods can be used to identify patterns in the speech signal that can be expanded or compressed to change its timeline. In this way, the time each speech frame represents can be changed, so that the speech decoder can vary the rate at which it requires delivery of encoded speech frames from the jitter buffer. The consequence is a buffer build-up or a buffer decrease: jitter buffer level adaptation.
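A single expansion step of such a similarity-based time-scale modification can be sketched with a plain overlap-add scheme. This is a generic illustration, not any specific standardized algorithm: a past segment that best matches the most recent samples is found by correlation, then overlap-added once with a cross-fade, lengthening the signal so the decoder can request the next frame later. The lag and overlap values are assumptions chosen for a short example.

```python
def expand_once(samples, min_lag=40, max_lag=120, overlap=20):
    """Lengthen a signal by repeating its best-matching recent segment.

    Requires len(samples) >= max_lag + overlap. Returns a longer list.
    """
    n = len(samples)
    tail = samples[n - overlap:]

    def score(lag):
        # Correlation between the tail and the segment one lag earlier:
        # a simple waveform-similarity measure.
        seg = samples[n - lag - overlap:n - lag]
        return sum(a * b for a, b in zip(seg, tail))

    lag = max(range(min_lag, max_lag + 1), key=score)
    repeat = samples[n - lag:]  # the segment to be played "again"
    # Cross-fade the repeated segment onto the tail to hide the splice.
    faded = [t * (1 - i / overlap) + r * (i / overlap)
             for i, (t, r) in enumerate(zip(tail, repeat[:overlap]))]
    return samples[:n - overlap] + faded + repeat[overlap:]
```

Compression works analogously by cross-fading two similar segments into one. Because the splice relies on finding a genuinely similar segment, the method behaves well on stationary signals and poorly on transients, as discussed next.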
Sample-based mechanisms also introduce media quality artifacts when performing the adaptation. They work well on stationary signals, but transients are more challenging. Further, if the signal has strongly periodic content, as most popular music does, the time-scaling operation is easily heard and can be quite annoying.