Frame buffers are deployed, inter alia, in video and audio processing, for instance in Voice over IP (VoIP) systems.
In general, network jitter and packet loss can degrade quality, for example in conversational speech services in packet switched networks such as the Internet. The nature of packet switched communications can introduce variations in the transmission times of the packets (containing frames), known as jitter, which the receiver sees as packets arriving at irregular intervals. However, an audio playback device requires constant input to maintain good audio quality, and no interruptions can be allowed. Thus, if some packets/frames arrive at the receiver after they are required for playback, the decoder may have to consider those frames as lost and perform error concealment.
Typically, a fixed buffer to manage network jitter can be utilised to store incoming frames for a predetermined amount of time (specified e.g. upon reception of the first packet of a stream) to hide the irregular arrival times and provide constant input to the decoder and playback components.
This approach is illustrated in FIG. 1, which shows timelines of a transmitter TX, a receiver RX and a decoder DEC, respectively. On the uppermost timeline, the regularly spaced transmission points of time of frames are illustrated by vertical lines. Similarly, in the middle timeline, the reception points of time of the transmitted frames are illustrated by vertical lines, wherein the reception points of time, for instance due to network jitter, are irregularly spaced. Finally, the lowermost timeline shows the decoding points of time of the frames, which are, due to the fact that the decoder DEC needs to provide constant input to the playback device, regularly spaced. Therein, the relationship between transmission, reception and decoding points of time of the same frame is illustrated by means of dashed arrows. As can be seen by comparing the reception point of time of the first received frame in the middle timeline with its associated decoding point of time in the lowermost timeline, the jitter buffer introduces an additional delay increasing the end-to-end delay since the received frames are stored in the jitter buffer before the decoding process.
A shortcoming of this basic approach is that a jitter buffer management scheme requiring fixed playback timing is inevitably a compromise between low enough buffering delay and low enough number of delayed frames, and finding an optimal trade-off can be a difficult task. For example, in the situation shown in FIG. 1, the frame sent at 100 ms arrives at the receiver RX after it is needed for further processing by decoder DEC, and therefore it needs to be replaced with error concealment (i.e. a “late loss” occurs, shown as a dashed vertical line).
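The late-loss condition described above can be sketched in a few lines. This is a minimal illustration, not the scheme of FIG. 1 itself: the function name `late_loss_flags`, the assumption that the first frame is always received, and all timing values are hypothetical.

```python
from typing import List, Optional


def late_loss_flags(send_ms: List[float],
                    recv_ms: List[Optional[float]],
                    buffer_delay_ms: float) -> List[bool]:
    """Flag frames that miss their decoding time under fixed playback timing.

    Assumes the first frame is received; None marks a frame lost in the network.
    """
    # The first frame is decoded buffer_delay_ms after its arrival; every later
    # frame keeps the same spacing from it as at the transmitter.
    first_decode = recv_ms[0] + buffer_delay_ms
    flags = []
    for sent, received in zip(send_ms, recv_ms):
        decode_at = first_decode + (sent - send_ms[0])
        flags.append(received is None or received > decode_at)
    return flags


# Hypothetical timings in ms (not read off FIG. 1).
send = [0, 20, 40, 60, 80, 100]
recv = [15, 38, 52, 81, 97, 160]
print(late_loss_flags(send, recv, buffer_delay_ms=20))
print(late_loss_flags(send, recv, buffer_delay_ms=50))
```

With these assumed values, a 20 ms buffering delay loses the last frame as a late loss, while 50 ms avoids it at the cost of 30 ms of extra delay on every frame, which is the trade-off discussed above.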
There may exist special environments and applications where the amount of expected jitter can be estimated to remain between predetermined limits. In general, however, the network delay associated with jitter can vary from a negligible amount to hundreds of milliseconds within the same session. Using a jitter buffer management scheme requiring fixed playback timing, with an initial buffering delay set to a value large enough to cover the jitter of an expected worst case scenario, would keep the number of delayed packets under control. However, at the same time there may arise a risk of introducing an end-to-end delay which is too long to enable a natural conversation. Consequently, applying a fixed jitter buffer management scheme may not be a practical choice in most audio transmission applications operating over a packet switched network, e.g. in VoIP over the 3GPP IP Multimedia Subsystem (IMS).
In contrast to a fixed jitter buffer management scheme, an adaptive jitter buffer management scheme can be used to dynamically control the balance between short enough delay and low enough number of delayed frames. In this approach, an entity controlling the jitter buffer constantly monitors the incoming packet stream and adjusts the buffering delay (or buffering time, these terms are used interchangeably) according to observed changes in the network delay behaviour. If the transmission delay seems to increase or the jitter becomes worse, the buffering delay may need to be increased to meet the network conditions. In the opposite situation, where the transmission delay seems to decrease, the buffering delay can be reduced, and hence, the overall end-to-end delay can be minimised.
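One simple way to turn observed network behaviour into a buffering delay is to track the spread of recent transit times. The sketch below is only one possible estimator, chosen for illustration; the name `estimate_buffering_delay`, the window length and the optional margin are assumptions, not part of any scheme described here.

```python
from typing import Sequence


def estimate_buffering_delay(transit_ms: Sequence[float],
                             window: int = 50,
                             margin_ms: float = 0.0) -> float:
    """Estimate a buffering delay from the spread of recent transit times.

    transit_ms holds (reception time - transmission timestamp) per packet.
    A constant clock offset between sender and receiver cancels out because
    only the difference between the largest and smallest value is used.
    """
    recent = list(transit_ms)[-window:]
    if not recent:
        raise ValueError("at least one transit time is required")
    return (max(recent) - min(recent)) + margin_ms


# Hypothetical transit times in ms: the last packet is delayed heavily.
history = [15, 18, 12, 21, 17]
print(estimate_buffering_delay(history, window=5))
print(estimate_buffering_delay(history + [60], window=5))
```

When jitter grows, as with the final 60 ms sample, the estimate rises and the buffering delay would be increased; when the spread shrinks again, the estimate falls once the outlier leaves the window.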
One of the challenges in adaptive jitter buffer management is reliable estimation, or more precisely prediction, of the transmission characteristics. Although adaptation based on the reception statistics of the most recent packets usually gives a reasonable estimate of the short-term network behaviour, it may be impossible to avoid the fact that some frames arrive after their scheduled decoding time, i.e. too late for normal decoding, especially when applying a relatively strict buffering delay requirement.
Jitter buffer adaptation during active speech requires additional processing to shorten or extend the speech signal (i.e. time scaling, also known as time warping) to maintain good voice quality and intelligibility. For example, suitable methods are disclosed in documents WO 03/021830 and U.S. 2006/0056383. To avoid complex time scaling, a commonly used method for jitter buffer management is to perform the adaptation during comfort noise signal periods, typically at the beginning of a new talk spurt (i.e. at a speech onset). This approach can be expected to provide low complexity adaptation functionality with high quality, since the comfort noise signal does not carry information that is important for intelligibility or actual voice quality. A minor drawback of onset adaptive jitter buffer management is that even though the network analyser detects changes in the delay characteristics, the jitter buffer adaptation needs to wait for the next speech onset to take place. Nevertheless, jitter buffer management solutions apply the onset adaptive approach as part of the adaptation functionality, where the basic approach is to re-estimate the required buffering time and perform adaptation at each speech onset, while only urgent adaptation steps are taken during active speech.
A basic (adaptive) jitter buffer management approach uses the statistics on the current number of frames in the buffer as an indication of the buffer status. If, for example, the number of frames in the buffer falls below a predetermined (or adaptively determined) threshold, an adaptation step to increase the buffering time can be initiated to take place at the next speech onset in order to decrease the risk of subsequent frames arriving too late. If, however, the number of frames in the buffer grows above another predetermined (or adaptively determined) threshold, an adaptation step to decrease the buffering time can be initiated to reduce delay for improved interactivity.
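The two-threshold rule above can be written as a small decision function. This is a sketch under stated assumptions: the function name, the string return values and the threshold values in the example are hypothetical.

```python
def adaptation_request(frames_in_buffer: int, low: int, high: int) -> str:
    """Map the current buffer occupancy to an adaptation request.

    "increase": occupancy fell below the lower threshold, so the buffering
                time should grow (to be applied at the next speech onset).
    "decrease": occupancy grew above the upper threshold, so the buffering
                time can shrink to reduce delay.
    "hold":     occupancy is within the thresholds.
    """
    if low > high:
        raise ValueError("low threshold must not exceed high threshold")
    if frames_in_buffer < low:
        return "increase"
    if frames_in_buffer > high:
        return "decrease"
    return "hold"


# Hypothetical thresholds: keep between 2 and 5 frames buffered.
for occupancy in (1, 3, 7):
    print(occupancy, adaptation_request(occupancy, low=2, high=5))
```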
An alternative approach may use statistics computed over several frames instead of considering only single frames. For instance, the number of instances when the number of frames in the buffer falls below or above predetermined limits over an analysis window consisting of several frames (or time corresponding to several frames' duration) may be counted. Equally well, the average number of frames in the buffer may be considered over an analysis window as the indication of buffer status which is used for controlling the jitter buffer management operation.
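The windowed statistics mentioned above (counts of limit violations and the average occupancy) can be kept with a bounded queue. The class name `OccupancyWindow`, its methods and the example limits are assumptions made for this sketch.

```python
from collections import deque


class OccupancyWindow:
    """Buffer-occupancy statistics over the most recent `size` observations."""

    def __init__(self, size: int, low: int, high: int) -> None:
        self.samples = deque(maxlen=size)  # oldest sample drops out automatically
        self.low = low
        self.high = high

    def update(self, frames_in_buffer: int) -> None:
        self.samples.append(frames_in_buffer)

    def below_count(self) -> int:
        """Number of observations in the window below the lower limit."""
        return sum(1 for n in self.samples if n < self.low)

    def above_count(self) -> int:
        """Number of observations in the window above the upper limit."""
        return sum(1 for n in self.samples if n > self.high)

    def average(self) -> float:
        """Average number of frames in the buffer over the window."""
        if not self.samples:
            raise ValueError("no observations yet")
        return sum(self.samples) / len(self.samples)


# Hypothetical occupancies; the window keeps only the last four values.
window = OccupancyWindow(size=4, low=2, high=5)
for occupancy in [1, 3, 6, 1, 2]:
    window.update(occupancy)
print(window.below_count(), window.above_count(), window.average())
```

Either the counts or the average could then serve as the buffer status indication in place of a single-frame reading.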
When considering the buffer status indication based on the number of frames, one possible way to apply adaptation is to estimate the target buffer level as a number of frames, and then to wait for the selected number of frames to arrive and accumulate in the buffer before decoding (or playing back) the first frame of a talk spurt at the point a speech onset is detected.
This is illustrated in FIG. 2. In this example, the speech onset frame is the one transmitted at 20 ms (and received at approximately 36 ms), and decoding & playback is started after three frames (the first one to be decoded plus two others providing an assumed 40 ms of jitter protection) have arrived at the buffer. In this example, the approach would result in a buffering delay of approximately 52 ms for the first frame of the talk spurt.
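The frame-count rule can be sketched as follows. Only the 36 ms onset arrival and the resulting delay of roughly 52 ms come from the example above; the function name and the remaining arrival times are assumptions chosen to be consistent with those two figures.

```python
from typing import List, Optional


def onset_start_by_frame_count(arrivals_ms: List[Optional[float]],
                               target_frames: int) -> Optional[float]:
    """Return the time at which decoding of a talk spurt starts.

    Decoding starts as soon as target_frames frames have accumulated in the
    buffer. None marks a frame that never arrived. Returns None if too few
    frames were received.
    """
    received = sorted(t for t in arrivals_ms if t is not None)
    if len(received) < target_frames:
        return None
    return received[target_frames - 1]


# Onset frame received at 36 ms (from the text); later arrivals are assumed.
arrivals = [36.0, 61.0, 88.0, 95.0]
start = onset_start_by_frame_count(arrivals, target_frames=3)
print(start, start - arrivals[0])  # start time and delay of the onset frame
```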
A considerably different jitter buffer management approach is to indicate the buffer status based on the buffering time. The buffering time may for instance be an observed buffering time recorded when a frame is passed to the decoder (or playback device), or a predicted buffering time computed when a frame is received. Methods for predicting a buffering time are for instance disclosed in document WO 2006/044696. As in the aforementioned approaches, which are based on the number of frames stored in the buffer, the buffer adaptation approach based on the buffering time may also consider statistics computed over several frames instead of keeping track of single frames only.
Jitter buffer management based on the buffering time may for instance be performed by estimating the required buffering time (e.g. in milliseconds), and applying this buffering time for the first frame of a talk spurt when a speech onset is encountered.
The example in FIG. 3 illustrates this approach. In this case, the estimated buffering delay of 40 ms is applied for the first frame of the talk spurt simply by directly delaying its decoding & playback by 40 ms.
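The time-based rule amounts to shifting the decoding schedule of the talk spurt by the estimated delay. In this sketch the function name is hypothetical, the 20 ms frame duration is the one implied by the earlier example (two frames giving 40 ms of protection), and the 36 ms onset arrival is borrowed from that example as an assumption.

```python
from typing import List

FRAME_MS = 20.0  # assumed frame duration


def decode_schedule(onset_arrival_ms: float,
                    buffering_delay_ms: float,
                    num_frames: int) -> List[float]:
    """Decoding times for a talk spurt under time-based onset handling.

    The onset frame is decoded buffering_delay_ms after it arrives; the
    following frames are decoded at regular FRAME_MS intervals after that.
    """
    start = onset_arrival_ms + buffering_delay_ms
    return [start + k * FRAME_MS for k in range(num_frames)]


print(decode_schedule(onset_arrival_ms=36.0, buffering_delay_ms=40.0, num_frames=4))
```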
The previously described jitter buffer management approaches for speech onset handling can be expected to provide, on average, approximately equal performance (both in terms of delay and late-loss). However, there can be special cases where these approaches may fail, either by introducing unnecessarily high buffering delay until the next adaptation step, or by providing inadequate jitter protection, leading to an unnecessarily high rate of late-loss frames.
For instance, problematic cases for jitter buffer management based on the number of frames in the buffer are scenarios where the frame triggering the start of the decoding & playback is an “outlier”, i.e. it arrives too early or too late (with respect to the subsequent frames). In the former case, which is illustrated in FIG. 4, buffer adaptation may provide too short a buffering time, causing a high late-loss rate, while in the latter case, which is illustrated in FIG. 5, the buffering time would be unnecessarily high (and there may also be a risk of buffer overflow) until the next adaptation point.
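The effect of an outlier trigger frame can be reproduced with a small simulation. Everything here is illustrative: the function names, the 20 ms frame duration, the target of three frames and the arrival times are assumptions and are not taken from FIG. 4 or FIG. 5.

```python
from typing import List

FRAME_MS = 20.0  # assumed frame duration


def start_by_count(arrivals_ms: List[float], target_frames: int) -> float:
    """Decoding starts when target_frames frames have arrived."""
    return sorted(arrivals_ms)[target_frames - 1]


def late_frames(arrivals_ms: List[float], start_ms: float) -> List[int]:
    """Indices of frames arriving after their decoding time start + k*FRAME_MS."""
    return [k for k, t in enumerate(arrivals_ms) if t > start_ms + k * FRAME_MS]


# Third frame (the trigger) arrives unusually early: playback starts too soon.
early = [36.0, 57.0, 58.0, 120.0, 139.0, 141.0]
# Third frame arrives unusually late: playback starts later than necessary.
late = [36.0, 57.0, 110.0, 112.0, 139.0, 141.0]

for name, arrivals in (("early trigger", early), ("late trigger", late)):
    start = start_by_count(arrivals, target_frames=3)
    print(name, "start:", start,
          "onset delay:", start - arrivals[0],
          "late frames:", late_frames(arrivals, start))
```

With these assumed arrivals, the early trigger starts playback at 58 ms and two later frames miss their decoding time, whereas the late trigger avoids late losses but holds the onset frame for 74 ms.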
If the “trigger frame”, denoting the frame triggering the start of the decoding and playback, is lost in the transmission path, the jitter buffer manager will wait until the next frame is successfully received before starting the decoding & playback. In the case of a long loss burst, this may increase the buffering time significantly.
On the other hand, for jitter buffer management based on the buffering time, the critical frame is the actual speech onset frame: if it arrives “early”, the buffering time may be too short (see FIG. 6), causing increased late-loss rate. If it arrives “late”, the buffering time may become unnecessarily long (see FIG. 7).
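The sensitivity of the time-based rule to the arrival of the onset frame can be shown in the same way. The function names, the 20 ms frame duration, the 40 ms buffering delay and the arrival times are assumptions for illustration and are not taken from FIG. 6 or FIG. 7.

```python
from typing import List

FRAME_MS = 20.0  # assumed frame duration


def late_frames(arrivals_ms: List[float], delay_ms: float) -> List[int]:
    """Indices of frames missing their decoding time under the time-based rule."""
    start = arrivals_ms[0] + delay_ms  # onset frame is delayed by delay_ms
    return [k for k, t in enumerate(arrivals_ms) if t > start + k * FRAME_MS]


def buffering_times(arrivals_ms: List[float], delay_ms: float) -> List[float]:
    """Time each frame spends in the buffer (negative means it arrived late)."""
    start = arrivals_ms[0] + delay_ms
    return [start + k * FRAME_MS - t for k, t in enumerate(arrivals_ms)]


# Same later arrivals, but the onset frame arrives early (20 ms) or late (75 ms).
early_onset = [20.0, 85.0, 95.0, 125.0, 130.0]
late_onset = [75.0, 85.0, 95.0, 125.0, 130.0]

print(late_frames(early_onset, delay_ms=40.0))
print(late_frames(late_onset, delay_ms=40.0))
print(buffering_times(late_onset, delay_ms=40.0))
```

In this assumed scenario the early onset frame leads to two late losses, while the late onset frame avoids them but keeps every frame in the buffer for 40 ms or more.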