In packet-switched systems, such as the General Packet Radio Service (GPRS) wireless system, variations in data packet arrival times can have a significant impact on system performance. Reasons for the variation in packet arrival times include congestion of network resources and route variations between successive packets. When the packets contain voice data, as in a VoIP system, the buffering depth, or buffering delay, at the data packet receiver should be proportional to the variations in packet arrival times in order to obtain a continuous voice output.
A conventional fixed initial delay data buffer can remove the variations to some extent. It is, however, very likely that the network conditions will vary depending on congestion of network resources, the location of the receiving terminal and the specific implementation of the network components. With conventional (fixed delay) buffering it is impossible to react to changing network conditions. In addition, when the throughput is consistently low it is impossible to prevent receiver buffer underflows.
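The limitation of a fixed initial delay can be illustrated with a minimal sketch. The function name, the 20 ms frame duration and the arrival patterns below are assumptions chosen for illustration only; they are not taken from any particular receiver implementation. Each packet is given a playout deadline on a fixed schedule, and a packet that arrives after its deadline is counted as a buffer underflow.

```python
def count_underflows(arrival_ms, initial_delay_ms, frame_ms=20):
    """Count packets that arrive after their playout deadline.

    Hypothetical model: playout starts initial_delay_ms after the first
    packet arrives, and one frame is played every frame_ms thereafter.
    """
    playout_start = arrival_ms[0] + initial_delay_ms
    late = 0
    for i, arrived in enumerate(arrival_ms):
        deadline = playout_start + i * frame_ms
        if arrived > deadline:
            late += 1
    return late


# Illustrative jitter pattern (ms) added to a nominal 20 ms packet spacing.
jitter_ms = [0, 30, 5, 25, 10, 0, 30, 15]
jittered = [i * 20 + j for i, j in enumerate(jitter_ms)]

# Consistently low throughput: packets arrive every 25 ms, played every 20 ms.
slow = [i * 25 for i in range(100)]

print(count_underflows(jittered, initial_delay_ms=10))   # delay smaller than the jitter
print(count_underflows(jittered, initial_delay_ms=40))   # delay larger than the jitter
print(count_underflows(slow, initial_delay_ms=100))
print(count_underflows(slow, initial_delay_ms=200))
```

In this simplified model a fixed delay larger than the jitter removes the underflows for the jittered stream, whereas for the slow stream a larger fixed delay only postpones the first underflow; given a long enough stream, no fixed value prevents it.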
For these reasons some type of adaptive buffer control needs to be introduced if optimal operation is desired in terms of buffering delay and minimal interruptions in output voice. The buffer control should be capable of changing the buffering delay in as smooth a manner as possible. Stated another way, it is preferable to change the buffering delay at a steady rate over longer intervals rather than to decrease it, then increase it, then decrease it again, and so on, over short intervals.
At least two prior art buffer control techniques have required accurate knowledge of the network end-to-end delay: Ramjee R. (1994), “Adaptive Playout Mechanisms for Packetized Audio Applications in Wide-Area Networks”, in IEEE INFOCOM '94, The Conference on Computer Communications Proceedings, 12-13 June, Toronto, Canada, Vol. 2, pp. 680-688; and Liang Y. J. (2001), “Adaptive Playout Scheduling Using Time-Scale Modification in Packet Voice Communications”, in IEEE International Conference on Acoustics, Speech, and Signal Processing Proceedings, 7-11 May, Salt Lake City, Vol. 3, pp. 1445-1448. However, accurately obtaining the end-to-end delay would require synchronized clocks at the sending and receiving terminals, and an accurate measurement is therefore currently not possible with most commercially available terminals.
Another technique has been proposed that does not require this type of information: Telefonaktiebolaget LM Ericsson, “Adaptive Jitter Buffering”, WO 00/42749. This approach attempts to estimate network conditions over a fixed sampling interval. It may have some use when pauses or delay spikes occur at relatively short intervals (an interruption being considered the amount of time the buffer is empty or, more exactly, would have been empty if no time scaling had been introduced). If, however, the interval between successive pauses is greater than the sampling interval, there can be sampling intervals during which no pause occurs. In such an interval the control mechanism decreases the buffering delay, which it would not have done had a pause occurred. If the pause then occurs during the next sampling interval, it causes an undesired pause in speech due to buffer underflow. After the speech interruption the buffering delay is increased once again during the following sampling interval. As can be appreciated, this type of operation can readily lead to the buffering delay being decreased, increased, decreased and so on by the control mechanism, resulting in unnecessary fluctuations in the playout rate. In addition, some fixed number of packets (the sampling interval) must be accumulated before the buffering delay is changed. This results in a slower reaction time when packets arrive at a reduced rate, and potentially increases the possibility of interruptions in speech output because the buffering delay is increased only after the sampling interval. In the approach of WO 00/42749 the change in the buffer delay is accomplished by discarding or delaying packets; more specifically, the change is made during a silent period by adding or removing speech frames containing silence.
However, adding or removing only silence changes the time relationship between silent periods and speech periods, which can result in unnaturally long or unnaturally short silences between sentences and possibly also between words. The duration of the silent periods can then vary from sentence to sentence, or from word to word, and can give the speech an unnatural rhythm.
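The decrease/increase/decrease behaviour that can arise with a fixed sampling interval can be sketched as follows. This is a deliberately simplified illustration and is not the algorithm of WO 00/42749: the function name, the fixed step size and the rule "increase the delay if a pause was seen in the interval, otherwise decrease it" are all assumptions made for the example.

```python
def simulate_interval_control(pause_times_ms, sampling_interval_ms,
                              n_intervals, start_delay_ms, step_ms):
    """Return the buffering delay chosen at the end of each sampling interval.

    Hypothetical rule: if at least one pause fell inside the interval the
    delay is raised by step_ms, otherwise it is lowered by step_ms.
    """
    delay = start_delay_ms
    history = []
    for k in range(n_intervals):
        lo = k * sampling_interval_ms
        hi = lo + sampling_interval_ms
        pause_seen = any(lo <= t < hi for t in pause_times_ms)
        if pause_seen:
            delay += step_ms
        else:
            delay = max(0, delay - step_ms)
        history.append(delay)
    return history


# Pauses recur every 2000 ms, but conditions are sampled every 1000 ms,
# so every other sampling interval contains no pause.
history = simulate_interval_control(
    pause_times_ms=[1500, 3500, 5500],
    sampling_interval_ms=1000,
    n_intervals=6,
    start_delay_ms=100,
    step_ms=20,
)
print(history)
```

Under these assumptions the delay alternates between two values from one sampling interval to the next, which corresponds to the unnecessary fluctuation in playout rate discussed above.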
In general, the adaptive buffer control should be applied only when it is needed. The situation in a packet-switched network may well be such that the packets arrive in bursts, with a long delay (perhaps several seconds) between successive bursts. This is not a problem if the long-term average arrival interval is the same as the interval at which the packets were created. It means only that the physical buffer at the receiver side should be large enough to accommodate the variations. However, this should be considered in the design of the adaptive buffer control, since the playout rate of voice should not fluctuate annoyingly if the buffering delay fluctuates.
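The point that bursty arrival affects only the required physical buffer size, provided the long-term average rate matches the creation rate, can be illustrated with a short sketch. The function name, the 20 ms frame duration and the burst pattern (100 packets, i.e. two seconds of speech, arriving together every two seconds) are assumptions for illustration. Arrival times are assumed to be in non-decreasing order.

```python
def peak_buffer_occupancy(arrival_ms, initial_delay_ms=0, frame_ms=20):
    """Return the largest number of frames waiting in the buffer.

    Hypothetical model: playout starts initial_delay_ms after the first
    arrival and removes one frame from the buffer every frame_ms.
    """
    playout_start = arrival_ms[0] + initial_delay_ms
    peak = 0
    for n_arrived, t in enumerate(arrival_ms, start=1):
        if t < playout_start:
            frames_due = 0
        else:
            # Frames whose playout time is at or before t.
            frames_due = (t - playout_start) // frame_ms + 1
        peak = max(peak, n_arrived - frames_due)
    return peak


# Both streams deliver 200 packets in 4 seconds of speech.
steady = [i * 20 for i in range(200)]   # one packet every 20 ms
bursty = [0] * 100 + [2000] * 100       # two bursts of 100 packets

print(peak_buffer_occupancy(steady))
print(peak_buffer_occupancy(bursty))
```

In this model both streams play out without interruption at the same average rate, but the bursty stream requires the buffer to hold almost a full burst of frames at once.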
It can thus be appreciated that the current approaches to dealing with the variability of arrival times of data packets containing voice or video signals are not satisfactory, and do not adequately address the problems inherent in providing natural sounding voice in VoIP and other types of data packet-based network systems.
In the above-cited commonly assigned U.S. Patent Applications a buffer control approach uses estimates of the interruption of packet arrival. While that approach is well suited for use in many network environments, in some highly dynamic environments, such as an Enhanced Data rates for GSM Evolution (EDGE) environment where speech packets arrive in a bursty manner, a different type of buffer control mechanism can be preferable. For example, in this type of network, packets representing two seconds of speech can arrive during a very short interval, after which there may be a two-second pause in the arrival of speech-containing packets.