The goal of any communication system is to transmit real-time signals from one location to another. Currently, two types of networks can be used to convey real-time media signals: circuit switched networks and packet switched networks. A circuit switched network provides a dedicated point-to-point communication path between two or more users. A media signal is transmitted over the dedicated circuit, received by the other side, and played out to a user. A packet switched network, in contrast, divides a message to be sent into data packets, which are sent individually over the network and reassembled at a final location before being delivered to a user. To ensure proper reassembly of the blocks of data at the receiving end, various control data, such as sequence and verification information, may be appended to each packet in the form of a packet header, or otherwise associated with the packet. At the receiving end, the packets are then reassembled and transmitted to an end user in a format compatible with the user's equipment.
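The packetization and reassembly described above can be illustrated with a minimal sketch. The `Packet` layout, the function names, and the use of CRC-32 as the verification information are assumptions made for illustration only; they are not taken from any particular protocol.

```python
from dataclasses import dataclass
import zlib


@dataclass(frozen=True)
class Packet:
    seq: int        # sequence information carried in the (hypothetical) header
    checksum: int   # verification information: CRC-32 of the payload (assumed)
    payload: bytes


def packetize(message: bytes, size: int) -> list[Packet]:
    """Divide a message into payloads of at most `size` bytes, each with a header."""
    chunks = [message[i:i + size] for i in range(0, len(message), size)]
    return [Packet(seq, zlib.crc32(chunk), chunk) for seq, chunk in enumerate(chunks)]


def reassemble(packets: list[Packet]) -> bytes:
    """Restore sequence order, verify each payload, and join the payloads."""
    ordered = sorted(packets, key=lambda p: p.seq)
    for p in ordered:
        if zlib.crc32(p.payload) != p.checksum:
            raise ValueError(f"packet {p.seq} failed verification")
    return b"".join(p.payload for p in ordered)
```

Because each packet carries its own sequence number, the receiver can rebuild the message even if the network delivers the packets out of order.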
Packet switched networks are now competing with conventional circuit switched networks to provide interactive communications services such as telephony and multimedia conferencing via the Internet. This technology is presently known as internet telephony, IP telephony or, where voice is involved, Voice over IP (VoIP). In VoIP networks, audio signals are digitized into frames and transmitted as packets over an IP network. The transmitter may send these packets at a constant transmission rate. An appropriately configured receiver will receive the packets, extract the frames of digital data, and convert the digital data into analog output using a digital-to-analog (D/A) converter. Although the packets are transmitted at a constant data rate, they will not necessarily arrive at their destination at a constant rate. Rather, because of variable delays through the network and the different transmission paths taken by packets, there is a packet delay variation (pdv) at the receiver. Because digital audio data (for example, a digitized voice conversation) must be played out at a constant output rate in order to reconstruct a high quality audio signal, the delay variation between packets is undesirable.
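As a small numeric illustration of packet delay variation, the sketch below takes the variation to be the change in transit delay between consecutive packets. That definition, the function name, and the sample timestamps (in milliseconds) are assumptions chosen for the example.

```python
def delay_variation(send_times, arrival_times):
    """Return the absolute change in transit delay between consecutive packets."""
    delays = [arrival - sent for sent, arrival in zip(send_times, arrival_times)]
    return [abs(later - earlier) for earlier, later in zip(delays, delays[1:])]


# Packets sent every 20 ms (constant rate) but delayed unevenly by the network.
send_ms = [0, 20, 40, 60]
arrival_ms = [50, 75, 88, 115]
variation_ms = delay_variation(send_ms, arrival_ms)
```

Here the transit delays are 50, 55, 48 and 55 ms, so the packets arrive 25, 13 and 27 ms apart even though they were sent exactly 20 ms apart.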
A known solution to this problem is to implement a jitter buffer in the receiver. A jitter buffer is a buffer that stores frames as they are received from the network and outputs them at a constant output rate, thus absorbing the packet delay variation. As long as the average rate of reception of the packets is equal to the constant output rate, the jitter buffer allows the packets to be output at the constant output rate even though they are not necessarily received at a constant rate. The jitter buffer by its nature introduces delay into the communication path; that is, there is a delay while the packet travels through the jitter buffer until it is processed, or ‘played out’, at the receiver. The delay between receipt of the packet and the play out of the packet is referred to hereinafter as the play out time offset of the receiver.
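A fixed-depth jitter buffer of the kind just described might be sketched as follows. The class name, the `depth` parameter, and the assumption that sequence numbers start at 0 are all hypothetical; `pop` is assumed to be called once per frame interval by the play out clock.

```python
class JitterBuffer:
    """Minimal fixed-depth jitter buffer keyed by sequence number (illustrative)."""

    def __init__(self, depth: int):
        self.depth = depth      # frames to accumulate before play out starts
        self._frames = {}       # seq -> frame, filled as packets arrive
        self._next_seq = 0      # next sequence number due for play out (assumed to start at 0)
        self._started = False

    def push(self, seq: int, frame) -> bool:
        """Store an arriving frame; return False if it arrived too late to play."""
        if seq < self._next_seq:
            return False
        self._frames[seq] = frame
        return True

    def pop(self):
        """Return the frame due now, or None while filling or when it is missing."""
        if not self._started:
            if len(self._frames) < self.depth:
                return None     # still building up the initial play out offset
            self._started = True
        frame = self._frames.pop(self._next_seq, None)
        self._next_seq += 1     # the play out clock advances even across a gap
        return frame
```

In this sketch the play out time offset is roughly `depth` frame intervals: a larger `depth` tolerates more delay variation at the cost of more added delay.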
In the context of interactive real-time communications such as internet telephony, delay is particularly problematic, since participants in such communications expect the network connection to simulate immediate, in-person interaction, without delay. When the end-to-end delay exceeds a maximum tolerable value (a matter of design choice), conversation participants may face the unsettling experience of having to wait some time after one person speaks before the other person hears what was spoken. Consequently, in most telecommunications networks carrying real-time media signals, there is a need to reduce or minimize the total end-to-end transmission delay. One method of doing so is to control the size of the jitter buffer.
The jitter buffer should be large enough to store a sufficient number of packets to ensure that the slowest data in an audio sequence has sufficient time to arrive at the receiver before playback. Too small a jitter buffer can give rise to packet loss, which produces audible pops and clicks and other distortion. However, a large jitter buffer increases the play out time offset, resulting in echo and talker overlap in the received signal.
Several different methods have been used to adapt the jitter buffer size during operation to ensure optimum capture of packets while minimizing delay. For example, in one system, the variation of packet level in the jitter buffer is measured over a long period of time, and the jitter buffer size is incrementally adapted to match the calculated jitter. Such a system works well in transmission networks that provide consistent packet performance, such as Asynchronous Transfer Mode (ATM) networks, but is not as useful in systems with highly variable packet inter-arrival times.
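One way to picture the fill-level approach is the sketch below. It assumes the jitter is estimated as the peak-to-peak variation of the buffer fill level over a fixed window of observations, and that the size moves one step per window toward that estimate; the class name, parameters, and limits are hypothetical.

```python
class FillLevelAdapter:
    """Step the buffer size toward the fill-level variation seen per window (illustrative)."""

    def __init__(self, size: int, window: int, step: int = 1,
                 min_size: int = 1, max_size: int = 50):
        self.size = size
        self.window = window        # number of observations per measurement period
        self.step = step
        self.min_size = min_size
        self.max_size = max_size
        self._levels = []

    def observe(self, level: int) -> int:
        """Record the current fill level; adapt once per full window."""
        self._levels.append(level)
        if len(self._levels) == self.window:
            jitter = max(self._levels) - min(self._levels)   # peak-to-peak variation
            if jitter > self.size:
                self.size = min(self.size + self.step, self.max_size)
            elif jitter < self.size:
                self.size = max(self.size - self.step, self.min_size)
            self._levels.clear()    # start a fresh measurement period
        return self.size
```

Because the size changes by at most one step per window, a scheme of this shape reacts slowly to sudden bursts of delay, which is consistent with its being better suited to networks with consistent packet performance.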
A second approach for adapting the jitter buffer size is to count the number of packets that arrive late, and create a ratio of these packets to the number of packets successfully processed. The ratio is used to adjust the jitter buffer to target a predetermined allowable packet ratio. This approach works best with networks having highly variable packet inter-arrival times, such as IP networks, but is not as efficient in systems having consistent packet arrival times, such as ATM networks.
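The late-packet approach can be sketched in the same style. The sketch assumes the ratio is evaluated over a fixed number of packets and that the buffer grows or shrinks by one step when the ratio is above or below the target; the class name, parameters, and limits are hypothetical.

```python
class LateRatioAdapter:
    """Adjust the buffer size from the ratio of late to processed packets (illustrative)."""

    def __init__(self, size: int, target_ratio: float, window: int,
                 step: int = 1, min_size: int = 1, max_size: int = 50):
        self.size = size
        self.target_ratio = target_ratio   # predetermined allowable ratio
        self.window = window               # packets per evaluation period
        self.step = step
        self.min_size = min_size
        self.max_size = max_size
        self._late = 0
        self._processed = 0

    def record(self, late: bool) -> int:
        """Count one packet as late or successfully processed; adapt once per window."""
        if late:
            self._late += 1
        else:
            self._processed += 1
        if self._late + self._processed == self.window:
            # Ratio of late packets to successfully processed packets.
            ratio = self._late / self._processed if self._processed else float("inf")
            if ratio > self.target_ratio:
                self.size = min(self.size + self.step, self.max_size)   # too many late: grow
            elif ratio < self.target_ratio:
                self.size = max(self.size - self.step, self.min_size)   # margin to spare: shrink
            self._late = self._processed = 0
        return self.size
```

For example, with a target ratio of 0.1 and a window of 10 packets, 2 late packets against 8 processed give a ratio of 0.25, so the buffer grows by one step.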
It would be desirable to identify a method and system for adaptively selecting jitter buffer size that is useful in a variety of transmission systems.