Reliable and timely delivery of audio data is a critical component in applications such as interactive audio conferencing, broadcasting, and telephony. Audio transmission over packet-switched networks is susceptible to packet loss, which causes dropouts, and to queuing delay (sometimes referred to as network delay), which causes high latency. Low-latency transmission is particularly important for effective two-way or multiparty conversation. Once latency begins to exceed 250 milliseconds, the interactive give-and-take of natural conversational speech becomes more difficult. People may start to talk over each other and may have difficulty agreeing upon who should talk first. Additionally, the increased occurrence of these “doubletalk” scenarios forces acoustic echo cancellation to work harder to cancel out far-end audio.
A more subtle and serious problem results from queuing delay variation. In the present description, “queuing delay” refers to delays in audio data packets (or “audio packets”) arriving and being queued up at the destination node. Queuing delay can include network delay and/or other sources of packet delay, such as an uneven audio production rate at the source node. As the queuing delay varies, so does the arrival time of each individual audio packet. For example, if the audio signal is divided into equal 40 millisecond segments, the destination node typically expects the audio packets to arrive at regular 40 millisecond intervals. Queuing delay variation results in unsteady delivery of the individual audio packets to the destination node. For example, a late packet might arrive at the destination node after 100 milliseconds have elapsed (rather than the expected 40 milliseconds). Subsequent packets queued up behind it might then arrive at the destination node in a big burst after an additional 5 milliseconds. In order to present a continuous unbroken audio stream to the listener, a jitter buffer at the destination node can be used to absorb the delay variation. A jitter buffer is a specialized priority queue in which the incoming audio packets are ordered by increasing audio timestamp. Incoming audio packets, which may have unpredictable arrival times, are stored in the jitter buffer in sorted order. Audio packets are retrieved from the buffer at a steady rate and can be assembled into a continuous unbroken audio stream for playback. As long as the buffer never becomes empty, there will be no dropouts in the playback audio stream. The jitter buffer itself introduces some delay, referred to as latency, so it is desirable to keep the buffer size as small as possible.
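The priority-queue behavior described above can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions, not as the implementation of any particular system: the names `AudioPacket`, `JitterBuffer`, `push`, and `pop` are hypothetical, the buffer is unbounded, and the steady-rate playback timer is omitted.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class AudioPacket:
    # Hypothetical packet type: ordering uses the audio timestamp only.
    timestamp_ms: int
    payload: bytes = field(compare=False, default=b"")


class JitterBuffer:
    """Illustrative priority queue ordered by increasing audio timestamp."""

    def __init__(self):
        self._heap = []

    def push(self, packet):
        # Packets may arrive in any order; the heap keeps them sorted.
        heapq.heappush(self._heap, packet)

    def pop(self):
        # Return the earliest-timestamp packet, or None if the buffer is
        # empty (which a playback loop would experience as a dropout).
        if not self._heap:
            return None
        return heapq.heappop(self._heap)

    def __len__(self):
        return len(self._heap)


# Packets arrive out of order but are retrieved in timestamp order.
buf = JitterBuffer()
for ts in (80, 0, 40, 120):
    buf.push(AudioPacket(ts))
played = [buf.pop().timestamp_ms for _ in range(len(buf))]
print(played)
```

In a real receiver, `pop` would be called once per packet interval (for example, every 40 milliseconds) by the playback path, while `push` would be called whenever a packet arrives from the network.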
While it is desirable to have a small jitter buffer size to keep the latency low (for interactive conversation), there is also a need to be able to absorb potentially large queuing delays that may happen from time to time. If the jitter buffer size is kept small and there is a large delay spike, then many of the late-arriving packets will end up being discarded because they will not all fit in the jitter buffer. If the jitter buffer size is made large, then the buffer is able to absorb a large change in the queuing delay, but the latency will be too high for real-time applications. Accordingly, there is a packet loss versus latency tradeoff in the sizing of the jitter buffer.
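The tradeoff can be made concrete with a small simulation. The following Python sketch is a simplified model under assumptions that are not part of the description above: the function name `simulate`, the arrival pattern, and the capacities of 3 and 10 packets are all hypothetical, playback is assumed to start one packet interval after the first arrival, and the worst-case buffering delay is approximated as peak occupancy multiplied by the 40 millisecond packet duration.

```python
PACKET_MS = 40  # audio duration per packet, matching the 40 ms example


def simulate(arrival_times_ms, capacity):
    """Return (discarded_packets, peak_buffering_delay_ms).

    Simplified model: one packet is consumed per playback tick, and a
    packet that arrives while the buffer is full is discarded.
    """
    arrivals = sorted(arrival_times_ms)
    occupancy = discarded = peak = 0
    next_play = arrivals[0] + PACKET_MS  # assumed playback start
    for t in arrivals:
        # Consume one packet for every playback tick that has elapsed.
        while next_play <= t:
            occupancy = max(0, occupancy - 1)
            next_play += PACKET_MS
        if occupancy >= capacity:
            discarded += 1  # no room: the late packet is dropped
        else:
            occupancy += 1
            peak = max(peak, occupancy)
    return discarded, peak * PACKET_MS


# Hypothetical arrival pattern: five on-time packets, then a delay spike
# after which eight packets arrive in a burst, then on-time packets again.
on_time = [i * PACKET_MS for i in range(5)]
burst = [500 + k for k in range(8)]
tail = [i * PACKET_MS for i in range(13, 18)]
arrivals = on_time + burst + tail

small = simulate(arrivals, capacity=3)
large = simulate(arrivals, capacity=10)
print("small buffer:", small)  # (5, 120): five packets lost, 120 ms delay
print("large buffer:", large)  # (0, 320): no loss, but 320 ms delay
```

In this model the small buffer keeps the added delay at 120 milliseconds but discards five packets from the burst, whereas the large buffer keeps every packet at the cost of 320 milliseconds of buffered audio.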