One rapidly developing area of telecommunications is voice over IP (VOIP), in which speech is carried over networks using the Internet Protocol suite (e.g., Transmission Control Protocol/Internet Protocol, or TCP/IP). Developments in this area initially focused on making phone calls at very low cost. More recently, the technology has shown promise for new and different business applications. Because both speech and data use the same network and the same transmission protocol, it should be much easier to implement different information applications (e.g., call centers, call screening, unified messaging, etc.) with VOIP than with traditional telecommunication technologies. However, VOIP applications typically group voice or speech data into packets, which are sent over shared networks. Due to the nature of such networks, specific technical problems such as packet loss, packet delay, and jitter often occur.
Jitter can be described as the variation in the arrival times of consecutive packets. Typically, in real-time services such as voice transmission, a packet of encoded speech is sent every 20 ms, which corresponds to 160 samples at a sampling frequency of 8 kHz. Since delays vary throughout a network, different packets are delayed by different amounts. Moreover, the clocks of the transmitting and receiving terminal units are not synchronized to one another. In order to smooth out delay variations, a receiving system or the receiving module of a terminal (i.e., a terminal that generally functions both as a receiver and a transmitter) is usually provided with a jitter buffer.
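The arithmetic and the notion of arrival-time variation described above can be sketched as follows. This is a minimal illustration only, not a method taken from the text: the function names and the sample timestamps are hypothetical, and the sender and receiver clocks are assumed to differ by a constant offset, which cancels when consecutive transit times are subtracted.

```python
def samples_per_packet(packet_ms: float, sample_rate_hz: int) -> int:
    """Number of speech samples represented by one packet."""
    return round(packet_ms * sample_rate_hz / 1000)


def arrival_variation_ms(send_ms, recv_ms):
    """Change in transit time between consecutive packets, in ms.

    Assumption: the two clocks differ by a constant offset, so the
    offset cancels in the difference of consecutive transit times.
    """
    transit = [r - s for s, r in zip(send_ms, recv_ms)]
    return [later - earlier for earlier, later in zip(transit, transit[1:])]


# Hypothetical timestamps: packets sent every 20 ms, received unevenly.
send = [0, 20, 40, 60]
recv = [105, 130, 142, 171]

print(samples_per_packet(20, 8000))      # 160
print(arrival_variation_ms(send, recv))  # [5, -8, 9]
```

A positive value means a packet took longer to arrive than its predecessor; a negative value means it caught up.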
The size of the jitter buffer has an important bearing on the resulting speech quality. If the jitter buffer is too large, the one-way delay from mouth to ear will be too long, and the perceived quality will be degraded. For example, ITU-T Recommendations (e.g., Recommendation G.114) state that the one-way delay should be less than 150 ms for a regular telephony service.
If the jitter buffer is too small, however, packets delayed more than the size of the jitter buffer will arrive too late for any speech synthesis, and will be seen as lost. Therefore, an adaptive jitter buffer is needed to balance the size of the jitter buffer (i.e., delay at the receiving side) against packet loss.
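The trade-off between buffer size and packet loss can be illustrated with a small sketch. It assumes a deliberately simplified model that is not described in the text: the fastest packet is taken as the reference, and any packet delayed beyond that reference by more than the buffer size counts as lost. The function name and the delay values are hypothetical.

```python
def late_loss_fraction(delays_ms, buffer_ms):
    """Fraction of packets that arrive too late for playout.

    Assumed model: a packet is late if its delay exceeds the
    smallest observed delay by more than the buffer size.
    """
    base = min(delays_ms)
    late = sum(1 for d in delays_ms if d - base > buffer_ms)
    return late / len(delays_ms)


# Hypothetical one-way network delays in ms for ten packets.
delays = [50, 55, 62, 50, 90, 58, 51, 120, 53, 57]

for buffer_ms in (10, 40, 80):
    print(buffer_ms, late_loss_fraction(delays, buffer_ms))
# 10 0.3
# 40 0.1
# 80 0.0
```

In this toy data a larger buffer removes late losses entirely, but it also adds up to 80 ms to the mouth-to-ear delay, which is the balance an adaptive jitter buffer has to strike.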
Delay may also vary with time. In order to handle such variations, the size of the jitter buffer (i.e., the number of samples that the speech parameters within it would represent) needs to be adaptable. Jitter can be measured, and the jitter buffer adapted, in different ways. One conventional method measures jitter by checking the maximum variation in arrival times of the received packets. There are also various methods for performing the actual jitter buffer adaptation. For example, one conventional method uses the beginning of a talk-spurt to reset the jitter buffer to a specified level. The distance, in number of samples, between two consecutive talk-spurts (i.e., the silence period) is increased at the receiving side if the jitter buffer is too small. Likewise, the number of samples is decreased if the jitter buffer is too large. In this way the size of the jitter buffer is adapted. In IP telephony solutions using, for example, RTP (Real-time Transport Protocol), the marker flag in the RTP header identifies the beginning of a talk-spurt. Accordingly, the size of the jitter buffer can be changed when such a packet arrives at the receiving side.
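The conventional talk-spurt scheme described above can be sketched roughly as follows. The marker bit position (the most significant bit of the second header byte) follows the fixed RTP header layout; everything else, including the class name, method names, and the idea of returning a signed sample count, is a hypothetical illustration rather than the method of any particular implementation.

```python
def rtp_marker_set(packet: bytes) -> bool:
    """True if the marker (M) bit of the fixed RTP header is set."""
    if len(packet) < 12:
        raise ValueError("shorter than the 12-byte fixed RTP header")
    return bool(packet[1] & 0x80)


class TalkSpurtJitterBuffer:
    """Hypothetical buffer that changes size only at talk-spurt starts."""

    def __init__(self, target_samples: int):
        self.target_samples = target_samples   # level to reset to
        self.current_samples = target_samples  # level currently in use

    def request_size(self, new_target_samples: int) -> None:
        """Record a new level; it takes effect at the next talk-spurt."""
        self.target_samples = new_target_samples

    def on_packet(self, packet: bytes) -> int:
        """Return silence samples to add (positive) or drop (negative)."""
        if not rtp_marker_set(packet):
            return 0  # mid talk-spurt: no adaptation possible
        change = self.target_samples - self.current_samples
        self.current_samples = self.target_samples
        return change


# Hypothetical packets: version-2 header bytes followed by zero padding.
speech = bytes([0x80, 0x00]) + bytes(10)       # marker clear
spurt_start = bytes([0x80, 0x80]) + bytes(10)  # marker set

buf = TalkSpurtJitterBuffer(target_samples=320)
buf.request_size(480)
print(buf.on_packet(speech))       # 0
print(buf.on_packet(spurt_start))  # 160
```

Note that the requested change is deferred until a marked packet arrives, so nothing is adjusted while speech is ongoing.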
However, the above-mentioned conventional solution statically resets the jitter buffer to a certain level at the beginning of each talk-spurt. It does not, for example, cover the case in which network conditions change or in which a wrong sizing decision has been made. Furthermore, if the jitter buffer becomes too small, packets will be lost. Similarly, if the jitter buffer becomes too large, an unnecessary delay is introduced. In both cases, the perceived speech quality suffers, which is undesirable. Moreover, because the jitter buffer is adapted only during a silence period, these problems become even more severe during long periods of speech, in which no jitter buffer adaptation occurs.