In a conventional circuit switched telephony network, each telephone exchange receives a synchronization clock signal that is distributed hierarchically to every node in the network, thereby achieving synchronized communication. However, such hierarchical synchronization is not always possible in a packet switched network, e.g. when personal computers communicate over the Internet.
In e.g. IP (Internet Protocol) telephony, voice samples are forwarded from a sending communicating device to a receiving communicating device, and the latency, or delay, of the connection defines the time it takes for a data packet to be transported between the sending communicating device and the receiving communicating device. The packets are stored temporarily in buffers in the nodes of the packet switched network, and the varying storage time in the buffers leads to variations in the delay, which are referred to as delay jitter. While a circuit switched network is normally designed to minimize the jitter, a packet switched network is designed to maximize the link utilization by queuing the packets in the buffers for subsequent transmission, which adds to the delay jitter.
Protocols used to carry voice signals over the IP network are commonly referred to as VoIP (Voice over Internet Protocol), allowing a unified network to be used for multiple services. An incoming IP-phone call may be automatically routed to an IP-phone located anywhere, and a user is thereby allowed to make and receive phone calls using the same phone number while travelling, regardless of location. However, VoIP involves drawbacks, such as delay, packet loss and the above-described delay jitter. The delay jitter may lead to buffer underrun, in which a play-out buffer runs out of voice data to play because the next voice packet has not arrived, but the consequences of the jitter are normally reduced by a de-jittering buffer located in the receiving communicating device. The de-jittering buffer adds a variable extra delay before the audio samples of the packet are played out, to keep the overall delay time constant, or slowly varying, in order to minimize the overall delay at some given packet loss rate depending on the current network conditions. Thereby, the occurrence of buffer underrun due to delay jitter may be avoided, but the overall delay is increased.
Additionally, the clock frequency controlling the sample reception in a receiving communicating device is not exactly the same as the clock frequency of the sending communicating device, due to differences in e.g. the quartz crystal oscillators of the clocks. The difference between the transmitting clock frequency, fTx, and the receiving clock frequency, fRx, of the samples is commonly referred to as clock skew. The accuracy A of a clock is often expressed in ppm (parts per million), and in existing IP-telephony connections, the clock skew is normally less than 60 ppm, but may in some cases reach 300 ppm. In a data packet containing M samples, the time period of the packet is M/f, where f is the clock frequency in question, and the actual difference between the packet time period in the transmitter and in the receiver can be expressed as τ=(M/fTx)−(M/fRx), which is sometimes called the clock skew parameter, but is hereinafter referred to as the clock skew, τ, which may have a positive or a negative value.
The difference between the point of time indicated by the clock in the receiver and by the clock in the transmitter will accumulate over time, and cause problems. If the clock frequency of the transmitter is higher than that of the receiver, the clock skew, τ, is negative and the receiver will continuously receive more samples than it is able to play out following its own clock frequency, which will lead to an overrun of the play-out buffer in the receiver. If, however, the clock frequency in the transmitter is lower than in the receiver, the clock skew is positive and the play-out buffer in the receiver will at certain intervals run out of audio samples to play out, i.e. an underrun.
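The relationship between the clock frequencies, the sign of the clock skew and the resulting buffer condition can be illustrated by a minimal Python sketch. The function names and the example values (160 samples per packet, a nominal 8000 Hz sample rate and a 60 ppm frequency offset) are assumptions chosen for illustration only and are not taken from any particular system.

```python
def clock_skew(m_samples: int, f_tx: float, f_rx: float) -> float:
    """Return tau = M/f_tx - M/f_rx, in seconds per packet."""
    return m_samples / f_tx - m_samples / f_rx


def buffer_effect(tau: float) -> str:
    """Map the sign of tau to the play-out buffer condition described above."""
    if tau < 0:
        return "overrun"   # transmitter clock faster than receiver clock
    if tau > 0:
        return "underrun"  # transmitter clock slower than receiver clock
    return "none"


# Hypothetical example: transmitter clock 60 ppm faster than the receiver clock.
f_rx = 8000.0
f_tx = f_rx * (1 + 60e-6)
tau = clock_skew(160, f_tx, f_rx)
print(tau, buffer_effect(tau))
```

With these assumed values, each 20 ms packet contributes a timing error of roughly 1.2 microseconds, which is negligible per packet but accumulates over the duration of a call.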
A receiver may have a play-out buffer accommodating only the samples of one packet, and those samples are read from the buffer at play-out. If a new packet arrives before the previous packet has been played out from the buffer, the previous packet will be overwritten before play-out, resulting in a packet slip. Similarly, if the data of the buffer has been played out before the arrival of the next packet, there is no data to read, which will also result in a packet slip.
Thus, both overrun and underrun of the play-out buffer will cause a packet slip to occur at regular time intervals, when the accumulated error in the expected packet arrival time reaches the packet time period M/fRx of the receiving communicating device, where M is the number of audio samples in the packet. The time period between the packet slips is inversely dependent on the size of the clock skew, since a large clock skew will lead to more frequent packet slips. Following from the above-described relationships, the mean value TPER of the time period between the packet slips may be calculated as the absolute value of 1/(1−fTx/fRx). The influence of the jitter results in an actual time period between the packet slips that varies around this mean value. Thus, the delay jitter and the clock skew will both contribute to a synchronization error. The effects of the delay jitter may be avoided by a de-jittering buffer, as described above, but the clock skew will still result in overrun or underrun of the play-out buffer.
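The mean interval between packet slips can be illustrated numerically by the following minimal Python sketch. The expression |1/(1−fTx/fRx)| is dimensionless, so the sketch reads it as a number of packet time periods and multiplies by M/fRx to obtain seconds; this reading, the function names and the example values are assumptions made for illustration.

```python
import math


def mean_packets_between_slips(f_tx: float, f_rx: float) -> float:
    """Return |1 / (1 - f_tx/f_rx)|, read here as a count of packet periods."""
    denominator = 1.0 - f_tx / f_rx
    if denominator == 0.0:
        return math.inf  # identical clocks: no slip caused by clock skew
    return abs(1.0 / denominator)


def mean_slip_interval_seconds(m_samples: int, f_tx: float, f_rx: float) -> float:
    """Convert the packet count to seconds using the receiver packet period M/f_rx."""
    return mean_packets_between_slips(f_tx, f_rx) * m_samples / f_rx


# Hypothetical example: 160 samples per packet, 8000 Hz nominal, 60 ppm offset.
f_rx = 8000.0
f_tx = f_rx * (1 + 60e-6)
print(mean_packets_between_slips(f_tx, f_rx))
print(mean_slip_interval_seconds(160, f_tx, f_rx))
```

Under these assumptions a slip would occur about once every 16,667 packets, i.e. roughly every 333 seconds, while a 300 ppm offset would shorten the interval by a factor of five.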
The effects of the clock skew can be reduced by a continuous adjustment of the clock frequencies, e.g. by the use of GPS (Global Positioning System). However, this is not always possible, e.g. when the audio sample rate is controlled by an independent hardware clock in the audio card of a standard personal computer, or when an IP network and a PSTN (Public Switched Telephone Network) are interconnected by a Media Gateway, in which case the play-out rate of the audio samples is always synchronized with the PSTN clock. A so-called Media Gateway is commonly used to connect different types of communication networks, and is able to convert data from the format required for one type of network to the format required for another.
Another method of compensating for the clock skew is by signal processing, e.g. by duplicating a sample value in the play-out buffer each time the receiver clock has gained one sample time relative to the transmitter clock, and correspondingly deleting one sample each time the receiver clock has lost one sample time. However, this leads to a degradation in the quality of the play-out. A higher quality is achieved if the addition/deletion of a sample is performed during silence periods, but this is only satisfactory when the background is relatively silent.
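The sample duplication/deletion approach can be sketched as follows in Python. The function name, the choice of operating on the last sample of the buffer, and the convention that a positive drift means the receiver clock has gained time are all assumptions for illustration; the sketch does not attempt to restrict the operation to silence periods.

```python
from typing import List, Tuple


def compensate_by_sample_slip(
    samples: List[float], drift_samples: float
) -> Tuple[List[float], float]:
    """Duplicate or delete one sample once the accumulated drift reaches one sample.

    drift_samples > 0: receiver clock has gained time, so a sample is duplicated.
    drift_samples < 0: receiver clock has lost time, so a sample is deleted.
    Returns the adjusted buffer and the remaining drift.
    """
    out = list(samples)  # leave the caller's buffer untouched
    if drift_samples >= 1.0 and out:
        out.append(out[-1])   # repeat the last sample value
        drift_samples -= 1.0
    elif drift_samples <= -1.0 and out:
        out.pop()             # drop the last sample
        drift_samples += 1.0
    return out, drift_samples


buffer_out, remaining = compensate_by_sample_slip([0.1, 0.2, 0.3], 1.25)
print(buffer_out, remaining)
```

Because a whole sample is inserted or removed abruptly, the waveform is locally distorted, which is the quality degradation referred to above.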
Tõnu Trump: “Maximum Likelihood Trend Estimation in Exponential Noise”, IEEE Transactions on Signal Processing, Vol. 49, No. 9, September 2001, pages 2087-2095, addresses how to estimate a linear trend in noise, and in particular how to derive a recursive algorithm for estimating said clock skew, which may be used in real-time applications. Further, Tõnu Trump describes in “Compensation for clock skew in voice over packet networks by speech interpolation”, Proc. IEEE International Symposium on Circuits and Systems, Vol. 5, pp. V-608-V-611, May 2004, an algorithm for compensating for the clock skew by performing a more complex signal processing of the received audio samples in a receiving communicating device. The algorithm performs resampling of the number of samples in the play-out buffer in the receiver depending on an estimation of the clock skew, and the resampling involves interpolation of samples, preferably using spline interpolation. Resampling is a process of changing the sampling rate of a signal, either downsampling or upsampling, by dividing/multiplying the sampling rate by an appropriate resampling factor, and interpolation involves construction of additional samples from known samples. While linear interpolation performed on known samples interpolates a linear function between the samples, spline interpolation uses low degree polynomials in each of the intervals between the known samples. However, the above-described theories are difficult to implement in practical communication systems, since they are adapted for complete test vectors, and cannot be applied continuously on every received packet.
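The general idea of resampling by interpolation can be illustrated by the following minimal Python sketch. It uses linear interpolation rather than the spline interpolation preferred in the cited paper, and it is not the algorithm of that paper; the function name and the choice of keeping the first and last samples fixed are assumptions made for illustration.

```python
from typing import List, Sequence


def resample_linear(samples: Sequence[float], n_out: int) -> List[float]:
    """Resample to n_out samples spanning the same interval, by linear interpolation."""
    n_in = len(samples)
    if n_in == 0 or n_out <= 0:
        raise ValueError("need at least one input and one output sample")
    if n_in == 1 or n_out == 1:
        return [float(samples[0])] * n_out
    step = (n_in - 1) / (n_out - 1)  # spacing of output points on the input grid
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), n_in - 1)  # index of the known sample at or before pos
        hi = min(lo + 1, n_in - 1)    # index of the next known sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# Stretch a four-sample buffer to seven samples (upsampling).
print(resample_linear([0.0, 1.0, 2.0, 3.0], 7))
```

Stretching or shrinking a buffer by a single sample in this way spreads the correction over the whole buffer instead of concentrating it at one point, which is the motivation for interpolation over plain duplication or deletion.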
Thus, the clock skew still presents a problem in applications where the clocks cannot be synchronized, leading to packet losses and disturbances in the audio content.