Traditional telephony via the PSTN (Public Switched Telephone Network) reserves bandwidth in advance of a call and dedicates that bandwidth for the duration of the call. Additionally, it preserves the timing relationships in speech between sender and receiver through use of a common precise clock. This means that the speech is encoded at the sender exchange (with a 125 microsecond sample period), transmitted across the network and decoded at the receiver exchange, with the encoding and decoding processes essentially synchronised because they share a common clock.
By contrast, packet-based telephony, in particular Voice over IP (VoIP) employing local area networks (LANs), wide area networks (WANs) or the Internet, splits data into packets and transmits them independently of one another. However, transmitting multimedia data over packet-based networks introduces problems if the temporal relationship between adjacent packets at the sender cannot be maintained and reconstructed at the receiver. The trend towards VoIP in recent years has raised a range of complexities resulting, in particular, from the lack of a common clock.
These problems are described with reference to FIG. 1, where two Internet telephony devices 10-A and 10-B, each comprising, for example, a standard PC or IP phone, run respective telephony applications 14. These can be voice-only applications or can be voice and video applications. (For video applications, the device will also include a video card (not shown).) During a session, each application 14 sends and receives packets of multimedia information across the Internet 12 and temporarily stores the received packets of information in an associated application buffer 16.
In the case of voice information, a codec 18 takes received packets from the buffer 16 and decodes the packet information to provide more binary-like information for storing in a receive portion of a buffer 26 in an audio card 20 located in or associated with the telephony device. The audio card 20 then replays the received information through, for example, speaker(s) 30 or headphones connected to the audio card 20.
Sound received from a microphone or headset 32 is recorded by the audio card and is stored in a transmit portion of the buffer 26. This is encoded by the codec 18 and transmitted to the receiver.
The receive portions of one or both of the buffers 16 and 26 are employed to counter the effects of the potentially highly variable packet delay, known as jitter, caused by the Internet's best-effort service. These buffers absorb jitter by accumulating incoming packets, helping to ensure that playout is periodic and thus of good quality.
Each telephony device 10 typically contains a number of relatively low-grade oscillator crystals, among them the system clock crystal 24, to maintain system time, and an audio clock crystal 22, to set the sample periods for recording prior to encoding and for playback of decoded information. Such oscillator crystals can have inherent frequency errors greater than a few hundred parts-per-million, resulting in accumulated errors of tens of seconds per day. For the purposes of the present application, the term “clock skew” is defined as the difference between a crystal's actual oscillator frequency and its nominal frequency.
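The relationship between a frequency error in parts-per-million and the accumulated error per day can be checked with simple arithmetic. The following is a minimal sketch; the function name `drift_seconds_per_day` and the sample skew values are illustrative assumptions and do not come from the cited devices.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 s

def drift_seconds_per_day(skew_ppm: float) -> float:
    """Accumulated timing error over one day for a constant frequency error in ppm."""
    return skew_ppm * 1e-6 * SECONDS_PER_DAY

# Illustrative skew values only (assumed, not measured).
for ppm in (100, 300, 500):
    print(f"{ppm} ppm -> {drift_seconds_per_day(ppm):.2f} s/day")
```

On this arithmetic a 100 ppm error accumulates 8.64 seconds per day, so errors of a few hundred parts-per-million correspond to tens of seconds per day.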
Although the rate at which voice is recorded for encoding by the sender and played out after decoding by the receiver is determined purely by the audio card clock, the system clock is also used if, for example, packet-delay measurements are required, which is often the case. As such, there are often four separate clocks contributing to the session, each with its own skew, as illustrated in FIG. 2.
The Network Time Protocol (NTP) employs numerous primary and secondary servers available through the Internet that are synchronised to Coordinated Universal Time (UTC) via radio, satellite or modem. This protocol enables the synchronisation of system clocks 24 across the Internet. Alternatively, as disclosed in U.S. Pat. No. 6,360,271, GPS clocks can be used to synchronise system clocks. The effect of synchronising the system clocks 24 is to eliminate the effects of the deviation of the respective system clocks from their nominal frequency, i.e. system clock skew.
Still, a number of skew-related problems can arise:
Firstly, consider packets being transmitted from device 10-A to 10-B. If the sender audio clock 22-A operates faster than the receiver audio clock 22-B, this will lead to packet accumulation in one or other of the receive portions of the buffers 16-B, 26-B. This results in higher buffer residency delays and possibly buffer overflow (packet loss). If the sender audio clock 22-A operates slower than clock 22-B, it will result in underfill of one or both of buffers 16-B, 26-B. Of course, the same applies for audio clock 22-B and the buffers 16-A, 26-A. Thus, if the receiver audio clock rate differs from the sender audio clock rate, then the receiver buffer(s) will either gradually fill or empty.
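The gradual fill or drain of the receiver buffer can be illustrated numerically. The following is a minimal sketch, assuming an 8 kHz nominal sample rate (the 125 microsecond sample period mentioned above), constant skews, and hypothetical function names and figures chosen purely for illustration.

```python
NOMINAL_RATE_HZ = 8000  # corresponds to a 125 microsecond sample period

def buffer_drift_samples_per_second(sender_ppm: float, receiver_ppm: float,
                                    nominal_rate_hz: float = NOMINAL_RATE_HZ) -> float:
    """Net samples per second gained (positive) or lost (negative) by the receiver buffer."""
    sender_rate = nominal_rate_hz * (1 + sender_ppm * 1e-6)
    receiver_rate = nominal_rate_hz * (1 + receiver_ppm * 1e-6)
    return sender_rate - receiver_rate

def seconds_until_drift(headroom_ms: float, sender_ppm: float, receiver_ppm: float) -> float:
    """Time for the buffer level to move by headroom_ms worth of audio."""
    drift = buffer_drift_samples_per_second(sender_ppm, receiver_ppm)
    if drift == 0:
        return float("inf")  # identical rates: the buffer level never moves
    headroom_samples = headroom_ms * 1e-3 * NOMINAL_RATE_HZ
    return headroom_samples / abs(drift)

# Assumed example: sender crystal +200 ppm fast, receiver crystal 100 ppm slow.
drift = buffer_drift_samples_per_second(200, -100)
print(f"buffer gains {drift:.2f} samples/s")
print(f"60 ms of headroom is consumed in {seconds_until_drift(60, 200, -100):.0f} s")
```

With these assumed values the buffer gains about 2.4 samples per second, so 60 ms of spare capacity would be used up in roughly 200 seconds. Reversing the signs of the skews gives the underfill case.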
Secondly, in order to absorb the effects of network jitter, many VoIP applications utilise adaptive buffering approaches. These applications need to estimate changes in one-way delays and react accordingly. Other approaches use synchronised time for precise per-packet delay measurement, see for example H. Melvin and L. Murphy, “An evaluation of the use of synchronised time within a hybrid fixed-adaptive playout VoIP application”, Proceedings of IEEE Intl. Conference on Communications 2003, Anchorage, Ak., May, 2003 (Melvin et al). However, as outlined above, the rate at which packets are sent by the sender is determined solely by the sender audio card clock 22 (and not the sender system clock 24).
Again, with reference to packets being transmitted from device 10-A to 10-B, if the rate of the sender audio clock 22-A (which determines the rate at which packets are sent) differs from that of the receiver system clock 24-B (which timestamps packet arrivals to estimate delays), this will manifest itself as an apparent gradual increase or decrease in one-way delay. Thus skew between the sender audio card clock 22-A and the receiver system clock 24-B will distort such measurements, and hence the play-out mechanism and ultimately sound quality.
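The apparent drift in one-way delay can be reproduced with a small simulation. The following is a minimal sketch under stated assumptions: the network delay is constant, the receiver system clock keeps true time, packets nominally carry 20 ms of audio, and the function name `apparent_delays` and all numeric values are hypothetical.

```python
def apparent_delays(num_packets: int, packet_interval_s: float,
                    true_delay_s: float, sender_audio_skew_ppm: float) -> list:
    """One-way delay per packet as estimated by a receiver that trusts media timestamps."""
    rate = 1 + sender_audio_skew_ppm * 1e-6  # actual / nominal sender audio clock rate
    delays = []
    for n in range(num_packets):
        media_time = n * packet_interval_s        # send time implied by the media timestamp
        actual_send_time = media_time / rate      # a fast audio clock emits packets early
        arrival_time = actual_send_time + true_delay_s
        delays.append(arrival_time - media_time)  # what the receiver believes the delay is
    return delays

# Assumed example: ten minutes of 20 ms packets, 50 ms true delay, sender audio clock +200 ppm.
delays = apparent_delays(30000, 0.020, 0.050, 200)
print(f"first packet: {delays[0] * 1000:.1f} ms")
print(f"last packet:  {delays[-1] * 1000:.1f} ms")
```

Although the true network delay never changes in this sketch, the estimated delay falls by about 120 ms over the ten minutes, which an adaptive playout algorithm could misread as a change in network conditions.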
A number of approaches to resolving audio card clock skew between sender and receiver in a VoIP session have been proposed. O. Hodson, C. Perkins, and V. Hardman, “Skew Detection and Compensation for Internet Audio Applications”, Proceedings of the IEEE Int'l Conference on Multimedia and Expo., New York, July 2000; and R. Akester, and S. Hailes, “A New Audio Skew Detection and Correction Algorithm”, Proceedings of the IEEE Int'l Conference on Multimedia and Expo., Lausanne, August 2002 both disclose a low-level mechanism that measures audio skew by monitoring the data flow through the receiver device, i.e. the audio card buffers 26-A, 26-B, and thus involve low-level programming and manipulation of audio card drivers.
Because these approaches require low-level knowledge and manipulation of audio card hardware/software, their implementation details will be product-specific, even though the concepts are universally applicable. Additionally, the mechanism used to measure audio skew is subject to ‘noise’ from network jitter, and so can return wrong results and respond inappropriately unless such noise is filtered out. Such filtering is a non-trivial problem.
According to the present invention there is provided a method for determining clock skew in a packet-based session. A sequence of control packets is received from a remote device transmitting media packets in a session, with each control packet including a remote real time-stamp and a remote media card clock time-stamp corresponding to the remote real time-stamp. A first relative rate of a remote media card clock to the remote real time rate is determined from two or more of said received control packets.
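By way of illustration only, one simple way to derive such a relative rate from the time-stamp pairs is sketched below. This is not presented as the claimed implementation: the `ControlPacket` type, its field names, the 8 kHz nominal rate and the choice of using only the first and last packets are all assumptions made for the example, and time-stamp wraparound is ignored.

```python
from dataclasses import dataclass

@dataclass
class ControlPacket:
    """Hypothetical container for the two time-stamps carried by a control packet."""
    remote_real_time_s: float  # remote real time-stamp, in seconds
    remote_media_ticks: int    # remote media card clock time-stamp, in sample ticks

def relative_media_rate(packets: list, nominal_rate_hz: float) -> float:
    """Ratio of the remote media clock's actual rate to its nominal rate (1.0 = no skew)."""
    if len(packets) < 2:
        raise ValueError("at least two control packets are required")
    first, last = packets[0], packets[-1]
    elapsed_real = last.remote_real_time_s - first.remote_real_time_s
    if elapsed_real <= 0:
        raise ValueError("control packets must span a positive real-time interval")
    elapsed_ticks = last.remote_media_ticks - first.remote_media_ticks
    return (elapsed_ticks / elapsed_real) / nominal_rate_hz

# Assumed example values: a media clock that advances 8001.6 ticks per real second.
packets = [ControlPacket(0.0, 0), ControlPacket(5.0, 40008), ControlPacket(10.0, 80016)]
rate = relative_media_rate(packets, nominal_rate_hz=8000)
print(f"relative rate {rate:.6f}, skew {(rate - 1) * 1e6:.1f} ppm")
```

Because both time-stamps in each pair are taken at the remote device, this calculation does not depend on network delay or jitter between the devices.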