Some of the new applications in data transfer systems relate to the transport of media components between electronic devices. The media components can be e.g. continuous media streams that are transmitted in real time to a receiving device. An example of such an application is See-What-I-See (SWIS).
SWIS is a way of communicating, currently via a mobile network. SWIS communication typically comprises both audio and video components that are delivered from one device to another. The basic idea of SWIS is to make a phone call and simultaneously send real-time video data captured by, or otherwise provided by, a sending device. This means that the receiving device can display the video to the receiver while the sender and the receiver are on the phone call.
SWIS can be implemented in different ways. The audio can be transmitted over a circuit-switched network and the video over a packet-switched network. It is also possible to transmit both over the packet-switched network (e.g. in VoIP). In a circuit-switched (CS) network, digital data is sent as a continuous stream of bits, so there is hardly any delay in the transmission, or the delay is substantially constant. In a packet-switched (PS) network, digital data is sent in short packets that carry the digital data to be transmitted.
Currently, data that is carried over a packet-switched network is handled using the Real-time Transport Protocol (RTP). The RTP Control Protocol (RTCP) is based on the periodic transmission of control packets to all participants in a session. A primary function of RTCP is to provide feedback on the quality of the data distribution.
Synchronization methods for audio and images, used e.g. in video conferencing, can be found in the related art. An example of synchronization for video conferencing is disclosed in EP1057337 B1, where sound and images are synchronized by detecting any mismatch between the sound and image outputs and adjusting a variable delay in a gateway on a signal routed through said gateway until the sound and image outputs are synchronized. In this publication, a video device and an audio device are interconnected by a gateway, which acquires audio signals and video signals and is capable of determining a delay between the audio and video signals. The synchronization is carried out using test signals and calculated delays.
Synchronization of two RTP streams, e.g. an audio RTP stream and a video RTP stream, is done as follows. Each RTP packet contains a timestamp for the payload of the packet. The first timestamp of the stream is set to a random value for security reasons, and timestamps are coded as clock ticks of the native frequency of the media (usually 90 kHz for video and the sampling frequency or its integer multiple for audio). An RTCP packet stream accompanies each RTP stream. Periodically, every few seconds or so, an RTCP sender report is generated; it carries the wallclock time (NTP time) that corresponds to a particular RTP timestamp. The receiver then uses the RTCP sender reports to convert RTP timestamps to wallclock time and schedules the playout of media samples accordingly.
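The conversion described above can be illustrated with a minimal sketch. The names SenderReport and rtp_to_wallclock are hypothetical and chosen only for illustration, the wallclock time is simplified to a floating-point number of seconds rather than the NTP timestamp format, and the clock rates (90 kHz video, 8 kHz audio) and timestamp values are example assumptions.

```python
from dataclasses import dataclass


@dataclass
class SenderReport:
    """Hypothetical holder for the two sender-report fields used here."""
    ntp_seconds: float   # sender wallclock time, simplified to seconds
    rtp_timestamp: int   # RTP timestamp that corresponds to ntp_seconds


def rtp_to_wallclock(rtp_ts: int, sr: SenderReport, clock_rate: int) -> float:
    """Map an RTP timestamp to sender wallclock time via a sender report."""
    # RTP timestamps are 32-bit and wrap around, so compute the tick
    # difference modulo 2**32 and interpret it as a signed value.
    diff = (rtp_ts - sr.rtp_timestamp) & 0xFFFFFFFF
    if diff >= 0x80000000:
        diff -= 0x100000000
    # Convert media clock ticks to seconds and offset the reported wallclock.
    return sr.ntp_seconds + diff / clock_rate


# Example values (assumed): each stream starts from its own arbitrary offset.
video_sr = SenderReport(ntp_seconds=1000.0, rtp_timestamp=900_000)
audio_sr = SenderReport(ntp_seconds=1000.5, rtp_timestamp=16_000)

video_time = rtp_to_wallclock(990_000, video_sr, clock_rate=90_000)
audio_time = rtp_to_wallclock(20_000, audio_sr, clock_rate=8_000)

# Samples that map to the same wallclock time are scheduled for playout together.
print(video_time, audio_time)
```

In this sketch the video packet and the audio packet map to the same wallclock instant even though their RTP timestamps and clock rates are unrelated, which is what allows the receiver to align the two streams.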
In the basic form of e.g. the SWIS application, a circuit-switched call is ongoing when the sending device decides to share video with the receiving device. A packet-switched video connection is established and video is transported over e.g. RTP/UDP/IP (Real-time Transport Protocol/User Datagram Protocol/Internet Protocol) to the receiving device. As noted above, video packets are likely to experience a transmission delay that differs from that of the speech frames in the circuit-switched call and is unpredictable. No information about how to synchronize the transported video to the speech is conveyed by the transport protocols. Therefore, the receiver cannot reproduce accurate audio and video synchronization.