See What I See (SWIS) is a modern way of communicating via a mobile network. In SWIS, the communication typically comprises both audio and video. The basic idea of SWIS is to make a phone call and to simultaneously send real-time video data describing the environment from which the phone call is made. This means that the receiver can view video of the environment and the situation from which the sender is making the phone call.
SWIS can be implemented in different ways. The audio can be transmitted over a circuit-switched network and the video over a packet-switched network. It is also possible to transmit both over the packet-switched network (e.g. VoIP). In a circuit-switched (CS) network, digital data is sent as a continuous stream of bits, whereby there is hardly any delay in the transmission, or the delay is substantially constant. In a packet-switched (PS) network, digital data is sent by forming the data into short packets, which are then transmitted. The packet-switched network can have unpredictable delays, or packets may even be lost, whereby the sender cannot be sure when a data packet is received at the receiver.
Currently, data that is carried over a packet-switched network is handled by using the Real-time Transport Protocol (RTP). RTP is accompanied by the RTP Control Protocol (RTCP), which is based on the periodic transmission of control packets to all participants in a session. The primary function of RTCP is to provide feedback on the quality of the data distribution. This feedback is provided in sender reports and receiver reports.
The sender report (SR) comprises a synchronization source identifier (SSRC) for the sender of the SR in question. The SR contains an NTP timestamp, which indicates the time when the SR was sent. The SR also contains an RTP timestamp, which corresponds to the same time as the NTP timestamp, but is expressed in the same units and with the same random offset as the RTP timestamps in data packets. The SR also comprises the sender's packet count, which is the total number of RTP data packets transmitted by the sender since starting transmission up until the time the current SR was generated. The sender's octet count in the SR gives the total number of payload octets transmitted in RTP data packets by the sender since starting transmission up until the time the current SR was generated.
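By way of illustration only, the SR fields listed above can be sketched as follows. This is a minimal sketch that assumes the sender report layout of RFC 3550 (packet type 200, fields in network byte order) and a report carrying no reception report blocks. The names SenderReport, pack_sender_report and parse_sender_report are hypothetical and are not taken from any particular library.

```python
import struct
from dataclasses import dataclass

SR_PACKET_TYPE = 200          # RTCP packet type of a sender report (RFC 3550)
_SR_FORMAT = "!BBHIIIIII"     # header, SSRC and sender info, network byte order
_SR_SIZE = struct.calcsize(_SR_FORMAT)  # 28 octets


@dataclass
class SenderReport:
    ssrc: int            # synchronization source identifier of the sender
    ntp_seconds: int     # NTP timestamp, most significant 32 bits
    ntp_fraction: int    # NTP timestamp, least significant 32 bits
    rtp_timestamp: int   # same instant, in RTP timestamp units and offset
    packet_count: int    # sender's packet count
    octet_count: int     # sender's octet count (payload octets only)


def pack_sender_report(sr: SenderReport) -> bytes:
    """Serialize an SR that carries no reception report blocks."""
    first_byte = 2 << 6                  # version 2, no padding, report count 0
    length_words = _SR_SIZE // 4 - 1     # length in 32-bit words minus one
    return struct.pack(
        _SR_FORMAT, first_byte, SR_PACKET_TYPE, length_words,
        sr.ssrc, sr.ntp_seconds, sr.ntp_fraction,
        sr.rtp_timestamp, sr.packet_count, sr.octet_count,
    )


def parse_sender_report(data: bytes) -> SenderReport:
    """Read the fixed part of an SR; reception report blocks are ignored."""
    fields = struct.unpack(_SR_FORMAT, data[:_SR_SIZE])
    first_byte, packet_type = fields[0], fields[1]
    if first_byte >> 6 != 2 or packet_type != SR_PACKET_TYPE:
        raise ValueError("not an RTCP sender report")
    return SenderReport(*fields[3:])
```

Because the NTP timestamp and the RTP timestamp in the same SR refer to the same instant, a receiver can use such a pair to map RTP timestamps of media packets onto wallclock time.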
The receiver report (RR) also comprises a source identifier (SSRC_n), which identifies the source to which the information in the receiver report pertains. The RR also comprises the fraction lost, which describes the fraction of RTP data packets from source SSRC_n lost since the previous SR or RR packet was sent. This fraction is defined to be the number of packets lost divided by the number of packets expected. The cumulative number of packets lost in the RR is the total number of RTP data packets from source SSRC_n that have been lost since the beginning of reception. This number is defined to be the number of packets expected minus the number of packets actually received, where the number of packets received includes any which are late or duplicates. The number of packets expected is defined to be the extended highest sequence number received, less the initial sequence number received. The extended highest sequence number contains the highest sequence number received in an RTP data packet from source SSRC_n, extended with the corresponding count of sequence number cycles. The RR also comprises the interarrival jitter, which is an estimate of the statistical variance of the RTP data packet interarrival time. The RR further comprises the last SR timestamp and the delay since last SR (DLSR), which is the delay between receiving the last SR packet from source SSRC_n and sending the current reception report block. If no SR packet has been received yet from SSRC_n, the DLSR field is set to zero.
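The loss and jitter quantities described above can be illustrated with the following minimal sketch. It assumes the bookkeeping suggested in RFC 3550 (16-bit sequence numbers, a cycle count kept by the receiver, and a jitter estimate smoothed with a gain of 1/16). The function and parameter names are hypothetical. The fraction lost is returned here as a plain ratio, whereas in the RR itself it is carried as an 8-bit fixed-point number.

```python
def extended_highest_seq(cycles: int, highest_seq: int) -> int:
    """Combine the cycle count with the highest 16-bit sequence number."""
    return (cycles << 16) | highest_seq


def loss_statistics(initial_seq: int, cycles: int, highest_seq: int,
                    received: int, expected_prior: int, received_prior: int):
    """Return (cumulative number of packets lost, fraction lost).

    expected_prior and received_prior are the expected and received
    totals recorded when the previous report was sent.
    """
    expected = extended_highest_seq(cycles, highest_seq) - initial_seq + 1
    cumulative_lost = expected - received   # may be negative with duplicates

    expected_interval = expected - expected_prior
    received_interval = received - received_prior
    lost_interval = expected_interval - received_interval
    if expected_interval <= 0 or lost_interval <= 0:
        fraction_lost = 0.0
    else:
        fraction_lost = lost_interval / expected_interval
    return cumulative_lost, fraction_lost


def update_jitter(jitter: float, transit_prev: int, transit_now: int) -> float:
    """Update the interarrival jitter estimate with one new packet.

    A transit value is the arrival time minus the RTP timestamp of a
    packet, both expressed in RTP timestamp units.
    """
    d = abs(transit_now - transit_prev)
    return jitter + (d - jitter) / 16.0
```

For example, if the first sequence number received was 100, the highest received so far is 199 with no wrap-around, and 95 packets have arrived, then 100 packets were expected and the cumulative number of packets lost is 5.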
The problem with SWIS over a packet-switched network is due to the unpredictable delays, whereby the sender cannot be sure of the exact time when a media packet is received by the receiver. In the context of video, the sender thus cannot know what part of the sent media is visible at the receiver at a given time. Therefore, in SWIS the user's speech might be received significantly earlier than the video, whereby the recipient does not necessarily know what the sender is talking about if the sender is referring to something shown in the video transmission.
Synchronization methods for audio and images, used e.g. in video conferencing, can be found in the related art. These methods mainly relate to so-called “lip synchronization”. In EP1057337 B1, sound and images are synchronized in real-time multimedia communication by detecting any mismatch between the sound and image outputs and adjusting a variable delay in a gateway on a signal routed through said gateway until the sound and image outputs are synchronized.
However, in SWIS the problems relate more to the fact that the sender is not aware of exactly what part or phase of the media the receiver is currently receiving and observing. Therefore, there is still a clear need for improvements in this kind of communication in order to improve the overall information exchange and mutual understanding between the human sender and receiver.