Communication using multiple media, including both video data and associated audio data, is becoming increasingly important in the communications industry, for both fixed and mobile access. Traditional speech telephony is increasingly being upgraded to include a video component (i.e., video data), giving users the opportunity to communicate using so-called “video telephony.”
The video data associated with a video telephony call is typically created by a video camera in the sending device. The sending device may be a portable device, such as a mobile phone. Sometimes the user orients the sending device so that the camera is positioned to show the speaker's face. However, the camera may also be used to show other things that the user finds relevant to the conversation, for example a view that the user wants to share with the person that she or he is talking to. Thus, what is shown during a communication session can change. In this context, the video data and the audio data are usually generated with a logical connection, e.g., the speech of a user is associated with video of the face of that user producing the speech.
When the speaking user is also shown on the listening user's screen, it is desirable that the audio and video data be synchronized so that the listening user experiences good coordination between the sound and the video. To achieve this coordination, the lip movements of the speaking user should normally be in sync with the sound from the device's speakerphone. This provides a connection between the lip movements and the heard words, as there would be in a normal discussion between two people at a short distance. This is referred to herein as lip-sync or logically related audio and video data.
Hence, in existing services, such as 3G circuit-switched video telephony (see for example 3GPP TS 26.111, which is incorporated by reference herein, from 3GPP standard group, ETSI Mobile Competence Centre 650, route des Lucioles 06921 Sophia-Antipolis Cedex, France) and emerging IP multimedia services such as IMS Multimedia Telephony (see for example 3GPP TS 22.173 and ETSI TS 181 002 from ETSI), support for inter-media synchronization is desired. The traditional methods for achieving synchronization between audio and video are discussed next. For Circuit Switched Multimedia, an indication can be provided of how much the audio shall be delayed in order to be synchronized with the video (see ITU-T H.324). For services that are transported over the Real-time Transport Protocol (RTP, see IETF RFC 3550), RTP timestamps together with RTP Control Protocol (RTCP) sender reports can be used as input to achieve the synchronization (see IETF RFC 3550). However, some existing multimedia communication services do not provide any media synchronization, resulting in a poor user experience when lip-synchronization is needed.
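By way of illustration only, the use of RTP timestamps together with RTCP sender reports may be sketched as follows. A sender report pairs a sender wallclock time with the RTP timestamp of the same instant, so a receiver can map the RTP timestamp of any audio or video packet onto a common time line and compare the two streams. The names `SenderReport`, `rtp_to_wallclock` and `av_offset_seconds` are hypothetical, the wallclock time is assumed to have already been converted to seconds, and the clock rates in the usage note (8000 Hz for audio, 90000 Hz for video) are merely typical values. This is a minimal sketch under those assumptions and not a complete receiver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SenderReport:
    """Subset of an RTCP sender report (hypothetical container)."""
    ntp_seconds: float   # sender wallclock time, assumed already in seconds
    rtp_timestamp: int   # RTP timestamp corresponding to ntp_seconds
    clock_rate: int      # RTP clock rate of the stream, in Hz


def rtp_to_wallclock(sr: SenderReport, rtp_timestamp: int) -> float:
    """Map a packet's RTP timestamp to sender wallclock time in seconds."""
    # RTP timestamps are 32-bit and wrap around, so take the signed
    # difference modulo 2**32 relative to the sender report.
    diff = (rtp_timestamp - sr.rtp_timestamp) & 0xFFFFFFFF
    if diff >= 0x80000000:
        diff -= 0x100000000
    return sr.ntp_seconds + diff / sr.clock_rate


def av_offset_seconds(audio_sr: SenderReport, audio_ts: int,
                      video_sr: SenderReport, video_ts: int) -> float:
    """Capture-time offset of a video frame relative to an audio frame.

    A positive value means the video frame was captured after the audio.
    """
    return (rtp_to_wallclock(video_sr, video_ts)
            - rtp_to_wallclock(audio_sr, audio_ts))
```

For example, with an audio sender report of `SenderReport(1000.0, 160000, 8000)` and a video sender report of `SenderReport(1000.5, 900000, 90000)`, an audio packet with RTP timestamp 168000 and a video frame with RTP timestamp 945000 both map to the same capture instant, so the receiver would present them together.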
Systems that synchronize the audio with the video typically delay the audio data by a certain amount of time until the video data is decoded, and then play both simultaneously to achieve the desired lip-synchronization. However, this synchronizing method is unpleasant for users because the increased delay causes long response times and disrupts the conversation. For example, the video data typically has a longer delay from the camera to the screen than the speech has from the microphone to the speakerphone. The longer delay for video data is caused by a longer algorithmic delay for encoding and decoding, often a slower frame rate (compared to audio data), and in some cases also by a longer transfer delay due to the higher bit rate. Assuming that the receiving device synchronizes audio and video, the device has to delay the audio data flow before playing it out. This naturally degrades the user's experience of the speech, which in turn hampers the conversational quality. For example, when the delay of the audio data exceeds a certain limit (about 200 ms), it starts to impact the conversational quality. At first, the user may be annoyed because the other speaker seems to react slowly, and sometimes both speakers start to talk simultaneously (because they notice this problem only after some time delay). If the delay is large (e.g., over 500 ms), it becomes difficult to keep up a normal conversation. Thus, one cause of the dissatisfaction of speakers using video telephony is that the response time of the other speaker is too long, unlike in a normal face-to-face or speech telephony conversation.
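The effect described above may be illustrated with a short calculation. The sketch below assumes that a receiving device holds the audio back until the video is ready, and classifies the resulting audio delay against the approximate limits mentioned above (about 200 ms and about 500 ms). The function names, the labels, and the delay values in the usage note are hypothetical and serve only as an example.

```python
NOTICEABLE_MS = 200.0  # approximate limit above which quality is impacted
DIFFICULT_MS = 500.0   # approximate limit above which conversation is hard


def synced_audio_delay_ms(audio_path_ms: float, video_path_ms: float) -> float:
    """Audio delay after the receiver delays audio to match the video.

    audio_path_ms: delay from microphone to speakerphone without syncing.
    video_path_ms: delay from camera to screen.
    """
    # Audio is only ever held back; it is never played ahead of its own path.
    extra_hold_ms = max(0.0, video_path_ms - audio_path_ms)
    return audio_path_ms + extra_hold_ms


def conversational_impact(audio_delay_ms: float) -> str:
    """Classify an audio delay using the approximate limits above."""
    if audio_delay_ms > DIFFICULT_MS:
        return "difficult"
    if audio_delay_ms > NOTICEABLE_MS:
        return "noticeable"
    return "acceptable"
```

For instance, with an assumed audio path delay of 120 ms and an assumed video path delay of 350 ms, the synchronized audio is played out 350 ms after capture. The unsynchronized audio would be classified as "acceptable", whereas the synchronized audio falls in the "noticeable" range.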
Accordingly, it would be desirable to provide devices, systems and methods for audio and video communications that avoid the afore-described problems and drawbacks.