Multimedia data, for example data comprising audio and video components, is useful for delivering entertainment content and for enhancing communications between remote parties. For example, video telephony systems are increasingly popular as a way to conduct meetings between persons at remote locations. As high-speed Internet Protocol (IP) networks have become more widely available, lower-cost video conferencing equipment has also become available, making such equipment more accessible. However, because packet data networks send audio and video information separately, and because of various network effects, it is not uncommon for audio streams and associated video streams to become noticeably unsynchronized from one another. This problem is particularly apparent in connection with audio/video information that is transmitted across long distances over IP networks.
At present, most media gateways do not provide for the synchronization (or “lipsync”) of related audio and video streams. In addition, most media gateways do not use delay compensation or lipsync buffering of any sort. Through the RTP Control Protocol (RTCP), a mapping from Real-time Transport Protocol (RTP) time stamps to Network Time Protocol (NTP) time stamps, that is, to wall clock time, is possible. However, that information is not available at the right time to compensate for drift between audio and video streams. In particular, the RTP time stamps available in the RTP header are not wall clock time stamps, and the time stamps for audio and video need not start from the same count or follow a common scheme. For example, audio time stamps may increase by 160 for every packet, depending on the sampling rate, whereas video packets that belong to the same frame may not increase the time stamp value at all. Mapping RTP time stamps to NTP time stamps is therefore not adequate, because RTCP sender reports are not frequent enough and are not available at the time needed for delay compensation. Furthermore, if multiple synchronization sources are present, synchronizing using RTCP is not practical. Accordingly, there is no clear solution for resolving lipsync issues in Internet Protocol (IP) networks.
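By way of illustration only, the mapping from RTP time stamps to wall clock time discussed above may be sketched as follows. The sketch assumes that one RTCP sender report per stream has already been received, and it assumes clock rates of 8000 Hz for audio and 90000 Hz for video, which are common values but are not stated in this description. The names `SenderReport` and `rtp_to_wallclock` are hypothetical and do not refer to any particular library.

```python
from dataclasses import dataclass

RTP_TS_MODULUS = 1 << 32  # RTP time stamps are 32-bit counters that wrap around


@dataclass
class SenderReport:
    """Reference point taken from one RTCP sender report (hypothetical type)."""
    ntp_seconds: float   # wall clock time carried in the report, in seconds
    rtp_timestamp: int   # RTP time stamp that corresponds to that instant
    clock_rate: int      # RTP clock rate in Hz (assumed, not from the report)


def rtp_to_wallclock(rtp_ts: int, sr: SenderReport) -> float:
    """Map an RTP time stamp to wall clock seconds using one sender report."""
    # Signed distance from the reference time stamp, tolerant of 32-bit wrap.
    delta = (rtp_ts - sr.rtp_timestamp) % RTP_TS_MODULUS
    if delta >= RTP_TS_MODULUS // 2:
        delta -= RTP_TS_MODULUS
    return sr.ntp_seconds + delta / sr.clock_rate


# The two streams start from unrelated counts, as noted above.
audio_sr = SenderReport(ntp_seconds=1000.0, rtp_timestamp=50_000, clock_rate=8000)
video_sr = SenderReport(ntp_seconds=1000.0, rtp_timestamp=900_000, clock_rate=90000)

# Five audio packets later (160 ticks each) and three video frames later
# (3000 ticks each, assuming 30 frames per second).
audio_time = rtp_to_wallclock(50_000 + 5 * 160, audio_sr)
video_time = rtp_to_wallclock(900_000 + 3 * 3000, video_sr)
skew_seconds = audio_time - video_time  # positive means audio is later than video
```

The mapping is only as current as the most recent sender report for each stream, which is why infrequent reports limit its usefulness for delay compensation.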
Certain products are available that claim to provide lipsync features through lipsync buffering. These may or may not depend on the time stamps available in the RTP and RTCP headers. In some cases, especially in professional broadcast video solutions, lipsync based on phonetics and pattern recognition may be used. More particularly, systems have been proposed that detect an audio event in an audio portion of a media program signal, and that measure the timing interval from the audio event to a subsequent video synchronization pulse in the video portion of the media program signal. The timing interval is stored in a third portion of the media program signal. At a receiving end, the timing interval information is retrieved and used to align the audio event to the video synchronization pulse. However, such systems are relatively complex to implement.
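The timing interval approach described above may be sketched, in greatly simplified form, as follows. The sketch assumes that the audio event and the video synchronization pulses have already been detected and are expressed as integer millisecond times on a shared clock. It also assumes that the audio/video offset is smaller than the pulse period, so that the same pulse follows the event at both ends. The function names are hypothetical and are not taken from any proposed system.

```python
from bisect import bisect_left


def interval_to_next_pulse(event_ms, pulse_times_ms):
    """Return the time from an audio event to the first video sync pulse
    at or after it, or None if no such pulse exists.
    pulse_times_ms must be sorted in ascending order."""
    i = bisect_left(pulse_times_ms, event_ms)
    if i == len(pulse_times_ms):
        return None
    return pulse_times_ms[i] - event_ms


def audio_shift_ms(stored_interval_ms, event_ms, pulse_times_ms):
    """Return the shift to apply to the audio at the receiving end so that
    the observed interval again equals the stored interval.
    A negative result means the audio must be moved earlier."""
    observed = interval_to_next_pulse(event_ms, pulse_times_ms)
    if observed is None:
        return None
    return observed - stored_interval_ms


pulses = [0, 40, 80, 120]

# Sending end: the audio event occurs at 50 ms, so the interval to the
# pulse at 80 ms is measured and stored with the signal.
stored = interval_to_next_pulse(50, pulses)

# Receiving end: the audio has slipped 20 ms relative to the video.
shift = audio_shift_ms(stored, 70, pulses)
```

In this example the stored interval is 30 ms and the computed shift is -20 ms, meaning the audio would be advanced by 20 ms or, equivalently, the video delayed by 20 ms. The sketch omits the event detection itself, which is the principal source of the complexity noted above.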
The Moving Picture Experts Group standard 4 (MPEG4) provides sync and multiplexing layers. However, the sync and multiplexing layers are not actually implemented in most systems. In particular, those systems that support MPEG4 video usually implement only the compression layer of the standard, and not the sync layer or the delivery layer. As a result, MPEG4 compressed video RTP streams have no common reference count or time stamp when used with audio RTP streams, unless all of the MPEG4 layers are implemented in a common framework. Implementing all of the layers in embedded endpoints is not cost-effective and, in any event, many standards require other video codecs, such as H.261 and H.263. Accordingly, using the MPEG4 standard for synchronization in connection with multimedia calls placed over IP networks is impractical.