Video conferencing provides a way for people at distant locations to simulate a live face-to-face meeting. Video conferencing techniques generally call for broadcasting live ("real-time"), two-way audio and video interactively between two or more remote sites. Generally, a computer, video camera, and speaker are employed at each site participating in a video conference. Video conferencing software executing on each computer manages the equipment and the video conferencing session. The session is interactive in that it allows participants to make changes to documents that others can see in real time. A windows-based graphical user interface is generally employed so that a live video feed can be seen by a user in one window, while other computer-generated images are displayed in other windows. The participating computer systems may be connected by any of various types of communication links, such as conventional telephone lines, otherwise known as Plain Old Telephone Service (POTS), a local area network (LAN), or Integrated Services Digital Network (ISDN) connections. Various standards exist to define video conferencing using such media. For example, International Telecommunication Union (ITU) standard H.320 is a specification which defines multipoint video conferencing over circuit switched media, such as ISDN. ITU standard H.323 defines video communication on LANs, while ITU standard H.324 is directed to video and audio communication using POTS.
One problem encountered in video conferencing is that of synchronizing associated audio and video streams, i.e., synchronizing audio and video streams acquired concurrently by a camera and a microphone of a participating processing system. Synchronization can be difficult when the audio and video streams are processed independently in the transmitting or receiving system or both, as is generally the case. Typically, the audio and video data streams are processed by separate hardware subsystems under the control of separate software drivers. Hence, audio and video data from a given site are separated into separate data streams that are transmitted to separate audio and video subsystems at a remote site. Because the audio and video data streams are processed independently, there is often no explicit synchronization between these two recorded data streams.
The problem of synchronization, which is often referred to as "lip sync", is of particular concern in a video conferencing system that has the capability to record and play back audio and video. An example of such a system is the Intel ProShare® video conferencing system, which is available from Intel Corporation of Santa Clara, Calif. The ProShare® system includes the capability to record and then play back live audio and video received from a remote site during a video conferencing session. The synchronization problem is of concern in this context because the video sequence may be played at a noticeably different speed than the audio sequence, due to the independent processing of the audio and video streams. This result is likely to occur if the recorded file does not contain original time stamp information for each frame in the stream, as is the case for a file recorded in the well-known Microsoft Media Player AVI (Audio Video Interleave) format.
Synchronization problems tend to worsen in the context of receiving real-time audio and video data streams from a remote processing system, such as during video conferencing. One reason for this worsening is that transmitted audio and video data from one participating processing system tend to arrive at another participating processing system at unpredictable, irregular time intervals due to delays in the data channel and the processing load of the system. In a standalone computer system running a playback application, such as Media Player, the audio/video lip sync problem might be solved by minimizing the latency between the start of playing the audio stream and the start of playing the video stream. However, in a real-time video conference, at least two factors make it difficult to minimize such latency.
First, the video stream tends to take an unpredictable amount of time to start playing. This time delay often cannot be compensated for, since the delay changes dynamically due to many factors, such as fluctuation in the processing load, the transport protocol, and the video mode. Because audio is generally given highest priority in a video conference, the audio data stream normally has a constant data rate. Non-audiovisual data is often given the next highest priority, while video data is given the lowest priority. Consequently, the frame rate of the video stream may vary based on the above factors, while the audio frame rate does not. Second, because of the randomness of the machine load, even if the start of the audio and video streams is synchronized, the playing of the streams may gradually drift out of sync. The effect of a variable video frame rate may be perceived by a user as a jerky image, which may distract the user or otherwise degrade the perceived quality of the video conferencing session.
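The drift described above can be illustrated with a small worked example. The following Python sketch is not taken from any particular system; the function name `playback_drift`, the sample capture intervals, and the nominal frame rate are all hypothetical. It assumes that audio plays back at its constant rate (so audio time equals capture time) and that video frames recorded without time stamps are replayed at a fixed nominal frame rate, then reports how far each video frame lands from the audio it was captured with.

```python
def playback_drift(capture_intervals_ms, nominal_fps):
    """Return, for each video frame, (playback time - capture time) in ms.

    Assumptions (illustrative only):
      * capture_intervals_ms[i] is the real elapsed time between video
        frame i-1 and frame i at the sending site; it varies with load.
      * Audio has a constant rate, so audio playback time tracks capture time.
      * The recording holds no time stamps, so the player shows video
        frames at a fixed nominal interval of 1000 / nominal_fps ms.

    A negative value means the video frame is shown before its matching
    audio is heard (video runs fast); positive means video lags audio.
    """
    nominal_interval_ms = 1000.0 / nominal_fps
    drift_ms = []
    captured_at_ms = 0.0
    for index, interval_ms in enumerate(capture_intervals_ms):
        captured_at_ms += interval_ms                  # true capture time
        played_at_ms = (index + 1) * nominal_interval_ms  # fixed-rate playback
        drift_ms.append(played_at_ms - captured_at_ms)
    return drift_ms


if __name__ == "__main__":
    # Hypothetical capture: the sender slows down on the third and fourth frames.
    intervals = [100, 100, 150, 200, 100]
    print(playback_drift(intervals, nominal_fps=10))
    # prints [0.0, 0.0, -50.0, -150.0, -150.0]
```

In this hypothetical run the two streams start in sync, yet by the fourth frame the video is shown 150 ms before the corresponding audio, and the error never recovers because nothing in the recording indicates when each frame was actually captured.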
Another difficulty associated with recording a real-time transmission is that the audio and video frames are randomly delayed and may arrive at unpredictable time intervals. Such intervals are difficult to duplicate during playback without explicit time stamp information. Yet time stamp information may not be available. Furthermore, synchronization techniques based on time stamping tend to require explicit synchronization at the transmitting end.
Therefore, it is desirable to provide a technique for synchronizing audio and video streams without the need for time stamp information, in order to facilitate the recording and playback of audio and video streams in a video conferencing session or other real-time audiovisual transmission.