1. Field of the Invention
This invention relates to communications systems and, more particularly, to a method and apparatus for synchronizing audio and video data in video conferencing systems.
2. Description of the Related Art
A video conference environment typically includes a plurality of conference sites, commonly referred to as endpoints, which are geographically separated but electronically linked together to enhance collaboration between and among individuals at the various conference sites. A video conference system attempts to replicate the interpersonal communication and information sharing which would occur if all the participants were together in the same room at the same time.
Two or more endpoints participating in a video conference are coupled together via digital communication lines. For example, in a point-to-point conference, two endpoints are coupled in a video conference, and one endpoint dials the other endpoint directly to initiate the video conference. In a multi-point video conference, which involves more than two endpoints, each endpoint dials a multipoint control unit (MCU), which couples the endpoints in the same video conference.
Each endpoint typically transmits and receives conference information (e.g., audio and video information) to and from the other endpoint(s). Each conference site includes at least one source of conference information. For example, each endpoint has one or more audio and video sources from which to select for transmission to the other endpoint(s). When audio and video signals are transmitted, synchronizing presentation of the signals at the other endpoints is referred to as "lip synching" and is an important element in user satisfaction with video conferencing. For example, when a close-up image of an individual speaking at one endpoint is displayed on monitors at the other endpoints, it is desirable for the speaker's voice to match movement of the speaker's mouth.
Communication standards for video conferencing systems, such as H.320, H.324, and H.323, use separate data streams for audio data and video data. A frame of digital video requires more information to represent than the corresponding digital audio, and accordingly, more time is required to process video data than audio data.
A common technique for achieving lip sync in video conferencing systems is to use a table of static values representing the delay between the audio and video data at various line rates. Currently, a user selects an audio delay setting from a property page on a system configuration property sheet. The property page allows the user to set delay values for both transmitted and received audio for each line rate, or to reset the list to default values. Controlling delay with a static value is inadequate, however, because total delay is affected by both line rate and video bit rate. Video bit rate is, in turn, affected by factors such as data rates, communication protocol, audio algorithm selection, and the number of data channels transmitted. These factors often change during the course of a video conference, and it is therefore desirable for a video conferencing system to be capable of automatically adjusting the delays to keep the audio and video data streams synchronized.
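The limitation of the static-table technique can be illustrated with a minimal sketch. The table contents, the function names, and the simplified rule that video bit rate is whatever remains of the line rate after audio and data channels are subtracted are all assumptions made for illustration; none of them are taken from an actual system or from the standards named above.

```python
# Hypothetical static table: line rate (kbit/s) -> audio delay (ms).
# The delay values are invented for illustration only.
STATIC_AUDIO_DELAY_MS = {
    128: 300,
    384: 200,
    768: 120,
}


def static_audio_delay_ms(line_rate_kbps: int) -> int:
    """Return the fixed audio delay configured for the given line rate."""
    return STATIC_AUDIO_DELAY_MS[line_rate_kbps]


def video_bit_rate_kbps(line_rate_kbps: int, audio_kbps: int, data_kbps: int) -> int:
    """Assumed simplified model: video gets the bandwidth left over
    after the audio and data channels are subtracted from the line rate."""
    return line_rate_kbps - audio_kbps - data_kbps


if __name__ == "__main__":
    line_rate = 384

    # Same line rate, two different mid-conference configurations.
    video_a = video_bit_rate_kbps(line_rate, audio_kbps=64, data_kbps=0)
    video_b = video_bit_rate_kbps(line_rate, audio_kbps=16, data_kbps=64)

    # The table is keyed only by line rate, so it returns the same delay
    # for both configurations even though the video bit rate differs.
    print("video bit rates:", video_a, video_b)
    print("static delay in both cases:", static_audio_delay_ms(line_rate))
```

Because the lookup depends only on line rate, a change in audio algorithm or in the number of data channels alters the video bit rate without altering the applied delay, which is the shortcoming described above.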