This invention relates to apparatus for conducting a video conference and to a method of conducting a video conference.
Referring to FIG. 1 of the drawings, one typical implementation of video conferencing among several sites involves use of an A/V terminal T at each conference site and a single multi-point control unit, or MCU. The several conference sites are spatially separated to a greater or lesser degree and are all connected to a network. The MCU is also connected to the network. Although FIG. 1 shows the MCU at a different network site from the conference terminals, in practice the MCU might be at one of the conference sites and the terminal at that site might be connected to the network through the MCU.
Referring to FIG. 2, each A/V terminal includes a microphone 2, a loudspeaker 6, a camera 10, a monitor 14, an encoder/decoder (CODEC) 18/20, and a network interface driver 24. The microphone and camera acquire audio and video signals, which are then digitized, and the encoder 18 encodes the digital audio and video signals in accordance with appropriate compression protocols, such as MPEG 1 and MPEG 2, and outputs a standard audio-video MPEG transport stream (MTS). The network interface driver 24 receives the MPEG transport stream and creates audio-video IP packets {AV}, where the braces { } designate encapsulation of the MTS packets in IP packets. The IP packets that are derived from MTS packets are referred to herein as AV IP packets in order to distinguish them from other IP packets. Each AV IP packet typically contains seven MTS packets. The MCU sends requests for AV IP packets to the different terminals over the network. The terminals respond to the packet requests by sending the appropriate AV IP packets onto the network, and the network routes the AV IP packets to the MCU.
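By way of illustration only, the encapsulation of MTS packets in AV IP packets described above can be sketched as follows. The sketch builds only the payload of each AV IP packet and leaves header handling to the network stack. The 188-byte packet size and the 0x47 sync byte are taken from the MPEG-2 Systems standard rather than from this description, and the helper name `encapsulate` is hypothetical.

```python
from typing import List

# Assumptions (not from the description above): 188-byte transport stream
# packets that begin with sync byte 0x47, as in the MPEG-2 Systems standard.
TS_PACKET_SIZE = 188
TS_SYNC_BYTE = 0x47
TS_PACKETS_PER_AV_IP = 7  # "typically contains seven MTS packets"


def encapsulate(mts: bytes, per_packet: int = TS_PACKETS_PER_AV_IP) -> List[bytes]:
    """Group whole MTS packets into AV IP packet payloads (hypothetical helper)."""
    if per_packet < 1:
        raise ValueError("per_packet must be at least 1")
    if len(mts) % TS_PACKET_SIZE != 0:
        raise ValueError("stream is not a whole number of MTS packets")

    # Split the transport stream into individual 188-byte MTS packets.
    packets = [mts[i:i + TS_PACKET_SIZE] for i in range(0, len(mts), TS_PACKET_SIZE)]
    for index, packet in enumerate(packets):
        if packet[0] != TS_SYNC_BYTE:
            raise ValueError(f"MTS packet {index} lacks the sync byte")

    # Concatenate up to `per_packet` MTS packets into each payload; the last
    # payload may be shorter if the stream does not divide evenly.
    return [b"".join(packets[i:i + per_packet]) for i in range(0, len(packets), per_packet)]
```

Under these assumptions, seven MTS packets occupy 7 x 188 = 1316 bytes, which leaves room for packet headers within a typical 1500-byte Ethernet frame payload.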
Referring to FIG. 3, the network interface driver 28 of the MCU receives the AV IP packets provided by the respective terminals T and routes the four MPEG transport streams recovered from the AV IP packets to respective decoders 32₁-32₄. Each decoder 32 decompresses the MPEG transport stream received from the corresponding terminal to generate a terminal video signal VIN and a terminal audio signal AIN, which it supplies to an audio/video processor 36.
The A/V processor 36 combines the input audio signals A1IN-A4IN to generate output audio signals A1OUT-A4OUT for the terminals T1-T4 respectively and routes the audio signals A1OUT-A4OUT to the encoders 40₁-40₄ respectively. Normally, the audio signal that is supplied to the loudspeaker 6 at a given conference site will reflect the audio signals acquired by the microphones 2 at all the other conference sites. The A/V processor may generate the output audio signals by first combining all the input audio signals to create a common mix signal and then subtracting the input audio signal received from a given terminal from the common mix signal to create a mix-minus audio output signal for the given terminal. Accordingly, the output audio signal for terminal T1, for example, is composed of the signals A2IN-A4IN received from terminals T2, T3 and T4. In this manner, objectionable echo effects are reduced or avoided.
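The mix-minus operation described above can be sketched as follows, purely as an illustration. The sketch assumes that each input audio signal is a block of time-aligned samples of equal length and that mixing is plain sample-wise addition, with no gain control or clipping. The function name `mix_minus` and the terminal labels used as dictionary keys are hypothetical.

```python
from typing import Dict, List


def mix_minus(inputs: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Return one mix-minus output per terminal (hypothetical helper).

    Assumption: all input blocks are time-aligned and equally long.
    """
    lengths = {len(samples) for samples in inputs.values()}
    if len(lengths) != 1:
        raise ValueError("input blocks must be non-empty in number and equal in length")
    n = lengths.pop()

    # Step 1: combine all input signals into a common mix signal.
    common = [sum(samples[i] for samples in inputs.values()) for i in range(n)]

    # Step 2: subtract each terminal's own input from the common mix.
    return {
        terminal: [common[i] - samples[i] for i in range(n)]
        for terminal, samples in inputs.items()
    }
```

With four terminals, the output for T1 computed in this way equals the sum of the inputs from T2, T3 and T4, and only one common mix has to be formed regardless of the number of terminals.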
The A/V processor 36 creates output video signals V1OUT-V4OUT for the terminals T1-T4 respectively. In one implementation, the output video signals are all the same and represent a common conference picture. Where there are four conference sites, the A/V processor 36 may combine the several terminal video signals V1IN-V4IN to create a so-called quad split conference video signal, which represents a picture in which the four terminal pictures, represented by the four terminal video signals respectively, are displayed in respective quadrants of the conference picture. More generally, however, the output video signals may be different and depend on selections made at the respective sites. For example, the participant at site 1 (the location of terminal T1) might wish to view the picture acquired by the camera at site 3. In this case, the audio signal A1OUT is a combination of A2IN-A4IN and the video signal V1OUT is the same as V3IN.
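The quad split composition can be sketched as follows, as one possible illustration. The sketch assumes that each terminal picture is a synchronized frame held as rows of pixel values, that all four frames have the same even dimensions, and that each picture is reduced to quadrant size by discarding every other row and column. Neither the scaling method nor the assignment of terminals to quadrants is specified in the description above, and the names `downscale_by_two` and `quad_split` are hypothetical.

```python
from typing import List

Frame = List[List[int]]  # rows of pixel values; an assumed representation


def downscale_by_two(frame: Frame) -> Frame:
    """Halve each dimension by keeping every other row and column (assumed method)."""
    return [row[::2] for row in frame[::2]]


def quad_split(v1: Frame, v2: Frame, v3: Frame, v4: Frame) -> Frame:
    """Place four terminal pictures in the quadrants of one conference picture."""
    frames = [v1, v2, v3, v4]
    height, width = len(v1), len(v1[0])
    if height % 2 or width % 2:
        raise ValueError("frame dimensions must be even")
    for frame in frames:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all frames must have the same dimensions")

    # Assumed layout: v1 top-left, v2 top-right, v3 bottom-left, v4 bottom-right.
    top_left, top_right, bottom_left, bottom_right = (
        downscale_by_two(frame) for frame in frames
    )
    top = [left + right for left, right in zip(top_left, top_right)]
    bottom = [left + right for left, right in zip(bottom_left, bottom_right)]
    return top + bottom
```

The composed picture has the same dimensions as each input picture, so the same conference video signal can be routed to every encoder when all sites view the common conference picture.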
Each of the encoders 40 compresses the audio and video signals for the corresponding terminal and outputs a standard audio-video MPEG transport stream. The network interface driver 24 of the terminal T1, for example, sends out requests for AV IP packets, and the network interface driver 28 of the MCU responds to a packet request by sending AV IP packets from the encoder 40₁ onto the network, and the network routes the packets to terminal T1. The network interface driver 24 of the terminal T1 receives the AV IP packets from the network and supplies the corresponding MPEG transport stream to the decoder 20, which decompresses the MPEG transport stream to generate the video signal V1OUT and an audio signal derived from the signals A2IN-A4IN received by the MCU from terminals T2, T3 and T4. The picture represented by the video signal V1OUT is displayed on the monitor 14 at the terminal T1 and the audio signal is played back through the loudspeaker 6.
It will be appreciated from the foregoing brief description of one implementation of video conferencing that the conventional hub-and-spoke system requires that substantial audio and video processing be performed at the MCU. In the example just discussed, for instance, it is necessary to synchronize the four terminal video signals at the MCU in order to combine the terminal video signals, and it is also necessary to synchronize the terminal audio signals with the corresponding terminal video signals in order to preserve lip sync. Further, since the MCU processes the audio and video signals that are acquired at the different conference sites, the MCU must include a CODEC for each conference site. Thus, for each conference site there must be both a site CODEC in the terminal and a central CODEC in the MCU. Moreover, the network connection to the MCU must have sufficient bandwidth to accommodate all the terminal MPEG transport streams, which may place a practical limit on the number of conference participants.
In addition, the conventional implementation places control over the conference picture in the hands of whoever controls the MCU, which might not always be optimal.
Further, because of imperfections in echo cancellation, the mix-minus technique described above in connection with FIG. 3 might not produce an audio signal that sounds natural on playback.