Telepresence is an advanced remote video conference system, favoured by high-end users for the sense of actual presence it provides. In a telepresence system, sound-based localization, life-size display, and eye contact directly determine whether a user has an immersive experience, and they are therefore very important technical indicators for evaluating the telepresence system.
In a conventional video conference system, each conference site has only one video conference terminal, which encodes and transmits at least one audio or video stream and, apart from a secondary video stream, receives, decodes and outputs at least one audio or video stream. Since there is only one sound input source and one sound output, the user cannot perceive the direction from which a sound is emitted in the conference site. Moreover, since there is only one video input source and one video output, the picture captured and encoded at the local end needs to cover the overall picture of the conference site. In a multipoint conference, only the picture of a single conference site or a stitched picture of a plurality of remote conference sites can be selected for display, so the video transmitted or received cannot meet the requirement for displaying a life-size object.
In a telepresence conference system, a single conference site may have a plurality of audio or video input and output devices. In a multi-screen conference site, each screen displays the picture of the agent participants at one position, and the agent participants at each position correspond to one audio input. By means of the azimuth information of the audio and the directional, region-based capture of a professional camera, sound-based localization and life-size display can be achieved, and a realistic eye-contact effect can further be achieved.
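The correspondence described above (one screen and one audio input per position, with an azimuth attached to the audio) can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than part of any described system: the names `Position`, `build_site` and `locate_sound`, the identifier formats, and the choice to spread positions evenly over a 90-degree arc.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Position:
    """One position in a multi-screen conference site (hypothetical model)."""
    index: int            # left-to-right order of the position
    screen_id: str        # screen that displays this position's picture
    audio_input_id: str   # audio input paired with this position
    azimuth_deg: float    # assumed direction of the position, 0 = centre


def build_site(num_screens: int, span_deg: float = 90.0) -> List[Position]:
    """Pair each screen with one audio input and an evenly spaced azimuth."""
    site = []
    for i in range(num_screens):
        # Centre of the i-th slice of the arc, measured from the site centre.
        azimuth = -span_deg / 2 + span_deg * (i + 0.5) / num_screens
        site.append(Position(i, f"screen-{i}", f"mic-{i}", azimuth))
    return site


def locate_sound(site: List[Position], audio_input_id: str) -> Position:
    """Return the position whose audio input is active, so that the sound
    can be rendered from the direction of the matching screen."""
    for position in site:
        if position.audio_input_id == audio_input_id:
            return position
    raise KeyError(f"unknown audio input: {audio_input_id}")


site = build_site(3)
speaker = locate_sound(site, "mic-2")
print(speaker.screen_id, speaker.azimuth_deg)
```

With three screens, audio arriving on the hypothetical input `mic-2` is associated with `screen-2`, which is the kind of pairing that lets a listener hear a voice from the side of the room where the speaker's picture appears.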
However, existing telepresence systems have typically evolved from conventional video conference systems, and a multi-screen conference site is composed of a plurality of video conference terminals and a plurality of audio-video peripheral devices. The video conference terminals of a conference site each establish signaling connections and media logical channels with remote endpoints (which may be video conference terminals or multipoint control units (MCUs)), transfer audio-video streams between pairs of endpoints, and output the plurality of streams through a loudspeaker and a display device that are separate from each other. This mode of operation is cumbersome: a plurality of video conference terminals are required in one conference site to process signaling, and each terminal occupies its own IP address, endpoint ID (such as an H.323 ID), or conference number. The terminals also lack a mechanism for processing information (such as agent information) among themselves, and synchronization between the multiple streams is very difficult, all of which affects user experience.