Telepresence is an advanced teleconferencing system that is popular among high-end users because it provides a true sense of on-scene presence. In a telepresence system, auditory positioning, life-size imaging, and eye contact directly determine whether users have an immersive experience, and are therefore key technical indicators for evaluating the system. In a traditional video conference system, the sound heard in each meeting room is the mixed, superimposed sound from the several loudest meeting rooms in the conference, and each meeting room has only one sound input source and one sound output, so users cannot sense from which direction in the meeting room the sound comes.
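The traditional single-stream mixing described above can be illustrated with a minimal sketch. The function names, the energy-based loudness measure, the default of three rooms, and the clipping to [-1.0, 1.0] are all assumptions made for illustration; the text does not specify them, nor whether a room's own sound is excluded from the mix it receives.

```python
from typing import Dict, List


def frame_energy(frame: List[float]) -> float:
    """Sum of squared samples, used here as an assumed loudness measure."""
    return sum(s * s for s in frame)


def mix_loudest(frames: Dict[str, List[float]], n_loudest: int = 3) -> List[float]:
    """Mix the n loudest rooms into one mono frame (hypothetical helper).

    `frames` maps a room name to that room's single audio frame.
    Every room would receive this same mono mix, so no direction is conveyed.
    """
    if not frames or n_loudest <= 0:
        return []
    # Pick the rooms with the highest frame energy.
    loudest = sorted(frames, key=lambda r: frame_energy(frames[r]), reverse=True)[:n_loudest]
    length = max(len(frames[r]) for r in loudest)
    mixed = [0.0] * length
    # Superimpose the selected rooms sample by sample.
    for room in loudest:
        for i, sample in enumerate(frames[room]):
            mixed[i] += sample
    # Clip to the nominal range to avoid overflow (assumed sample range).
    return [max(-1.0, min(1.0, s)) for s in mixed]
```

Because the output is a single stream, a listener has no cue as to where the speaker is located, which is the limitation noted above.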
In a telepresence conferencing system, each meeting room is either a single-screen meeting room or a multi-screen meeting room. In a multi-screen meeting room, each screen shows an image of the conferees within one spatial area, and each such area corresponds to one stream of audio input. To achieve auditory positioning in a multi-screen meeting room, sound must be issued from the direction of the screen that shows the image of the speaker; that is, the sound is made to follow the image. For example, in a three-screen meeting room, if a speaker seated on the left speaks, the conferees should hear the sound from the left side; if a speaker seated in the middle speaks, they should hear it from the middle; and if a speaker seated on the right speaks, they should hear it from the right side.
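The "sound follows the image" rule for a three-screen meeting room can be sketched as follows. This is only an illustration under stated assumptions: the position names, the `route_by_screen` function, and the representation of audio frames as lists of floats are hypothetical and not taken from the text.

```python
from typing import Dict, List

# Assumed layout of a three-screen meeting room: one loudspeaker per screen.
POSITIONS = ("left", "middle", "right")


def route_by_screen(streams: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Route each per-screen audio stream to the loudspeaker at the same position.

    `streams` maps a screen position to the audio frame captured for the
    spatial area shown on that screen (hypothetical data model).
    """
    frame_len = max((len(s) for s in streams.values()), default=0)
    # Start with silence on every output channel.
    outputs = {pos: [0.0] * frame_len for pos in POSITIONS}
    for pos, samples in streams.items():
        if pos not in outputs:
            raise ValueError(f"unknown screen position: {pos!r}")
        # The stream is played only from the side where the speaker's image appears.
        for i, sample in enumerate(samples):
            outputs[pos][i] += sample
    return outputs
```

For instance, `route_by_screen({"left": [0.5, 0.25]})` leaves the middle and right channels silent, so the conferees hear the speaker from the left side.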
In this case, audio inputs and outputs from different directions need to be handled and mixed differently, and a traditional single-stream audio mixing method cannot meet this requirement. Meanwhile, in a multipoint conference in which single-screen and multi-screen meeting rooms intercommunicate, the problem of how to mix and output sound from both kinds of meeting room without affecting auditory positioning in either also needs to be solved. Further, if the multiple streams are transmitted separately, it is very difficult to keep them strictly synchronized and thus to meet the audio synchronization requirement of a video conference.