As televisions have become widespread, users increasingly demand larger television screens. Some video communication systems even use a projector or a television wall as the display. In this case, if one picture is synthesized from at least two sub-pictures, the positions of speakers in different sub-pictures are far apart when the screen is large. However, in current multimedia communication systems, the position from which a sound appears to come does not change with the position of the speaker. This results in a mismatch between the position information of the sound and the sub-picture, and thus reduces the sense of reality of video communication.
In the conventional art, a video conference system includes devices such as a Multipoint Control Unit (MCU), a single-channel terminal, and a multi-channel terminal having at least two sound channels. After the terminals and the MCU are connected, each terminal reports its loudspeaker configuration, including the positions and number of its loudspeakers, to the MCU. The MCU allocates a number of sound channels to each terminal according to that terminal's loudspeaker configuration. For example, if the terminal has only one loudspeaker, a single sound channel is allocated. If the terminal has two loudspeakers, two sound channels are allocated. If the terminal has four loudspeakers, four sound channels are allocated. During the conference, the MCU receives video streams and audio streams from each terminal, combines the video streams into one multi-picture, and sends the multi-picture to the terminals. The audio streams for each terminal are generated according to that terminal's sound channel configuration. For example, a terminal 1 has four sound channels, so four audio streams are generated for the terminal 1, and each audio stream corresponds to one loudspeaker of the terminal 1. The audio streams are usually generated by adjusting amplitude and time delay. After such processing, a listener at the terminal 1 perceives the sound as coming from the position of the speaker in the picture, and thus a sense of the position information of the sound is produced.
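The conventional scheme described above can be illustrated with a minimal sketch. It assumes positions are horizontal coordinates between 0 and 1 across the picture, and it uses a simple linear rule for gain and delay. The function names, the parameter `max_delay_samples`, and the linear rule are all hypothetical choices made for illustration; the source states only that amplitude and time delay are adjusted.

```python
from typing import List


def allocate_channels(num_loudspeakers: int) -> int:
    """Allocate one sound channel per reported loudspeaker (1 -> 1, 2 -> 2, 4 -> 4)."""
    if num_loudspeakers < 1:
        raise ValueError("a terminal must report at least one loudspeaker")
    return num_loudspeakers


def render_streams(mono: List[float],
                   loudspeaker_x: List[float],
                   talker_x: float,
                   max_delay_samples: int = 8) -> List[List[float]]:
    """Generate one audio stream per loudspeaker from a mono source.

    Assumption (not from the source): loudspeakers closer to the talker's
    on-screen position get a higher gain and a shorter delay, both varying
    linearly with horizontal distance in the range 0.0 to 1.0.
    """
    streams = []
    for x in loudspeaker_x:
        distance = abs(x - talker_x)
        gain = 1.0 - distance                         # amplitude adjustment
        delay = round(distance * max_delay_samples)   # time-delay adjustment
        padding = max_delay_samples - delay           # keep all streams equal length
        streams.append([0.0] * delay + [gain * s for s in mono] + [0.0] * padding)
    return streams


# A terminal reporting two loudspeakers, at the left and right picture edges.
channels = allocate_channels(2)
streams = render_streams([1.0, 0.5], [0.0, 1.0], talker_x=0.25, max_delay_samples=4)
```

Note that the MCU in this sketch needs `loudspeaker_x` for every terminal before it can produce any stream, so the stream generation depends directly on each terminal's reported configuration.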
In researching and implementing the conventional art, the inventor found that it has at least the following problem. The MCU has to learn the loudspeaker configuration of each terminal before it can generate a number of audio streams matching the number of loudspeakers. As a result, the MCU and the terminals become too tightly coupled, which limits flexibility.