With the development of communications technologies, people can now communicate in real time with conference participants at other conference sites through a video conference system. A typical video conference system consists of a multipoint control unit (MCU) and terminal devices. Generally, each site is configured with one terminal device, and one MCU is connected to multiple sites. The terminal device collects sounds and images in its site, processes them, and transmits them through a network to the MCU connected to the terminal device; at the same time, the terminal device receives data of other sites sent by that MCU. The MCU sends, to the terminal device, audio signals received from other sites. However, limited by device cost and bandwidth, in the prior art the MCU does not send the audio signals of all other sites to the terminal device; instead, the MCU selects some of the audio signals according to a certain method, performs audio mixing on them, and then sends the mixed audio signal to the terminal device.
In one prior-art method, an MCU receives audio signals from all sites, selects a predetermined number of sites from all the sites in descending order of volume, and performs audio mixing on the audio signals of the selected sites. In this case, even if the main sound source objects are concentrated in one site, audio streams of other, unnecessary sites still have to be mixed in to reach the predetermined number. Because too many unnecessary sites are involved in the audio mixing, sound quality after the mixing is degraded and computing resources are wasted.
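For illustration only, the volume-based selection described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular MCU: the function names (`rms`, `select_loudest_sites`, `mix_for_terminal`), the use of RMS level as the measure of "volume", the 16-bit PCM sample range, and mixing by simple summation with clipping are all assumptions introduced here.

```python
import math

def rms(samples):
    """Root-mean-square level of one frame; assumed here as the 'volume' of a site."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def select_loudest_sites(frames, max_sites):
    """Return the IDs of the max_sites loudest sites, loudest first."""
    ranked = sorted(frames, key=lambda site_id: rms(frames[site_id]), reverse=True)
    return ranked[:max_sites]

def mix_for_terminal(frames, receiver_id, max_sites):
    """Mix the loudest max_sites sites, excluding the receiving site itself.

    frames maps a site ID to one frame of 16-bit PCM samples (hypothetical layout).
    Returns (mixed_samples, chosen_site_ids).
    """
    others = {sid: f for sid, f in frames.items() if sid != receiver_id}
    chosen = select_loudest_sites(others, max_sites)
    if not chosen:
        return [], chosen
    length = min(len(others[sid]) for sid in chosen)
    mixed = []
    for i in range(length):
        total = sum(others[sid][i] for sid in chosen)
        # Clip to the 16-bit range so the summed signal does not overflow.
        mixed.append(max(-32768, min(32767, total)))
    return mixed, chosen

# Hypothetical frames: site A carries the main talker, B is almost silent.
frames = {
    "A": [1000, -1000, 1000, -1000],
    "B": [10, -10, 10, -10],
    "C": [500, -500, 500, -500],
    "D": [0, 0, 0, 0],
}
mixed, chosen = mix_for_terminal(frames, receiver_id="D", max_sites=3)
```

With `max_sites=3`, the nearly silent site B is still selected and mixed purely because the predetermined count has to be filled, which is the drawback noted above: a fixed site count pulls in unnecessary streams regardless of where the main sound sources actually are.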