Currently, videoconference products and some conference call products primarily comply with ITU-T H.323 or ITU-T H.320 for audio processing. The device that implements core audio switching and controls multiple conference terminals is the Multipoint Control Unit (MCU). The MCU provides at least a Multipoint Control (MC) function and a Multipoint Processing (MP) function, and can mix multiple channels of audio data. For example, in a conference call, the terminals of at least three sites communicate through the MCU simultaneously. Therefore, the MCU needs to mix the sounds sent by all the terminals into one channel and send it to the terminal of each site. In this way, the terminal users of all sites can communicate as if they were in the same conference room although they are in different locations.
Taking conference audio processing as an example, the prior-art procedure for processing audio when multiple terminals communicate is shown in FIG. 1:
Step 101: On the MCU, an audio codec port is allocated to the terminal that accesses from each site.
Step 102: After the call is initiated, each terminal sends its coded audio data to the MCU.
Step 103: The MCU decodes the audio data sent by each terminal, and selects the audio data of the sites that produce the larger sound volume.
Step 104: The selected audio data is mixed into one channel of audio data.
Step 105: The mixed channel of audio data is encoded and then sent to each site terminal.
Step 106: The terminals on each site decode the received audio data.
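Steps 103 and 104 can be illustrated with a minimal sketch that operates on audio already decoded by the MCU. The source does not specify how volume is measured, how many sites are selected, or how samples are combined; the RMS level, the `max_mixed` parameter, the 16-bit clipping, and all function and site names below are assumptions made for illustration only. The decoding in Step 103 and the encoding in Step 105 are left out.

```python
import math

def rms(frame):
    """Root-mean-square level of one frame of PCM samples (assumed volume measure)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0

def select_loudest(frames, max_mixed):
    """Step 103, after decoding: keep the sites with the highest level.

    max_mixed is a hypothetical limit; the source does not state a number.
    """
    ranked = sorted(frames, key=lambda site: rms(frames[site]), reverse=True)
    return ranked[:max_mixed]

def mix(frames, selected):
    """Step 104: sum the selected frames sample by sample, clipped to 16 bits."""
    if not selected:
        return []
    length = min(len(frames[s]) for s in selected)
    mixed = []
    for i in range(length):
        total = sum(frames[s][i] for s in selected)
        mixed.append(max(-32768, min(32767, total)))
    return mixed

# Decoded 16-bit PCM frames, one per site (made-up sample values).
decoded = {
    "site_a": [1000, -1000, 1000, -1000],
    "site_b": [20000, -20000, 20000, -20000],
    "site_c": [15000, -15000, 15000, -15000],
}
selected = select_loudest(decoded, max_mixed=2)
mixed = mix(decoded, selected)
print(selected)  # ['site_b', 'site_c']
print(mixed)     # [32767, -32768, 32767, -32768]
```

In this sketch the quiet site is dropped before mixing, and the sum of the two loud sites exceeds the 16-bit range, so every sample is clipped. A real MCU would typically scale or otherwise limit the mix instead of clipping hard.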
In the prior art, from the time the terminal on each site sends audio data to the MCU until each site receives the mixed channel of audio data from the MCU, one audio decoding and encoding process must be performed each time the audio data passes through an MCU.
The inventor finds at least the following problems in the prior art: Each coding and decoding process increases the audio distortion from terminal to terminal. In a multi-point conference based on one MCU, the terminals on the sites perform one coding process and one decoding process, and the MCU performs another decoding and coding process for audio mixing, so that the audio is distorted twice. In a multi-point conference based on two cascaded MCUs, the terminals on the sites perform one coding and decoding process, and the two MCUs perform two further decoding and coding processes for audio mixing, so that the audio is distorted three times. By analogy, each additional MCU distorts the audio one more time. Moreover, by the same reasoning as for audio distortion, every coding and decoding process also increases the audio delay from terminal to terminal. Besides, for the site terminals that join a voice conference simultaneously, the MCU needs to allocate an audio codec port to each terminal. In particular, when there are many sites, the MCU needs to provide a large number of audio codec ports, which increases the cost of the multi-point conference.
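The counting argument above can be written out as a small sketch. It assumes, as the text does, one coding and decoding pass for the terminal pair plus one more pass per MCU on the path, and one codec port per terminal on a single MCU. The function names and the per-pass delay figure are hypothetical and are not taken from the source.

```python
def codec_passes(num_mcus):
    """Coding/decoding passes from terminal to terminal.

    One pass for the sending and receiving terminals, plus one per MCU.
    """
    return 1 + num_mcus

def codec_delay_ms(num_mcus, per_pass_delay_ms):
    """Delay contributed by coding and decoding alone.

    per_pass_delay_ms is a hypothetical figure; real values depend on the codec.
    """
    return codec_passes(num_mcus) * per_pass_delay_ms

def codec_ports_needed(num_terminals):
    """Audio codec ports a single MCU must allocate: one per terminal."""
    return num_terminals

print(codec_passes(1))          # 2  (one MCU: audio distorted twice)
print(codec_passes(2))          # 3  (two cascaded MCUs: distorted three times)
print(codec_delay_ms(2, 20.0))  # 60.0
print(codec_ports_needed(30))   # 30
```

Both the number of distortion stages and the codec delay grow linearly with the number of cascaded MCUs, and the port count grows linearly with the number of sites.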