In an ordinary video conference, a conference is generally held among conference sites under a single multipoint control unit (MCU, Multipoint Control Unit); that is, all conference sites in the conference are connected to the same MCU. However, as conference capacity increases or networking becomes more complex, a cascade conference is required. In a cascade conference, not only do the conference sites under each MCU participate, but the conferences on multiple MCUs are also joined into one conference through cascade conference sites between the MCUs, so that conference sites on multiple MCUs can hold a conference together. For example, a certain system needs to hold a nationwide conference, and the system has MCUs and conference sites distributed across Beijing, each provincial capital, each prefecture-level city, and each county. To hold the nationwide cascade conference, MCUs are arranged in Beijing, each provincial capital, and each prefecture-level city, respectively, and each conference site is connected to the MCU to which it belongs. Because the participating conference sites are numerous and dispersed at different places, the cascade conference requires each conference site to connect only to its nearest MCU, thereby reducing demands on the network.
A specific example is used in the following to describe an existing method for processing cascade conference sites in a cascade conference. As shown in FIG. 1, in a cascade conference that includes telepresence conference sites, an MCU1 is connected to three conference sites: telepresence conference sites T1 and T3 and an ordinary conference site T2. The telepresence conference site T1 includes three screens, T1L, T1C, and T1R, and the telepresence conference site T3 includes three screens, T3L, T3C, and T3R. An MCU2 is likewise connected to three conference sites: telepresence conference sites T4 and T6 and an ordinary conference site T5. The telepresence conference site T4 includes three screens, T4L, T4C, and T4R, and the telepresence conference site T6 includes three screens, T6L, T6C, and T6R.
It is assumed that each MCU supports retaining the audio data of the two loudest parties; that is, from all connected conference sites (including ordinary conference sites, telepresence conference sites, and cascade conference sites), the MCU selects the audio data of at most the two conference sites with the loudest voices and performs sound mixing on it. If fewer than two conference sites are connected, the MCU performs sound mixing on the audio data of all connected conference sites.
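The selection rule above can be illustrated with a minimal sketch. It is not taken from any MCU implementation: the names `loudness`, `select_loudest`, and `mix` are hypothetical, mean absolute amplitude is an assumed stand-in for whatever level measure a real MCU uses, and each conference site is simplified to a single mono frame of equal length.

```python
from typing import Dict, List

Frame = List[float]  # one mono audio frame; all frames are assumed equal length


def loudness(frame: Frame) -> float:
    """Mean absolute amplitude (an assumed stand-in for a real level measure)."""
    return sum(abs(s) for s in frame) / len(frame) if frame else 0.0


def select_loudest(sites: Dict[str, Frame], max_parties: int = 2) -> List[str]:
    """Return up to max_parties site names, loudest first."""
    ranked = sorted(sites, key=lambda name: loudness(sites[name]), reverse=True)
    return ranked[:max_parties]


def mix(sites: Dict[str, Frame], selected: List[str]) -> Frame:
    """Sample-wise sum of the selected sites' frames."""
    return [sum(samples) for samples in zip(*(sites[name] for name in selected))]


# Hypothetical frames for the three sites connected to MCU1.
mcu1_inputs = {"T1": [0.75, -0.5], "T2": [0.5, 0.5], "T3": [0.125, 0.0]}
chosen = select_loudest(mcu1_inputs)
mixed = mix(mcu1_inputs, chosen)
```

With fewer than two sites connected, `select_loudest` simply returns every site it was given, which matches the fallback described in the text.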
When the MCU1 and the MCU2 hold a conference in a cascade, the cascade audio channel is denoted T12, and it is assumed that the cascade video channel carries the video code stream of the middle screen T1C of the conference site T1. The conference sound mixing proceeds as follows. On the MCU1, it is assumed that the cascade channel carries the sound mixing of the two loudest parties and that these two parties are the conference sites T1 and T2; the sound mixing code stream output by the MCU1 to the MCU2 through the cascade audio channel is therefore T12=T1+T2. On the MCU2, it is assumed that the two loudest parties are the cascade conference site T12 and the conference site T5. If the images displayed by the telepresence conference site T4 are T1C, T5, and T6R, and the images displayed by the telepresence conference site T6 are T4L, T1C, and T5, the voices heard in the conference sites T4 and T6 are as follows:
The voice heard in the conference site T4 is T12+T5, namely, T1+T2+T5. Because T4 is a telepresence conference site, its three screens display the middle screen T1C of the conference site T1, the conference site T5, and the right screen T6R of the conference site T6, respectively. A user in T4 expects the orientation of each heard voice to correspond to the orientation of the seen image; namely, the voice of T1 is heard at the left, the voice of T5 is heard in the middle, and the voice of T6 is heard at the right. However, the voice of each conference site has its own orientation, which is not necessarily consistent with the orientation at which the image is displayed. The MCU2 therefore needs to process the voice heard in T4: it performs orientation adjustment on the voice of each conference site so that the voice is moved to the orientation of the corresponding image, and then performs sound mixing and outputs the result to the conference site T4. In this way, the orientation of the voice heard in T4 can correspond to the orientation of the image.
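The two-level mixing chain in this example (T12=T1+T2 on the MCU1, then T12+T5 on the MCU2) can be sketched as follows. The sample values and the `mix` helper are hypothetical, and each site is again reduced to one mono frame; the point is only that what T4 receives equals T1+T2+T5, with T1 and T2 already combined before they reach the MCU2.

```python
from typing import List

Frame = List[float]  # one mono audio frame; equal lengths are assumed


def mix(*frames: Frame) -> Frame:
    """Sample-wise sum of any number of equal-length frames."""
    return [sum(samples) for samples in zip(*frames)]


# Hypothetical frames for the sites named in the example.
t1 = [0.5, 0.25]
t2 = [0.25, -0.25]
t5 = [-0.5, 0.125]

t12 = mix(t1, t2)           # what MCU1 sends over the cascade audio channel
heard_in_t4 = mix(t12, t5)  # what MCU2 mixes for T4 before any orientation step
```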
The telepresence conference site T6 has the same problem as T4: the orientation of the voice (T12+T5) heard in T6 also needs to be adjusted so that it corresponds to the orientation of the seen image. For the conference sites T4, T5, and T6, because these three conference sites are directly connected to the MCU2, the MCU2 may directly process their audio data separately, so as to adapt to the orientation adjustment required by the conference sites T4 and T6, respectively.
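For directly connected sites, the "adjust orientation, then mix" step can be sketched as below. This is only an illustration under stated assumptions: a simple linear two-channel pan law stands in for whatever orientation processing a real MCU applies, the screen-to-position table and all function names are hypothetical, and the source frames are made-up mono samples.

```python
from typing import Dict, List, Tuple

Frame = List[float]                # mono source frame
Stereo = List[Tuple[float, float]]  # (left, right) sample pairs

# Assumed mapping from the screen showing a site's image to a pan position.
SCREEN_POSITION = {"left": -1.0, "middle": 0.0, "right": 1.0}


def pan(frame: Frame, position: float) -> Stereo:
    """Linear pan: position -1.0 is fully left, +1.0 is fully right."""
    left_gain = (1.0 - position) / 2.0
    right_gain = (1.0 + position) / 2.0
    return [(s * left_gain, s * right_gain) for s in frame]


def mix_stereo(streams: List[Stereo]) -> Stereo:
    """Sample-wise sum of stereo streams."""
    return [
        (sum(l for l, _ in pairs), sum(r for _, r in pairs))
        for pairs in zip(*streams)
    ]


def render_for_site(sources: Dict[str, Frame], layout: Dict[str, str]) -> Stereo:
    """Pan each separately available source to its screen, then mix."""
    panned = [pan(frame, SCREEN_POSITION[layout[name]]) for name, frame in sources.items()]
    return mix_stereo(panned)


# T5 and T6 are directly connected to MCU2, so their audio is available separately.
sources = {"T5": [0.5, 0.25], "T6": [0.25, -0.5]}
out_for_t4 = render_for_site(sources, {"T5": "middle", "T6": "right"})
```

Because each source reaches the MCU2 as its own stream, a different `layout` can be applied per receiving site without one source's position affecting another's.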
In the foregoing prior-art solution, T12 is a cascade conference site whose audio data is the result of sound mixing on the previous-level MCU, namely, the sum of the audio data of the conference sites T1 and T2. Both T4 and T6 display the image T1C of the conference site T1, but at different locations. If the audio orientation of T1 is adjusted according to the location where the image is displayed in each conference site, the voice orientation of T2 is adjusted at the same time, because the data of T1 and the data of T2 cannot be separated. Because the image of T1 is seen at different orientations in the two conference sites, the voice of T2 is inevitably heard at different orientations in T4 and T6. As a result, one-to-one correspondence between the image orientation and the voice orientation of each conference site in the cascade conference cannot be achieved.
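The inseparability problem can be made concrete with a small sketch. It assumes the same hypothetical linear pan law as a stand-in for orientation adjustment, made-up mono samples, and a layout in which T1C is on the left screen of T4 and the middle screen of T6. Panning the mixed stream T12 is numerically identical to panning T1 and T2 to the same position, so T2 follows T1 wherever T1's image happens to be.

```python
from typing import List, Tuple

Frame = List[float]
Stereo = List[Tuple[float, float]]


def pan(frame: Frame, position: float) -> Stereo:
    """Linear pan: position -1.0 is fully left, +1.0 is fully right."""
    left_gain = (1.0 - position) / 2.0
    right_gain = (1.0 + position) / 2.0
    return [(s * left_gain, s * right_gain) for s in frame]


def add_stereo(a: Stereo, b: Stereo) -> Stereo:
    """Sample-wise sum of two stereo streams."""
    return [(al + bl, ar + br) for (al, ar), (bl, br) in zip(a, b)]


t1 = [0.5, 0.25]
t2 = [0.25, -0.25]
t12 = [x + y for x, y in zip(t1, t2)]  # the only stream MCU2 receives

T1C_IN_T4 = -1.0  # T1C shown on the left screen of T4 (assumed layout)
T1C_IN_T6 = 0.0   # T1C shown on the middle screen of T6 (assumed layout)

# MCU2 can only pan the mix as a whole, following T1's image position.
t12_for_t4 = pan(t12, T1C_IN_T4)
t12_for_t6 = pan(t12, T1C_IN_T6)

# The T2 component hidden inside each panned mix:
t2_in_t4 = pan(t2, T1C_IN_T4)
t2_in_t6 = pan(t2, T1C_IN_T6)
```

Here `t2_in_t4` places T2 fully left while `t2_in_t6` places it in the middle, even though nothing about T2's own image motivated either choice.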
It can be seen from the foregoing descriptions that the audio data of a cascade conference site is the result of sound mixing on the previous-level MCU, and that the voice orientation of this audio data is usually not consistent with the location where the image of the conference site is displayed. Because the audio data is a result of sound mixing, it cannot be separated. Consequently, when the audio orientation is adjusted for different display screens, the audio corresponding to a given display screen cannot be adjusted individually; instead, the whole result of sound mixing is adjusted in a unified manner, so that audio whose orientation should not be adjusted is adjusted as well. One-to-one correspondence between the image orientation and the voice orientation of each conference site in the cascade conference therefore cannot be implemented, which degrades the user experience of participants.