In recent years, with the development of communication technologies, video conferences have found broad application. A video conference commonly refers to a TV conference service, in which a conference is held through multimedia communication means, using television equipment and communication networks, so as to provide simultaneous interaction of images, voice and data between two or more geographical locations. As shown in FIG. 1, a video conference system is generally composed of video terminals (i.e., conference terminals), a transmission network and a multipoint control unit (MCU).
Video terminal equipment mainly includes video input/output equipment, audio input/output equipment, a video codec, an audio codec, information communication equipment, and multiplexing/signal distributing equipment. The basic function of the video terminals is to perform compression coding on the image signals shot by local cameras and the sound signals captured by microphones, and to transmit the coded signals to a remote conference site through the transmission network. At the same time, the video terminals receive signals from the remote conference site and, after decoding, restore them to analog image and sound signals. The processing of the audio signals is shown in FIG. 2 and is described below.
In order to form a complete TV conference system, the video terminal equipment and the MCU have to be connected together through the communication network. The transmission channels may take the form of optical fibers, electric cables, microwave links or satellite links.
The MCU is the control core of the video conference. When more than two conference terminals participate in the conference, control through the MCU is necessary. All conference terminals need to be connected to the MCU through standard interfaces. The MCU is implemented according to protocols such as the international standards H.221 and H.245. The main functions of the MCU are to mix and exchange images and voice and to control all conference sites.
The MCU processes the audio data to provide sound mixing for multipoint conference sites. The conference sites participating in the sound mixing are those with the highest volumes among the multipoint conference sites. If three-point sound mixing is to be realized, the conference sites participating in the sound mixing are the three conference sites with the largest volumes. The sound mixing policy is as follows.
1) When a speech is given from one conference site, the speaker at that conference site does not hear his or her own voice, while participants at all other conference sites hear the voice of the speaking conference site.
2) When speeches are given from two conference sites, the speakers at the two speaking conference sites hear each other's voices but not their own, while participants at all other conference sites simultaneously hear the voices of both speaking conference sites.
3) When speeches are given from three or more conference sites, the three conference sites having the largest volumes participate in the sound mixing. As shown in FIG. 3, T1, T2 and T3 are the three conference sites having the largest sound volumes among the current conference sites. The speaker at any one of these three conference sites hears the voices of the other two; for example, the speaker at conference site T1 hears the voices from conference sites T2 and T3. Participants at all the other conference sites simultaneously hear the voices from all three conference sites.
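The three cases of the sound mixing policy above can be summarized by a single rule: select up to three of the loudest speaking conference sites, and send each conference site the mix of the selected sites excluding itself. The following Python fragment is a minimal illustrative sketch of that rule only, not an implementation of any actual MCU. The function names, the representation of volumes as a dictionary, and the convention that a volume of zero means "not speaking" are all assumptions made for illustration.

```python
from typing import Dict, List

# Three-point sound mixing, as in the policy described above.
MAX_MIXED_SITES = 3


def select_mixed_sites(volumes: Dict[str, float],
                       limit: int = MAX_MIXED_SITES) -> List[str]:
    """Return up to `limit` speaking sites, loudest first.

    Assumption: a site counts as speaking when its volume is above zero.
    """
    speaking = [site for site, vol in volumes.items() if vol > 0]
    speaking.sort(key=lambda site: volumes[site], reverse=True)
    return speaking[:limit]


def build_mix_plan(volumes: Dict[str, float]) -> Dict[str, List[str]]:
    """Map every site to the list of sites whose voices it receives.

    A site in the mix hears the other mixed sites but never itself;
    a site outside the mix hears all mixed sites.
    """
    mixed = select_mixed_sites(volumes)
    return {site: [src for src in mixed if src != site] for site in volumes}


if __name__ == "__main__":
    # Hypothetical volumes for five conference sites; T5 is silent.
    volumes = {"T1": 0.9, "T2": 0.7, "T3": 0.5, "T4": 0.2, "T5": 0.0}
    for site, sources in build_mix_plan(volumes).items():
        print(site, "hears", sources)
```

In this sketch, T1 receives the voices of T2 and T3 only, whereas T4 and T5, which are outside the mix, receive the voices of T1, T2 and T3, matching the situation of FIG. 3.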
When a conference is held using a current video conference system, the processing of sound by the conference system is as shown in FIG. 2. The sound data of the speaking conference site is encoded and sent to the MCU. The MCU performs sound mixing on the sound data of the speaking conference site and sends the processed sound data to the other conference terminals, and the participants at the other terminals hear the sound of the speaking conference site after the sound data is decoded. In the current video conference system, the MCU and the video terminals process the sound data of a given conference site as a single flow of data. When only one language is spoken in the entire conference system, communication among multiple conference sites proceeds smoothly. But when two or more languages are spoken in the entire conference system, language barriers may arise among the participants. To solve this problem, the conventional art utilizes the following two solutions. The following description takes as an example a conference mixing Chinese and English, in which the participants at one of the multiple conference sites speak in English.
In the first method, each of the Chinese conference sites is allocated its own interpreter, who interprets English into Chinese for that conference site so that its participants can understand what is spoken at the English conference site. As a result, if the conference scale is relatively large, many interpreters are required, which wastes personnel. Moreover, when the interpreter at each Chinese conference site is interpreting, the microphone delivers the interpreted speech to the other conference sites, so the conference may become chaotic, and the method is not feasible in practice. If the microphone is set not to deliver the interpreted speech to the other conference sites, the speaking conference site does not know whether the interpretation at the Chinese conference sites has been completed and cannot adjust its own speaking speed accordingly, resulting in poor quality of the entire conference.
In the second method, one conference site is assigned as a dedicated interpreting terminal to interpret the speech of the conference sites participating in the sound mixing. This solution also has disadvantages. If English is spoken at a conference site and the interpreting terminal interprets the English into Chinese, participants at every conference site hear the English first, followed by the Chinese. In fact, participants at the Chinese conference sites do not need to hear the English speech, and participants at the English conference sites do not need to hear the interpreted Chinese speech. The participants are thus forced to hear much undesired information. The mixing of Chinese and English disrupts the conference, and the participants quickly become fatigued. In addition, the interpretation slows down the pace of the conference and reduces its efficiency.
In the case where three or more languages are spoken in a conference, and the languages are spoken concurrently at multiple conference sites, the above two solutions, combined with the effect of sound mixing, result in poor conference quality and are not practical.