In point-to-point speech communication, speech is exchanged directly between the two terminals, so that each terminal hears the voice from the other. In multi-point communication (i.e. communication in which more than two terminals participate simultaneously), for example in a conference telephone system or a video conference system, such simple point-to-point switching is not sufficient. Because multiple terminals may participate in the same communication at the same time, an auxiliary network-side device, i.e. a speech-switching device, is needed to switch speech between the terminals so that speech can be exchanged freely among all of them. Speech switching between multiple terminals is generally based on the following principles:
1) Each terminal can hear the voice from the other terminals, and speech exchange can be implemented conveniently and freely;
2) A terminal should not hear its own voice;
3) To avoid speech distortion, each terminal is generally permitted to hear only the few terminals whose voices are loudest among the other terminals.
In general, because speech switching between multiple terminals is realized on the network side, it is handled centrally by the speech-switching device. The speech-switching device receives the encoded speech data from each terminal, mixes the speech, and outputs the speech-mixed encoded speech data. Refer to FIG. 1, which is a diagram illustrating the whole procedure of speech switching between multiple terminals according to the prior art. Terminal 1, terminal 2, . . . , terminal N each input their own encoded speech data into a speech-switching device 10 provided at the network side. The speech-switching device 10 decodes the encoded speech data from each terminal, selects the decoded speech data with relatively large speech energy (i.e. the voices of the several loudest terminals mentioned above), applies a different encoding process to the selected data for each destination, and then transmits the data to the respective terminals. For example, suppose that after decoding the data and calculating its speech energy, the speech-switching device 10 shown in the figure selects terminal 1 and terminal 2 as the current terminals with relatively loud voice. According to the speech-switching principles above, the speech-switching device 10 encodes the decoded speech data of terminal 2 and transmits it to terminal 1, so that terminal 1 hears only the voice of terminal 2. Likewise, it encodes the decoded speech data of terminal 1 and transmits it to terminal 2, so that terminal 2 hears only the voice of terminal 1. In addition, the speech-switching device 10 mixes the decoded speech data of terminal 1 and terminal 2, encodes the mixed data, and transmits the speech-mixed encoded speech data to terminal 3, . . . , terminal N, so that terminal 3, . . . , terminal N hear the voices of terminal 1 and terminal 2 at the same time. With this speech-switching process, multiple terminals participating in the same communication can exchange voice freely with each other, thus achieving multi-point speech communication.
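The routing rule of FIG. 1 can be sketched as a small lookup from each terminal to the terminals it hears. This is a minimal illustration only: the function name `sources_heard_by` and the use of integer terminal identifiers are assumptions made for the example and are not part of the described device.

```python
def sources_heard_by(terminal_ids, loudest):
    """Map each terminal to the set of terminals whose speech it receives.

    A selected (loud) terminal hears the other selected terminals but not
    itself; every unselected terminal hears all selected terminals.
    """
    loudest = set(loudest)
    # Removing t from the set has no effect when t is not a selected terminal.
    return {t: loudest - {t} for t in terminal_ids}


# Four terminals, with terminal 1 and terminal 2 currently selected as loudest.
routing = sources_heard_by(terminal_ids=[1, 2, 3, 4], loudest=[1, 2])
for terminal, sources in sorted(routing.items()):
    print(terminal, sorted(sources))
```

In this sketch terminal 1 is routed only the speech of terminal 2, terminal 2 only that of terminal 1, and terminals 3 and 4 the mix of both.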
At present, both the conference telephone system and the video conference system are communication systems supporting multi-point speech communication. In these systems, the specific method by which the speech-switching device on the network side switches speech between multiple terminals includes the following steps:
(1) the encoded speech data of each terminal attending the same conference is fully decoded in real time, and the speech energy of each terminal is calculated in real time according to the decoded speech data of each terminal; wherein the general formula for calculating the speech energy of each terminal is:
$$E(t) = \sum_{n=t_1}^{t_2} S^2(n) \quad \text{or} \quad E(t) = \sum_{n=t_1}^{t_2} S(n),$$
in which S(n) is the decoded speech data of each terminal; t1 and t2 are the starting time and ending time for calculating the speech energy respectively.
(2) the speech energy of the individual terminals attending the same conference is compared in real time according to the above-mentioned calculated speech energy of each terminal.
(3) several terminals with relatively large speech energy are selected in real time as maximal-voice terminals according to the comparison result above (the number of maximal-voice terminals to select is predefined by an operator). Each unselected terminal receives the linear superposition of the decoded speech data of all the maximal-voice terminals, while each selected terminal receives the linear superposition of the decoded speech data of the other maximal-voice terminals, excluding its own. Speech switching between multiple terminals is thereby implemented.
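Steps (1) to (3) can be sketched as follows. This is a hedged illustration rather than the device's actual implementation: it assumes the speech data has already been decoded into lists of numeric samples for one window t1..t2, it implements only the sum-of-squares form of the energy formula, and the names `speech_energy` and `select_loudest` are hypothetical.

```python
def speech_energy(samples):
    """Energy of one window of decoded samples S(n): the sum of S(n)^2."""
    return sum(s * s for s in samples)


def select_loudest(decoded, count):
    """Return the `count` terminals with the largest energy, plus all energies.

    `decoded` maps a terminal identifier to its decoded samples for the
    current window; `count` is the operator-defined number of terminals.
    """
    energies = {tid: speech_energy(samples) for tid, samples in decoded.items()}
    ranked = sorted(energies, key=energies.get, reverse=True)
    return ranked[:count], energies


# Illustrative sample values only; real frames would hold many more samples.
decoded = {
    "A": [3, -4, 2],
    "B": [1, -1, 1],
    "C": [0, 5, -5],
    "D": [0, 0, 1],
}
loudest, energies = select_loudest(decoded, count=2)
print(energies)
print(loudest)
```

Note that every terminal's data must be fully decoded before its energy can be computed, which is the cost the following discussion of this prior-art method returns to.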
The whole processing procedure of speech switching between multiple terminals will be illustrated below. Now refer to FIG. 2, which is a schematic diagram illustrating the processing procedure for speech switching between the five terminals attending a conference according to the prior art. The terminal A, terminal B, terminal C, terminal D and terminal E in the figure are five terminals that perform speech communication with each other, wherein at time t, the main process of speech switching between the five terminals implemented by the speech-switching device on the network side includes the following steps:
1) First, the encoded speech data transmitted from each of the terminal A, terminal B, terminal C, terminal D and terminal E respectively is fully decoded, and the speech energy of each terminal is calculated respectively according to the decoded speech data, so as to obtain the speech energy value of each terminal.
2) The calculated speech energy values of the five terminals are compared with each other and the terminals with relatively large speech energy are selected. For example, suppose that at time t the calculated speech energy values of terminal A, terminal B and terminal C are the largest; terminal A, terminal B and terminal C are then selected as the terminals with relatively large speech energy.
3) Speech mixing and switching are performed on the decoded speech data of terminal A, terminal B and terminal C, depending on the terminal to which the result is to be transmitted:
For example, at time t, for the terminal A, terminal B and terminal C which have relatively large speech energy, the terminal A receives the linearly superposed speech data of the decoded speech data of the terminal B and terminal C; wherein the formula for linear superposition may be: SA = λB×SB + λC×SC;
in which λB and λC are weighting factors, and λB+λC=1;
SA is the linearly superposed speech data received by terminal A, SB is the decoded speech data of terminal B and SC is the decoded speech data of terminal C;
The terminal B receives the linearly superposed speech data of the decoded speech data of the terminal A and terminal C, in which the calculation method of linear superposition is similar to that of the terminal A mentioned above.
The terminal C receives the linearly superposed speech data of the decoded speech data of the terminal A and terminal B, in which the calculation method of linear superposition is similar to that of the terminal A described above.
The other terminals, i.e. the terminal D and terminal E, receive the linearly superposed speech data of the decoded speech data of the terminal A, terminal B and terminal C; wherein the formula for linear superposition is: S = λA×SA + λB×SB + λC×SC;
in which λA, λB and λC are weighting factors, and λA+λB+λC=1;
SA is the decoded speech data of the terminal A, SB is the decoded speech data of the terminal B and SC is the decoded speech data of the terminal C, S is the linearly superposed speech data received by the terminal D and terminal E.
4) Each of the different linearly superposed speech data streams is encoded in turn, and the encoded speech data is transmitted to the corresponding terminal. For example, the linearly superposed speech data of terminal B and terminal C is encoded and transmitted to terminal A, so that at time t terminal A hears the voices of terminal B and terminal C but not its own voice. The linearly superposed speech data of terminal A, terminal B and terminal C is encoded and transmitted to terminal D and terminal E, so that terminal D and terminal E hear the voices of terminal A, terminal B and terminal C at time t. The speech-switching results thus conform to the basic principles of speech switching described above.
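The linear superposition in step 3) can be sketched per destination terminal as below. The sketch makes several assumptions that the text does not fix: decoded frames are equal-length lists of floating-point samples, the weighting factors default to equal values (the text only requires that they sum to 1), and the helper names `mix` and `mix_for_all` are hypothetical. Encoding of the mixed data (step 4) is omitted.

```python
def mix(decoded, sources, weights=None):
    """Weighted sum of the decoded frames of `sources`, sample by sample."""
    sources = sorted(sources)
    if not sources:
        return []  # nothing to hear, e.g. the only selected terminal itself
    if weights is None:
        # Assumption: equal weighting factors, which sum to 1.
        weights = {tid: 1.0 / len(sources) for tid in sources}
    frame_len = len(decoded[sources[0]])
    return [
        sum(weights[tid] * decoded[tid][n] for tid in sources)
        for n in range(frame_len)
    ]


def mix_for_all(decoded, loudest):
    """Build the frame each terminal receives, excluding its own speech."""
    loudest = set(loudest)
    return {tid: mix(decoded, loudest - {tid}) for tid in decoded}


# Illustrative two-sample frames for the five terminals of FIG. 2.
decoded = {
    "A": [4.0, 8.0],
    "B": [2.0, 0.0],
    "C": [0.0, 4.0],
    "D": [1.0, 1.0],
    "E": [0.0, 0.0],
}
mixes = mix_for_all(decoded, loudest={"A", "B", "C"})
print(mixes["A"])  # mix of B and C only
print(mixes["D"])  # mix of A, B and C
```

Terminal D and terminal E receive the same three-way mix, so in practice that mix needs to be computed and encoded only once for all unselected terminals.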
However, it can be seen that:
(A) The speech-switching device has to fully decode the encoded speech data received from each terminal before it can calculate the speech energy of each terminal; it then linearly superposes the decoded speech and, finally, encodes the linearly superposed speech data and sends it to each terminal. A full decoding and encoding operation therefore has to be performed on the data of every terminal, which wastes resources. The waste is especially serious in a large-capacity communication system in which many terminals participate in a conference, and it degrades communication efficiency and performance.
(B) In communication systems supporting multi-point speech communication, an operator usually predefines in the speech-switching device the number of terminals with relatively large speech energy to be selected during switching (typically three). During the subsequent speech switching, that predefined number of terminals is always selected for the linear superposition of decoded speech data. Consequently, when the number of terminals actually speaking at a certain time is smaller than the predefined number, one or more noise signals are introduced, which degrades the speech communication between the multiple terminals. Referring to the example shown in FIG. 2, suppose that the operator has predefined that three terminals are to be selected as terminals with relatively large speech energy, and that at a certain time t only terminal A and terminal B have speech. Three terminals still have to be selected according to the preset, so one of terminals C, D and E is selected at random in addition to terminals A and B. The selected terminal contributes only noise, which is linearly superposed with the decoded speech data of terminal A and terminal B and transmitted to the corresponding terminals. As a result, terminal A hears the superposition of the voice of terminal B and a noise; terminal B hears the superposition of the voice of terminal A and a noise; and terminal D and terminal E hear the superposition of the voice of terminal A, the voice of terminal B and a noise.
In summary, terminal A, terminal B, terminal D and terminal E all effectively hear a noise, and the quality of the speech communication between the multiple terminals deteriorates accordingly.
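The effect described in (B) can be reproduced with a small sketch. The energy values and the `NOISE_FLOOR` threshold are invented for illustration, and the threshold is used only to label the outcome, not to change the selection. The text describes the third terminal as chosen at random, whereas this sketch simply takes the next-highest residual energy; either way a non-speaking terminal enters the mix.

```python
def select_fixed_count(energies, count):
    """Always return exactly `count` terminals, ranked by energy."""
    ranked = sorted(energies, key=energies.get, reverse=True)
    return ranked[:count]


# Hypothetical energies at time t: only A and B are speaking, while
# C, D and E contribute nothing but background noise.
energies = {"A": 900.0, "B": 750.0, "C": 0.4, "D": 0.3, "E": 0.2}

selected = select_fixed_count(energies, count=3)

NOISE_FLOOR = 1.0  # hypothetical threshold, used here only for labelling
noise_only = [tid for tid in selected if energies[tid] < NOISE_FLOOR]

print(selected)    # the three terminals that will be mixed
print(noise_only)  # selected terminals that carry no speech
```

With a fixed count of three, one noise-only terminal is mixed into what the other terminals hear, which is the degradation described above.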