In a multipoint conference service, the voice data of each participant, encoded by a voice encoder, is transmitted to a multipoint conference server. The multipoint conference server transmits to each participant voice data in which the voices of all the other participants are mixed.
When mixing the voice data, the server first calculates the voice signal of all the participants by adding together the decoded voice signals obtained by decoding the voice data of each participant. Next, for each participant, a voice signal is obtained by subtracting that participant's own voice from the voice signal of all the participants; this signal is encoded, and the generated voice data is transmitted to that participant.
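The two mixing steps described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular server: the function names `mix_minus_own` and `clip16`, the use of 16-bit linear PCM frames, and the saturation to the 16-bit range are all assumptions made for the example.

```python
from typing import Dict, List


def clip16(value: int) -> int:
    """Saturate to the signed 16-bit range (an assumption; the text does not mention clipping)."""
    return max(-32768, min(32767, value))


def mix_minus_own(decoded: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """For each participant, return the mix of all the other participants' decoded frames."""
    if not decoded:
        return {}
    frame_len = len(next(iter(decoded.values())))

    # Step 1: voice signal of all participants = sample-wise sum of every decoded signal.
    total = [0] * frame_len
    for samples in decoded.values():
        for i, s in enumerate(samples):
            total[i] += s

    # Step 2: subtract each participant's own signal from the total.
    return {
        name: [clip16(t - s) for t, s in zip(total, samples)]
        for name, samples in decoded.items()
    }


if __name__ == "__main__":
    frames = {"a": [1, 2], "b": [10, 20], "c": [100, 200]}
    print(mix_minus_own(frames))
```

Each returned frame would then be passed to the voice encoder serving that participant. Computing the total once and subtracting costs one addition pass plus one subtraction per participant, instead of re-summing the other signals for every participant.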
As examples of the communication protocol between a terminal and the server in a multipoint conference service, ITU-T H.323 and H.324 are used in a circuit switching network, 3G-324M is used in a mobile network, and RTP (Real-time Transport Protocol) defined by IETF RFC 3550 is used in a packet network based on IP (Internet Protocol).
As the voice encoder, the G.711 and G.729 methods, which are ITU-T standards, the AMR (Adaptive Multi-Rate) method defined by 3GPP TS 26.090, the AMR-WB (Wide Band) method defined by TS 26.190, and the EVRC (Enhanced Variable Rate Codec) method defined by 3GPP2 are used.
The G.711 method compresses each 16-bit sample of a voice signal sampled at 8 kHz to 8 bits by using a logarithmic transformation. In this method, the amount of calculation is small but the compression ratio is low.
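The logarithmic transformation can be illustrated with the continuous mu-law companding curve. The sketch below is an assumption-laden illustration only: it is not bit-exact G.711, which uses a segmented piecewise-linear approximation of the curve and its own code-word layout, and the function names and the mapping of the companded value to the range 0 to 255 are hypothetical.

```python
import math

MU = 255  # companding constant of the mu-law curve


def mulaw_encode(sample: int) -> int:
    """Map a 16-bit linear sample to an 8-bit code with the continuous mu-law curve."""
    x = max(-32768, min(32767, sample)) / 32768.0
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    # Map the companded value in [-1, 1] to an integer code in 0..255.
    return int(round((y + 1.0) / 2.0 * 255))


def mulaw_decode(code: int) -> int:
    """Inverse of mulaw_encode, up to quantization error."""
    y = code / 255.0 * 2.0 - 1.0
    x = math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
    return max(-32768, min(32767, int(round(x * 32768))))


if __name__ == "__main__":
    for s in (0, 100, 1000, 20000, -20000):
        code = mulaw_encode(s)
        print(s, code, mulaw_decode(code))
```

Because the quantization steps are fine near zero and coarse at large amplitudes, the relative error stays roughly constant over the amplitude range, which is why 8 bits per sample suffice for speech.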
On the other hand, the G.729 method, the AMR method, and the EVRC method are based on a differential coding method according to the CELP (Code Excited Linear Prediction) principle and they can encode the voice signal more efficiently.
In CELP, the encoder extracts, for every frame (for example, 20 ms), a spectrum parameter representing the spectral characteristic of the voice signal by using linear prediction analysis (LPC: Linear Predictive Coding).
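Linear prediction analysis of one frame can be sketched with the autocorrelation method and the Levinson-Durbin recursion. This is a generic textbook formulation, not the procedure of any specific codec: windowing, bandwidth expansion, and conversion of the coefficients to a quantizable form are omitted, and the function names are hypothetical. The sketch assumes the convention that a sample is predicted as the sum of `a[k] * x[n-1-k]` over the returned coefficients.

```python
from typing import List, Tuple


def autocorrelation(frame: List[float], order: int) -> List[float]:
    """Autocorrelation values r[0..order] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k)) for k in range(order + 1)]


def levinson_durbin(r: List[float], order: int) -> Tuple[List[float], float]:
    """Solve the normal equations; return (predictor coefficients, residual energy)."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame
        # Reflection coefficient for stage i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


if __name__ == "__main__":
    # A decaying exponential is perfectly predicted by a first-order predictor.
    frame = [0.9 ** n for n in range(200)]
    coeffs, residual = levinson_durbin(autocorrelation(frame, 2), 2)
    print(coeffs, residual)
```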
Each frame of the voice signal is further divided into sub-frames (for example, 5 ms). For every sub-frame, the parameters of an adaptive code book (a delay parameter corresponding to the pitch period and a gain parameter) are extracted based on the past sound source signal, and the pitch of the voice signal of the sub-frame is predicted by using the adaptive code book. For the residual signal obtained through the pitch prediction, the most suitable sound source code vector is selected from a sound source code book (vector quantization code book) consisting of predetermined kinds of noise signals, and the most suitable gain is calculated, thereby quantizing the sound source signal.
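The extraction of the delay and gain parameters of the adaptive code book can be sketched as an exhaustive search over candidate delays. The sketch is a simplification under stated assumptions: it matches candidates directly against the target sub-frame, whereas practical CELP encoders typically pass candidates through a weighted synthesis filter first, and the function name and lag range are hypothetical.

```python
from typing import List, Optional, Tuple


def adaptive_codebook_search(
    past: List[float], target: List[float], min_lag: int, max_lag: int
) -> Tuple[Optional[int], float]:
    """Return the (delay, gain) whose past-excitation segment best matches the target."""
    n = len(target)
    best_lag: Optional[int] = None
    best_gain = 0.0
    best_score = -1.0
    for lag in range(min_lag, min(max_lag, len(past)) + 1):
        start = len(past) - lag
        # Candidate vector: the excitation `lag` samples back, repeated
        # periodically when the delay is shorter than the sub-frame.
        cand = [past[start + (i % lag)] for i in range(n)]
        energy = sum(c * c for c in cand)
        if energy == 0.0:
            continue
        corr = sum(t * c for t, c in zip(target, cand))
        score = corr * corr / energy  # error reduction obtained with the optimal gain
        if score > best_score:
            best_lag, best_gain, best_score = lag, corr / energy, score
    return best_lag, best_gain


if __name__ == "__main__":
    pattern = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0]  # period of 7 samples
    past = pattern * 10
    print(adaptive_codebook_search(past, pattern[:5], 5, 20))
```

For a perfectly periodic excitation, the search returns the period as the delay and a gain of 1.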
The sound source code vector is selected so as to minimize the error power between the signal synthesized from the selected noise signal and the above-mentioned residual signal. The index indicating the kind of the selected code vector, the gain, the spectrum parameter, and the parameters of the adaptive code book are combined and transmitted as the voice data.
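The selection criterion can be sketched as follows. For each code vector the optimal gain has a closed form, so the search reduces to picking the vector with the smallest remaining error power. The sketch is illustrative only: it compares code vectors directly with the residual and omits the synthesis filtering and perceptual weighting used in practical encoders, and the function name and the tiny code book are hypothetical.

```python
from typing import List, Tuple


def search_codebook(residual: List[float], codebook: List[List[float]]) -> Tuple[int, float, float]:
    """Return (index, gain, error power) of the code vector that best matches the residual."""
    target_energy = sum(r * r for r in residual)
    best = (-1, 0.0, target_energy)
    for index, vector in enumerate(codebook):
        energy = sum(c * c for c in vector)
        if energy == 0.0:
            continue
        corr = sum(r * c for r, c in zip(residual, vector))
        gain = corr / energy  # gain that minimizes the error for this vector
        error = target_energy - corr * corr / energy
        if error < best[2]:
            best = (index, gain, error)
    return best


if __name__ == "__main__":
    codebook = [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 1.0, 1.0],
        [1.0, -1.0, 1.0, -1.0],
    ]
    print(search_codebook([2.0, -2.0, 2.0, -2.0], codebook))
```

Only the index and the quantized gain need to be transmitted, since the decoder holds the same code book.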
A decoder calculates the sound source signal and the coefficients of the synthesis filter used in linear prediction from the parameters obtained from the voice data, and drives the synthesis filter with the sound source signal, thereby obtaining the synthesized voice signal.
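Driving the synthesis filter with the sound source signal can be sketched as an all-pole recursion. This is a minimal sketch assuming the convention that each output sample is the excitation sample plus the sum of `lpc[k] * y[n-1-k]`; post-filtering and the reconstruction of the excitation from the code book indices are omitted, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def synthesize(
    excitation: List[float], lpc: List[float], memory: Optional[List[float]] = None
) -> Tuple[List[float], List[float]]:
    """Run the all-pole synthesis filter; return (output, updated filter memory)."""
    order = len(lpc)
    # history[0] holds y[n-1], history[1] holds y[n-2], and so on.
    history = list(memory) if memory is not None else [0.0] * order
    output = []
    for e in excitation:
        y = e + sum(a * h for a, h in zip(lpc, history))
        output.append(y)
        history = ([y] + history)[:order]
    return output, history


if __name__ == "__main__":
    out, mem = synthesize([1.0, 0.0, 0.0, 0.0], [0.5])
    print(out, mem)
```

The returned memory has to be carried from one frame to the next; the decoded signal is continuous only if that state matches the state the encoder assumed.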
A voice mixing method is disclosed (refer to Patent Document 1) in which comparison/selection processing is not performed for every sample; instead, based on the result of one magnitude comparison/selection among the samples, a plurality of samples following the sample of the selected voice data are selected.
Further, a voice mixing method is disclosed (refer to Patent Document 2) in which a total signal is first generated in a mixing unit, the voice information of one user (the voice information transmitted by that user) is subtracted from the total signal, and the voice information of the users other than that user is returned to that user.
A communication control unit is disclosed (refer to Patent Document 3) in which a voice synthesis unit adds the voice data converted into linear data by each heterogeneous encoding/decoding unit, then generates voice data by subtracting the own voice from the added voice data, and transmits it to the corresponding heterogeneous encoding/decoding unit.
Patent Document 1: Japanese Patent Publication Laid-Open No. 2005-151044 (paragraphs 0014, 0016 and 0045)
Patent Document 2: Japanese Patent Publication Laid-Open No. 2005-229259 (paragraph 0003 and FIG. 1)
Patent Document 3: Japanese Patent Laid-Open No. 6-350724 (paragraph 0020 and FIG. 2)
In a multipoint conference system in the related art, a voice in which the voices of all the participants other than the receiving participant are mixed is encoded and transmitted to every participant. Since the amount of calculation for voice encoding increases as the number of participants increases, the system detects the speakers who are currently uttering and restricts the number of voices to be mixed, thereby reducing the number of voice encoders to be operated.
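The restriction on the number of voices to be mixed can be sketched as follows. The text does not state how uttering speakers are detected, so the frame-power threshold used here is an assumption, as are the function name, the threshold value, and the limit on the number of speakers.

```python
from typing import Dict, List


def select_speakers(
    frames: Dict[str, List[int]], max_speakers: int, power_threshold: float
) -> List[str]:
    """Return at most max_speakers participants whose frame power reaches the threshold."""
    power = {
        name: sum(s * s for s in samples) / len(samples)
        for name, samples in frames.items()
        if samples
    }
    # Assumed detection rule: a participant is "uttering" if the mean power
    # of the current frame reaches the threshold.
    active = [name for name, p in power.items() if p >= power_threshold]
    # Keep the loudest speakers first.
    active.sort(key=lambda name: power[name], reverse=True)
    return active[:max_speakers]


if __name__ == "__main__":
    frames = {
        "a": [0, 0, 0, 0],
        "b": [100, -100, 100, -100],
        "c": [1000, -1000, 1000, -1000],
        "d": [500, -500, 500, -500],
    }
    print(select_speakers(frames, 2, 1000.0))
```

With the number of mixed voices bounded in this way, the number of distinct mixed signals, and hence of encoders that must run, no longer grows with the number of participants.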
However, when a voice encoder performing differential coding such as the CELP method is used, switching the encoder in response to a change of speaker causes an inconsistency in the memory that holds the internal state of the encoder, and as a result abnormal sound occurs in the decoded voice.
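The state inconsistency can be illustrated with a toy differential codec. The sketch below is a deliberately simplified stand-in for CELP: it uses lossless first-order DPCM, and the class names and sample values are hypothetical. The point is only that a decoder whose memory was built up from one encoder misinterprets the output of another encoder whose memory differs.

```python
from typing import List


class DpcmEncoder:
    """Toy differential encoder: transmits the difference from the previous sample."""

    def __init__(self) -> None:
        self.prediction = 0

    def encode(self, sample: int) -> int:
        diff = sample - self.prediction
        self.prediction = sample
        return diff


class DpcmDecoder:
    """Toy differential decoder: accumulates the received differences."""

    def __init__(self) -> None:
        self.prediction = 0

    def decode(self, diff: int) -> int:
        self.prediction += diff
        return self.prediction


def run_switch_demo() -> List[int]:
    encoder_a, encoder_b, decoder = DpcmEncoder(), DpcmEncoder(), DpcmDecoder()
    # The decoder first follows encoder A and stays consistent with it.
    decoded = [decoder.decode(encoder_a.encode(s)) for s in (1000, 1010, 1020)]
    # The stream is then switched to encoder B, whose memory starts from zero,
    # while the decoder memory still reflects encoder A.
    decoded += [decoder.decode(encoder_b.encode(s)) for s in (500, 510, 520)]
    return decoded


if __name__ == "__main__":
    print(run_switch_demo())
```

After the switch, the decoder outputs 1520, 1530, 1540 instead of 500, 510, 520. In this toy codec the mismatch appears as a constant offset; in a CELP decoder the corresponding mismatch in the adaptive code book and filter memories is what the text describes as abnormal sound.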
Means for solving the problem are not disclosed in the above Patent Documents 1 to 3.