In a voice processing system, a voice mixing process often needs to be performed on voices from multiple channels in order to support multi-party voice communication. Multi-channel voice mixing refers to a method or process for superimposing the waveforms of voices from multiple channels upon each other to form a single channel of voice. The simplest voice mixing directly adds together the original voice waveforms (e.g., pulse-code modulation (PCM) streams) from all input channels, sample by sample, to form one voice PCM stream after the voice mixing.
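The direct addition described above can be illustrated with a minimal sketch. It assumes that each channel supplies one frame of signed integer PCM samples and that all frames have the same length; the function name `mix_direct` is hypothetical and is used only for illustration.

```python
from typing import List, Sequence


def mix_direct(channels: Sequence[Sequence[int]]) -> List[int]:
    """Add equal-length PCM frames together sample by sample.

    Assumption: every channel supplies a frame of signed integer samples
    of the same length. No limiting or scaling is applied.
    """
    if not channels:
        return []
    frame_len = len(channels[0])
    if any(len(frame) != frame_len for frame in channels):
        raise ValueError("all channels must supply frames of equal length")
    # zip(*channels) groups the samples that share the same time index.
    return [sum(samples) for samples in zip(*channels)]


# Two hypothetical channels, three samples each.
mixed = mix_direct([[1, 2, 3], [10, 20, 30]])
```

In this sketch the summed values are left unclamped, so with 16-bit PCM input a sum can fall outside the range -32768 to 32767.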
However, a practical multi-channel voice mixing system usually has a large number of input channels participating in the voice mixing. In this case, directly adding together the voice PCM streams from all input channels can cause problems such as increased background noise and output overflow. Therefore, a multi-channel voice mixing system often selects the input voices from only a small number of channels (usually 2 to 5 channels) at a time for the voice mixing, according to a certain voice-mixing strategy (e.g., a first voice-mixing strategy), in order to reduce such problems.
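One possible form of such a selection can be sketched as follows. The text does not specify the voice-mixing strategy, so the sketch assumes, purely for illustration, that the channels with the highest frame energy are selected and that the mixed output is clamped to the signed 16-bit range. The names `frame_energy`, `select_channels`, and `mix_selected`, and the default of three channels, are hypothetical.

```python
from typing import Dict, List, Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def frame_energy(frame: Sequence[int]) -> int:
    """Sum of squared samples, used here as a simple loudness measure."""
    return sum(s * s for s in frame)


def select_channels(frames: Dict[str, Sequence[int]],
                    max_channels: int = 3) -> List[str]:
    """Return the IDs of the loudest channels (assumed selection strategy)."""
    ranked = sorted(frames, key=lambda cid: frame_energy(frames[cid]),
                    reverse=True)
    return ranked[:max_channels]


def mix_selected(frames: Dict[str, Sequence[int]],
                 max_channels: int = 3) -> List[int]:
    """Mix only the selected channels and clamp the result to 16 bits."""
    chosen = select_channels(frames, max_channels)
    if not chosen:
        return []
    summed = [sum(samples) for samples in zip(*(frames[c] for c in chosen))]
    # Clamping avoids wrap-around when the sum exceeds the 16-bit range.
    return [max(INT16_MIN, min(INT16_MAX, s)) for s in summed]


# Four hypothetical input channels, two samples each; mix the loudest two.
frames = {
    "a": [100, -100],
    "b": [1, 1],
    "c": [20000, 20000],
    "d": [20000, -5],
}
output = mix_selected(frames, max_channels=2)
```

Here channels "c" and "d" are selected, the quiet channels "a" and "b" are left out of the mix, and the first output sample is clamped because the raw sum exceeds 32767.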
In a voice communication system, there are two mixing modes, distinguished by where the voice mixing is performed: server voice mixing and terminal voice mixing. Server voice mixing provides relatively high voice mixing quality, but the voice mixing process consumes significant resources; especially when there are a great number of voice users, the voice server can be overwhelmed. Terminal voice mixing reduces the resource load on the server, but provides relatively low voice mixing quality and cannot meet the high quality requirements of occasions such as audio/video conferences.