In an audio conferencing system, several participants are connected to a conference bridge. The conference bridge handles admission control of participants, conference control functions, etc. While an audio conference is ongoing, the conference bridge performs media processing: it receives the audio signals from the participants and mixes them into a total signal that is transmitted back to the participants, with each participant's own signal subtracted from the signal it receives in order to avoid echo.
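The own-signal subtraction described above is often called a "mix-minus". A minimal sketch, assuming each participant delivers one frame of float samples of equal length and ignoring clipping, resampling and encoding; the names `mix_minus` and `frames` are hypothetical:

```python
from typing import Dict, List

Signal = List[float]

def mix_minus(frames: Dict[str, Signal]) -> Dict[str, Signal]:
    """Hypothetical helper: total mix minus each participant's own signal."""
    length = len(next(iter(frames.values())))
    # Sum all participant signals sample by sample into one total signal.
    total = [sum(sig[i] for sig in frames.values()) for i in range(length)]
    # Subtract each participant's own contribution so it does not hear itself (echo avoidance).
    return {pid: [t - s for t, s in zip(total, sig)] for pid, sig in frames.items()}

frames = {"alice": [0.5, 0.25], "bob": [0.25, 0.5], "carol": [0.0, 0.25]}
out = mix_minus(frames)
print(out["alice"])  # [0.25, 0.75]
```

Computing the total once and subtracting per participant keeps the cost linear in the number of participants, instead of re-summing all other signals for each listener.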
In general, a conferencing system should be scalable, i.e. the hardware that runs the conference bridges should be able to handle several conferences and a large number of participants at the same time. In a typical audio conference, however, at most two or three people talk at the same time. Moreover, the number of people allowed to talk simultaneously needs to be limited for the conference to remain meaningful to a listener. Therefore, the logic for controlling the mixing of the audio signals is advantageously designed such that a certain maximum number of active participants is allowed at the same time in a given conference. The total mixed audio signal is calculated from these active participants. Each active participant receives this total mixed signal with its own signal subtracted, so that the participant does not hear his own voice. All other participants receive the unmodified total mixed signal. In this manner only a few distinct signals need to be generated and transmitted (one per active participant plus one shared mix), which reduces complexity in both mixing and encoding.
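The following sketch illustrates why capping the number of mixed participants limits the number of distinct output signals. It assumes float sample frames and uses a placeholder rule (first ids in sorted order) to choose which active participants are mixed; the actual selection rule is not given here, and `build_outputs` is a hypothetical name:

```python
from typing import Dict, List, Set

def build_outputs(frames: Dict[str, List[float]], active: Set[str],
                  max_active: int) -> Dict[str, List[float]]:
    """Hypothetical helper: mix at most max_active participants and build every output."""
    # Placeholder selection rule (assumption): keep the first max_active ids in sorted order.
    mixed_ids = sorted(active)[:max_active]
    length = len(next(iter(frames.values())))
    # Total mixed signal, computed from the selected active participants only.
    total = [sum(frames[pid][i] for pid in mixed_ids) for i in range(length)]
    outputs: Dict[str, List[float]] = {}
    for pid, own in frames.items():
        if pid in mixed_ids:
            # A mixed (active) participant gets the total with its own signal removed.
            outputs[pid] = [t - s for t, s in zip(total, own)]
        else:
            # Everyone else shares one and the same total mix.
            outputs[pid] = total
    return outputs

frames = {"a": [1.0], "b": [2.0], "c": [4.0], "d": [0.0], "e": [0.0]}
outputs = build_outputs(frames, active={"a", "b", "c"}, max_active=2)
distinct = {tuple(sig) for sig in outputs.values()}
print(len(distinct))  # 3, i.e. max_active + 1
```

With five participants only three signals need to be encoded: one for each mixed participant and one shared by the rest.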
Further, it is desirable to limit the number of audio channels to mix, even if the current number of active participants is low. Mixing too many channels, some of which contain only background noise, degrades quality, since it lowers the signal-to-noise ratio of the resulting mixed signal.
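A worked illustration of the signal-to-noise argument, under the simplifying assumption that one channel carries speech of power S, every mixed channel carries independent background noise of equal power N, and uncorrelated noise powers add. Mixing k extra noise-only channels then gives SNR = S / ((k + 1) N). The function name `mixed_snr_db` is hypothetical:

```python
import math

def mixed_snr_db(speech_power: float, noise_power: float, noise_only_channels: int) -> float:
    """Hypothetical helper: SNR in dB of one speech channel mixed with k noise-only channels.

    Assumes every channel carries independent noise of equal power, so the
    noise powers add while the speech power stays the same.
    """
    total_noise = noise_power * (1 + noise_only_channels)
    return 10 * math.log10(speech_power / total_noise)

for k in (0, 1, 3, 9):
    print(k, round(mixed_snr_db(100.0, 1.0, k), 1))
# 0 20.0
# 1 17.0
# 3 14.0
# 9 10.0
```

Under these assumptions each doubling of the number of noise-carrying channels costs about 3 dB of SNR.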
The present invention addresses the problem of how to select audio channels when mixing the corresponding audio signals to a resulting mixed audio signal.
EP 0 995 191 discloses mixing of multiple concurrent audio streams. Each stream comprises a sequence of frames, and a subset of the concurrent frames is selected for mixing. The selection involves ranking the concurrent frames in order of importance and then selecting the most important ones. The ranking is based on a quantity inherent in each of the concurrent frames, such as its energy content. Selection can also be based on a combination of energy content and priorities assigned to the respective streams.
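A minimal sketch of this kind of ranking-based selection, assuming frame energy is the sum of squared samples. How energy and stream priority are combined is not specified here; the multiplicative weighting below is an assumption for illustration, and `select_frames` is a hypothetical name:

```python
from typing import Dict, List, Optional

def frame_energy(frame: List[float]) -> float:
    # Energy of a frame: sum of squared samples.
    return sum(x * x for x in frame)

def select_frames(frames: Dict[str, List[float]], m: int,
                  priority: Optional[Dict[str, float]] = None) -> List[str]:
    """Hypothetical helper: rank concurrent frames by importance and keep the m best.

    Importance = energy, optionally multiplied by a per-stream priority weight
    (the multiplicative combination is an assumption, not the cited method).
    """
    def importance(stream_id: str) -> float:
        weight = priority.get(stream_id, 1.0) if priority else 1.0
        return weight * frame_energy(frames[stream_id])
    return sorted(frames, key=importance, reverse=True)[:m]

frames = {"s1": [0.5, -0.5], "s2": [0.1, 0.1], "s3": [0.3, 0.3]}
print(select_frames(frames, 2))                          # ['s1', 's3']
print(select_frames(frames, 2, priority={"s2": 20.0}))   # ['s1', 's2']
```

A quiet stream such as `s2` stays out of the mix unless its priority weight is raised enough to outrank the louder streams.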
One problem with this prior art is that it is difficult for a new audio stream to be included in the mix of audio streams. Consider, for example, a speech conference that a new user wants to join. If the new user's audio stream is not ranked high enough, due to its low energy content or the low priority of the stream, other audio streams with higher rankings will prevent the new participant from easily joining the conference.
Another problem with the above-described prior art is that, in certain common situations, such a scheme for mixing audio streams results in annoying switching behaviour in the background noise. This problem will be […] output signal. This will result in a more natural mixed output signal, because inactive channels are not changed unnecessarily in the mix. This can be compared with a system in which a certain criterion, e.g. an energy criterion, determines what channels to mix. In such a system an inactive channel will often be replaced by another inactive channel because, e.g., the background noise of the latter has a higher energy content, or the latter better fulfils some other criterion. This in turn results in annoying switching behaviour in the background noise of the mixed output signal. Alternatively, such a system may choose not to include the inactive channel in the mixed output signal at all, which also results in a less natural mixed output signal.
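The switching effect of a pure energy criterion can be illustrated with a small simulation. The per-frame noise energies below are made-up values for two inactive channels whose background noise levels are similar; `energy_choice` and `count_switches` are hypothetical names:

```python
from typing import List, Sequence

def energy_choice(noise_energy: Sequence[Sequence[float]]) -> List[int]:
    """Hypothetical helper: per frame, pick the inactive channel with the highest noise energy."""
    return [max(range(len(frame)), key=lambda ch: frame[ch]) for frame in noise_energy]

def count_switches(choices: List[int]) -> int:
    # Number of frames where the selected channel differs from the previous frame.
    return sum(1 for prev, cur in zip(choices, choices[1:]) if prev != cur)

# Made-up per-frame noise energies for two inactive channels (0 and 1) whose
# background noise levels fluctuate around each other.
noise = [(1.0, 0.9), (0.8, 1.1), (1.2, 0.7), (0.9, 1.0), (1.1, 0.95)]
choices = energy_choice(noise)
print(choices, count_switches(choices))  # [0, 1, 0, 1, 0] 4
```

The mixed background noise changes character on four of five frame boundaries, whereas keeping the same inactive channel in the mix (for example `[0, 0, 0, 0, 0]`) gives zero switches.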
Preferably, when an active audio channel becomes inactive, that channel is moved below the active channels in the stack. As a result, any active channel located just below the threshold level becomes part of the mixed output signal, since it moves one position up in the stack and is then positioned above the threshold level. Again, if the mixing stack has more channels above the threshold level than there are currently active channels, the channel that has become inactive remains part of the mixed output signal, and unnecessary switching behaviour in the background noise is avoided.
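A minimal sketch of the stack behaviour described above, under stated assumptions: the stack is a list ordered from top to bottom, the channels in the top `threshold` positions are mixed, and only the rule for a channel becoming inactive is modelled (how a channel becoming active moves up is not covered here). The class `MixingStack` and its methods are hypothetical:

```python
from typing import List, Set

class MixingStack:
    """Hypothetical mixing stack: top `threshold` positions form the mixed output."""

    def __init__(self, order: List[str], active: Set[str], threshold: int) -> None:
        self.order = list(order)     # top of the stack first
        self.active = set(active)    # channels currently carrying speech
        self.threshold = threshold   # number of stack positions that are mixed

    def mixed(self) -> List[str]:
        # Channels above the threshold level form the mixed output signal.
        return self.order[: self.threshold]

    def deactivate(self, channel: str) -> None:
        # Move a channel that has become inactive to just below the remaining active channels.
        self.active.discard(channel)
        self.order.remove(channel)
        last_active = max((i for i, ch in enumerate(self.order) if ch in self.active),
                          default=-1)
        self.order.insert(last_active + 1, channel)

# Case 1: active channel C waits just below the threshold and moves up into the mix.
s = MixingStack(["A", "B", "C", "D"], {"A", "B", "C"}, threshold=2)
s.deactivate("A")
print(s.order, s.mixed())    # ['B', 'C', 'A', 'D'] ['B', 'C']

# Case 2: more mixed positions than active channels, so A stays in the mix.
s2 = MixingStack(["A", "B", "C", "D"], {"A", "B"}, threshold=3)
s2.deactivate("A")
print(s2.order, s2.mixed())  # ['B', 'A', 'C', 'D'] ['B', 'A', 'C']
```

In the second case the set of mixed channels is unchanged ({A, B, C} before and after), so the background noise in the mix does not switch.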
Although one apparent application of the present invention is a speech conference system, the skilled person will appreciate that the idea behind the present invention, as well as its implementation, is suitable for any application where there is a need to select which audio channels to mix among multiple audio channels, whether they convey speech, music or any other kind of audio, and then to obtain a mixed audio signal to be output to a desired destination, such as a loudspeaker, a recording device, or back to one or more of […] more fully understood upon study of the following disclosure of the present invention.