Conferencing capability is an essential part of any voice communication network. Wide-area conferencing facilitates group collaboration among, for example, businesses, educational institutions, government organizations, and the military. Traditional conferencing systems often rely on time division multiplexing (TDM) to bridge and mix voice traffic streams. (TDM-based systems are fully conventional and well known to those of ordinary skill in the art.)
Recently, a great deal of effort has gone into Internet-based voice communication systems (commonly referred to as voice-over-IP systems) and in particular into the development of Internet Protocol (IP) based media servers, which can offer advanced and cost-effective conferencing services in such voice-over-IP environments. One of the key portions of a voice-over-IP based conferencing media server is the audio signal mixer, whose function is to mix a plurality of inbound voice streams from multiple users and then send back to each user a mixed voice stream, thereby enabling each user to hear the voices of the other users.
Traditionally, such audio signal mixing has been accomplished through the use of a straightforward mixing algorithm which merely combines (i.e., sums) all of the plural voice traffic streams together and then normalizes the aggregate signal to an appropriate range (in order to prevent it from clipping). This method has been widely adopted in the currently available conferencing systems because of its computational efficiency and implementation simplicity.
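The sum-and-normalize approach described above can be illustrated with a minimal sketch. It is not drawn from any particular conferencing product; the function name `mix_sum_normalize`, the 16-bit peak value of 32767, and the choice to scale a frame only when its sum would otherwise clip are all assumptions made for illustration.

```python
def mix_sum_normalize(streams, peak=32767):
    """Sum equal-length sample frames, then scale the result into range.

    `streams` is a list of frames, each a list of integer samples.
    `peak` is an assumed 16-bit signed maximum; it is illustrative only.
    """
    if not streams:
        return []
    frame_len = len(streams[0])
    # Combine (sum) all channels sample by sample.
    mixed = [sum(s[i] for s in streams) for i in range(frame_len)]
    # Normalize only if the aggregate would exceed the allowed range.
    max_abs = max((abs(v) for v in mixed), default=0)
    if max_abs > peak:
        scale = peak / max_abs
        mixed = [round(v * scale) for v in mixed]
    return mixed


quiet = mix_sum_normalize([[1000, -2000], [3000, 4000]])
loud = mix_sum_normalize([[30000, 100], [30000, 100]])
```

In this sketch the first call leaves the summed samples untouched because they are within range, while the second call scales the whole frame down so that its largest sample sits at the assumed peak.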
However, the voice quality of the mixed streams produced by such a simplistic method is often not acceptable for various reasons, such as differing voice levels, unbalanced voice qualities, and unequal signal-to-noise ratios (SNR) among different channels. In addition, when too many channels are mixed together (e.g., when too many users are speaking simultaneously), the listener cannot easily distinguish one particular speaker from the others.
Therefore, to limit the number of channels present at a time in the mixed signal, the functionality of a “loudest N selection” has been added to the above-described straightforward mixing algorithm. In this modified approach, the energy level of each inbound channel is estimated and is then used as a selection criterion. Those channels with energy above a certain threshold, for example, are selected and mixed into the output signal, while all of the other channels are merely discarded (i.e., ignored).
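The selection step described above can be sketched as follows. This is an illustrative example under stated assumptions rather than a definitive implementation: the names `frame_energy`, `select_loudest`, and `mix_selected` are hypothetical, energy is estimated as the mean squared sample value of a single frame, and the sketch applies both an energy threshold and a cap of N channels, neither of whose values is specified in the text.

```python
def frame_energy(frame):
    """Estimate energy as the mean squared sample value (an assumed estimator)."""
    return sum(x * x for x in frame) / len(frame) if frame else 0.0


def select_loudest(frames, n=3, threshold=0.0):
    """Return up to `n` channel ids whose energy exceeds `threshold`, loudest first.

    `frames` maps a channel id to its current frame of samples.
    """
    energies = {ch: frame_energy(f) for ch, f in frames.items()}
    candidates = [ch for ch, e in energies.items() if e > threshold]
    candidates.sort(key=lambda ch: energies[ch], reverse=True)
    return candidates[:n]


def mix_selected(frames, n=3, threshold=0.0, peak=32767):
    """Sum only the selected channels; all other channels are discarded."""
    if not frames:
        return []
    frame_len = len(next(iter(frames.values())))
    chosen = select_loudest(frames, n, threshold)
    mixed = [sum(frames[ch][i] for ch in chosen) for i in range(frame_len)]
    # Scale down only if the aggregate would clip (assumed 16-bit range).
    max_abs = max((abs(v) for v in mixed), default=0)
    if max_abs > peak:
        mixed = [round(v * peak / max_abs) for v in mixed]
    return mixed


frames = {
    "a": [1000, -1000],
    "b": [10, -10],
    "c": [5000, 5000],
    "d": [200, 200],
}
chosen = select_loudest(frames, n=2, threshold=1000.0)
mixed = mix_selected(frames, n=2, threshold=1000.0)
```

Here channel "b" falls below the assumed threshold and channel "d" is dropped by the cap of two, so only "c" and "a" contribute to the mixed frame. Because the selection is recomputed on every frame from a fluctuating energy estimate, a channel near the cutoff can drop in and out of the mix from one frame to the next.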
Although this modified method does in fact improve the perceptual quality of the mixed speech signal (by limiting the number of mixed channels), using signal volume as the selection criterion does not necessarily provide a high quality solution to the problem. High volume does not necessarily indicate the importance of a channel. For example, the use of this method may block important speakers with low voice volume. In addition, due to the inherent fluctuation of the energy estimate, the presence of a given channel in the mixed signal may not be continuous and consistent (even though it should be). Thus, in general, this method offers only a limited improvement in the quality of the mixed signal over the simple summing technique.