The present invention relates generally to data transfer and, particularly, to a method for handling a larger number of participants per conference in voice conferencing over packetized networks.
Conference calling, such as a conference by telephone or other like audio and/or visual device in which three or more persons in different locations participate by means of a central switching unit, enables participants in widely dispersed geographical areas to communicate efficiently in real time. Because of the great utility provided by conference calls, this method of communication has made its way into many aspects of modern life, connecting home users, wireless users, business personnel, and the like, so that multiple users can communicate with each other at the same time. In this way, a group of people may communicate directly without physically traveling to the same location. However, a conference call may encounter a large quantity of background noise, thereby reducing the quality and utility of the conference call.
Therefore, when mixing voice streams from multiple participants in a conference call, it is desirable to reduce background noise within the conference call as well as to reduce the computational resources required to provide the call. Previous methods utilized to correct for background noise fall into three groups: outputting to each participant the gain-corrected sum of all voices; outputting to each participant the gain-corrected sum of the voices of all other participants; and outputting only the loudest speaker to each participant.
While outputting to each participant the gain-corrected sum of all voices may be acceptable in circuit-switched networks, in which delays are low and participants cannot hear their own voice due to compensation by the human communication channel and brain of the participant, such a method is not feasible in a packetized network. For instance, in an environment where voice is transported over a packet network, the delay may be larger, so that participants may be able to hear their own voice, perceived as a disturbing echo. Such an echo is typically too strong to be removed utilizing normal echo cancellation. Further, its removal may be computationally expensive and require extensive resources, because the echo tail may be quite long, such as greater than 60 to 160 milliseconds (ms).
Outputting to each participant the gain-corrected sum of the voices of all other participants adds, in addition to the voice of active participants, the background noise of “silent” participants. Thus, as the number of participants increases, the background noise from “silent” participants also increases, thereby lowering the quality of the communication. Additionally, this technique is computationally expensive, since it may be necessary to perform a time-domain addition of (n−1) voices for each participant, n being the number of participants.
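By way of illustration only, the following Python sketch shows one way the "sum of all other participants" approach described above could be realized. The names `mix_all_others` and `clamp16`, the 16-bit sample format, and the single `gain` factor are assumptions made for the example and are not taken from any particular prior system. The nested loops make the cost visible: each of the n outputs requires adding (n−1) input frames.

```python
from typing import List

Frame = List[int]  # one frame of signed 16-bit PCM samples (assumed format)


def clamp16(value: float) -> int:
    """Round a mixed sample and clip it to the signed 16-bit range."""
    return max(-32768, min(32767, round(value)))


def mix_all_others(frames: List[Frame], gain: float = 1.0) -> List[Frame]:
    """Return, for each participant, the gain-corrected sum of all other voices.

    Assumes every frame has the same length. The work grows as n * (n - 1)
    frame additions for n participants.
    """
    outputs: List[Frame] = []
    for listener in range(len(frames)):
        mixed = [0.0] * len(frames[listener])
        for speaker, frame in enumerate(frames):
            if speaker == listener:
                continue  # a participant never receives their own voice
            for i, sample in enumerate(frame):
                mixed[i] += sample
        outputs.append([clamp16(gain * value) for value in mixed])
    return outputs


# Participant 0 is speaking; participants 1 and 2 are "silent" but noisy.
frames = [[1000, -1000], [3, -2], [2, 4]]
outputs = mix_all_others(frames)
```

In this example the speaker (participant 0) receives only the summed noise of the two silent participants, and every listener's output carries the noise of the other silent participant in addition to the speech, which is the quality problem noted above.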
Further, outputting only the loudest speaker to each participant generally suffers from insufficient voice quality. For example, in conference calls with high interactivity, switchovers between participants may be disturbing to the participants. During a switchover between loudest participants, information from one participant may be lost, thereby affecting the continuity of the call and the overall experience. Moreover, situations may be encountered within the call in which more than one participant wishes to speak at the same time. In such a situation, one of the inputs would not be provided to the other participants, and the originating participant may not even know whether that input was transmitted.
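As a rough illustration of the loudest-speaker approach and its drawback, the Python sketch below selects a single frame per interval by signal energy. The function names `frame_energy` and `loudest_only`, and the use of summed squared samples as the loudness measure, are assumptions made for the example; prior systems may have used other measures.

```python
from typing import List, Tuple

Frame = List[int]  # one frame of PCM samples (assumed format)


def frame_energy(frame: Frame) -> int:
    """Sum of squared samples, used here as a simple loudness measure."""
    return sum(sample * sample for sample in frame)


def loudest_only(frames: List[Frame]) -> Tuple[int, Frame]:
    """Return the index and frame of the loudest participant.

    Every other input for this interval is discarded, including that of a
    second participant who is speaking at nearly the same level.
    """
    loudest = max(range(len(frames)), key=lambda i: frame_energy(frames[i]))
    return loudest, frames[loudest]


# Participants 0 and 1 speak simultaneously; participant 2 is silent.
frames = [[900, -900], [1000, -1000], [2, 1]]
selected, output = loudest_only(frames)
```

Here participant 1 is selected and the speech of participant 0 is dropped entirely for the interval, even though the two levels are close, which corresponds to the lost input described above.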
Other techniques previously employed were insufficient for a variety of reasons. In a Voice over IP system that does not employ a multipoint control unit, each endpoint sent its data, in multicast, to the other endpoints. Thus, each endpoint received several voice streams and had to mix them. Because of computational constraints at the endpoints, this limited the size of the conference, such as to 3 or 4 participants. In a Voice over IP system with a multipoint control unit, each participant had their voice stream sent to the multipoint control unit. The voices of the participants were then mixed, and the result was sent individually to each participant in the conference. This technique rapidly saturates the network and significantly loads the IP stack in the multipoint control unit. For instance, the multipoint control unit may have to send the result of the mixing separately to each participant, thereby limiting the size of the conference. An additional solution for providing very large conferences involves allowing only one person to speak, thereby limiting the other participants to listening to the content, in effect working as a broadcast rather than a conference.
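The scaling behavior of the two Voice over IP arrangements described above can be made concrete with a small counting sketch in Python. The function names and dictionary keys are hypothetical, and the counts assume one voice stream per participant and no silence suppression.

```python
from typing import Dict


def endpoint_mixing_streams(n: int) -> Dict[str, int]:
    """Stream counts when each endpoint multicasts and mixes locally."""
    if n < 2:
        raise ValueError("a conference needs at least two participants")
    # Each endpoint sends one multicast stream and must receive and mix
    # the streams of all other endpoints.
    return {"sent_per_endpoint": 1, "received_per_endpoint": n - 1}


def mcu_mixing_streams(n: int) -> Dict[str, int]:
    """Stream counts when a multipoint control unit mixes centrally."""
    if n < 2:
        raise ValueError("a conference needs at least two participants")
    # The unit receives one stream per participant and sends the mixed
    # result separately to each participant.
    return {"received_by_mcu": n, "sent_by_mcu": n}


small = endpoint_mixing_streams(4)
large = mcu_mixing_streams(50)
```

In the first arrangement the mixing burden on every endpoint grows with the number of participants. In the second, the burden moves to the multipoint control unit, whose outbound stream count grows linearly with the size of the conference.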