1. Field of the Invention.
The present invention is related to the field of telephony using Voice over IP networks, and more specifically to devices, software and methods for teleconferencing over such networks.
2. Description of the Related Art.
Packet switched networks and related devices are becoming very efficient for voice communications. More specifically, two people can have a telephone conversation via a packet switched network using Voice over Internet Protocol (VoIP).
Often an encoder of the device of one person in a conversation includes a Voice Activity Detection (VAD) module. When the VAD module determines that the person is not speaking, it pauses transmitting sound, because that sound would be only background noise (also known as source noise). The pause conserves bandwidth, for as long as the user is silent. Instead of the full packetized audio stream, the encoder may occasionally transmit a Silence Indication (“SID”) packet, to indicate the connection is still open, but the user remains silent.
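The sender-side behavior described above can be sketched as follows. This is only an illustration under stated assumptions: the energy-threshold detector, the function names `frame_energy_db` and `encode_stream`, the -40 dB threshold, and the SID interval of 8 frames are all hypothetical and are not taken from any particular codec.

```python
import math

def frame_energy_db(frame):
    """Mean power of a frame of float samples, in dB (floored to avoid log(0))."""
    power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(power, 1e-12))

def encode_stream(frames, threshold_db=-40.0, sid_interval=8):
    """Yield ("VOICE", frame) while speech is detected. During silence, yield an
    occasional ("SID", noise_level_db) and suppress the other silent frames.
    The threshold and interval are assumed values, chosen for illustration."""
    silent_run = 0
    for frame in frames:
        level = frame_energy_db(frame)
        if level >= threshold_db:
            silent_run = 0
            yield ("VOICE", frame)
        else:
            # Send a SID at the start of a silent run, then every sid_interval frames.
            if silent_run % sid_interval == 0:
                yield ("SID", level)
            silent_run += 1
```

In this sketch, a run of silent frames produces far fewer packets than the same number of voiced frames, which is the bandwidth saving the VAD module is meant to provide.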
When this feature is activated, the party who is speaking hears absolutely nothing from the other side. As a result, the speaker may not know whether the line has been disconnected. Not knowing is disconcerting, especially for those who are used to regular telephone lines, where some background noise can be heard faintly. The disconcerted speaker might feel compelled to interrupt the flow of conversation regularly, e.g. by asking the other person a question, to keep checking that the connection is still good.
This problem has been ameliorated in the prior art by generating and playing out, in addition to the voices, a faint background noise to the participants while the connection is open. The faint noise gives the participants the comforting knowledge that the line is still open, which is why it is also known as comfort noise.
The comfort noise is generated by sampling a snapshot of the actual background noise of one participant, and encoding parameters of it in the SID packet. The encoded parameters may include the overall background noise level, or the level of each of the frequency components that make up the background noise. Once the SID packet is received, comfort noise is generated from these parameters and played continuously to the other participant.
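The receiving side can be illustrated with a minimal sketch. It assumes the SID packet carries only a single overall noise level in dB relative to full scale, and it synthesizes white Gaussian noise at that level; the function name `comfort_noise` and the choice of white noise are assumptions made for illustration, since a practical decoder may also shape the spectrum from per-frequency levels.

```python
import math
import random

def comfort_noise(level_db, n_samples, seed=None):
    """Generate n_samples of white Gaussian noise whose RMS matches level_db.
    level_db is assumed to be the noise level decoded from a SID packet."""
    rng = random.Random(seed)
    rms = 10.0 ** (level_db / 20.0)  # convert dB to linear amplitude
    return [rng.gauss(0.0, rms) for _ in range(n_samples)]

def rms_db(samples):
    """Measure the RMS level of a block of samples, in dB."""
    power = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(power, 1e-12))
```

The receiver would keep calling `comfort_noise` with the most recent SID level until voice packets resume.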
The generation of comfort noise by each participant presents problems when there is multi-party voice conferencing. These problems are now described, after a more detailed explanation of how voice conferencing works.
Referring to FIG. 1, an arrangement for a multi-party voice conference is shown. A conference bridge 100 is used to help conduct a multi-party conference between four network endpoints 122, 124, 126 and 128, corresponding respectively to User A, User B, User C and User D. Conference bridge 100 establishes, through an Internet Protocol (IP) cloud 110, respective VoIP connections 132, 134, 136, 138 with the four endpoints 122, 124, 126 and 128.
Each user can speak to all the others through conference bridge 100. Each endpoint 122, 124, 126, 128 generates an encoded packetized audio stream that is sent over the respective connections 132, 134, 136, 138 to conference bridge 100. Conference bridge 100 adds the received voices, and plays them to the participants, as is described below.
Conference bridge 100 includes a transcoding component 140. Transcoding component 140 includes a decoder 144 (also known as decoding portion 144), and an encoder 148 (also known as encoding portion 148). Transcoding component 140 preferably handles many different types of codecs (coder-decoder pairs), so as to be compatible with many different types of endpoints.
Decoder 144 receives four streams of packets 172, 174, 176 and 178 from endpoints 122, 124, 126, 128 respectively. The streams are channeled through decoder 144, which converts them into voice data.
Conference bridge 100 also includes a summing component 160, which encompasses a summer 164 (also known as adder 164). Summer 164 receives the voice data from decoder 144, and sums it into single streams of voice data, one for each user. Only a single such stream 180 is shown in FIG. 1, so as not to complicate the figure unnecessarily. Stream 180 is shown as receiving all the inputs to convey the main idea, although this is not necessarily the exact configuration. In better applications, each stream is destined for one of the participants, and does not include that participant's own input.
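The per-participant summing described above, in which each outgoing stream excludes the recipient's own input, can be sketched as follows. The function name `mix_minus` and the representation of decoded voice data as equal-length lists of samples keyed by participant are assumptions made for illustration; this is not a description of summer 164 itself.

```python
def mix_minus(frames):
    """frames maps each participant to one decoded frame (equal-length lists).
    Returns, for each participant, the sum of everyone else's samples."""
    n = len(next(iter(frames.values())))
    # Sum all inputs once, then subtract each participant's own contribution.
    total = [sum(f[i] for f in frames.values()) for i in range(n)]
    return {who: [total[i] - f[i] for i in range(n)] for who, f in frames.items()}
```

For example, with inputs from Users A, B and C, the stream returned for User A contains only the samples of Users B and C.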
Encoder 148 receives stream 180, and encodes it suitably for each of the codecs of each of the endpoints. Encoder 148 thus outputs four streams of packets 192, 194, 196, 198 that are transmitted respectively to endpoints 122, 124, 126, 128 over the respective VoIP connections 132, 134, 136, 138. This way, every one of endpoints 122, 124, 126, 128 receives an aggregate of all the inputs.
In a multi-party conference scenario, one of the users is typically the active speaker, while the others are silent. In such a case, summer 164 may receive comfort noise from all the remaining participants. Summer 164 may reject some of these inputs as not being loud enough compared to the speech of the active speaker.
If the active speaker pauses, or if there is silence by all the parties, then summer 164 receives only background noise from each of the channels. This is an undesirable situation for a number of reasons.
First, summer 164 always selects at least the loudest ones of the encoded background noises, and adds them for all the participants. Once these are added, they may be misidentified by encoder 148 as speech, not background noise.
Second, as the audio streams may be derived from different codecs, the encoding of the background noise levels may be mismatched. A background noise that is actually low in level may come to dominate the background noise of the overall conference simply because it was encoded differently. The problem is worse if that background noise was not the one intended to dominate.
Third, as the levels of comfort noise from each channel change, or if two channels happen to be encoded such that their levels are very similar, the selection algorithm of summer 164 may hop from one channel to the other. The active speaker especially may hear pops, clicks, and gargling noises, which are annoying.
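The channel-hopping effect can be illustrated with a small sketch. It assumes a simplified selection rule that picks only the single loudest channel in each frame; the function names `loudest_channel` and `count_hops` and the sample levels are hypothetical, and an actual summer may select several channels at once.

```python
def loudest_channel(levels_db):
    """Return the channel with the highest noise level in one frame."""
    return max(levels_db, key=levels_db.get)

def count_hops(level_frames):
    """Pick the loudest channel per frame and count how often the pick changes."""
    picks = [loudest_channel(f) for f in level_frames]
    hops = sum(1 for a, b in zip(picks, picks[1:]) if a != b)
    return picks, hops

# Two silent channels whose comfort noise levels differ by well under 1 dB.
frames = [
    {"B": -60.0, "C": -60.5},
    {"B": -60.6, "C": -60.2},
    {"B": -59.9, "C": -60.1},
]
picks, hops = count_hops(frames)
```

With these assumed levels the selection alternates between channels B and C on successive frames, even though neither party is speaking.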