In prior art intercom or conferencing systems, audio signals travelling between each endpoint and the intercom server are represented as channels. Each endpoint traditionally sends one channel of audio to the server and receives one channel of audio from the server.
Channelizing audio in a conferencing system makes it necessary to mix the audio of all active participants of the conference before transmitting it to each endpoint. In an intercom system in which each endpoint has the flexibility to decide whom it listens or talks to, the mixing is very computationally intensive, as each channel will carry a completely different listening experience. To provide such flexibility in a traditional intercom system, each participant's audio channel must be present at the server at all times, which rather quickly imposes hard limits on the number of endpoints.
A side effect of mixing is the addition of extra propagation delay. In order to mix audio, all channels must be time-aligned, which, in a packet-based system such as IP, creates the need for jitter buffers at the server. Moreover, mixing can only be done on linear, non-encoded signals, meaning that all signals must be decoded before being mixed and then re-encoded after mixing, thus substantially degrading the quality of the signal.
Referring now to FIG. 1, as an example, a conference bridge topology 3 will require each conference participant 5 to send its unidirectional audio stream to a local conference bridge 7. The local conference bridge 7 will provide each participant 5, as well as other connected conference bridges 7, with their own audio mix composed of all participants 5. This topology 3 is bandwidth efficient, as only one egress and one ingress signal need to be sent to each participant 5. This topology 3, however, requires a large amount of expensive processing resources at the bridge 7 to provide instant and dynamic multi-conferencing capability. For example, as shown in FIG. 1, supposing that Participant F leaves the conference but that Participant A wishes to continue to listen to Participant F in parallel with the conference, Conference Bridge 2 would have to send Participant F's audio to Conference Bridge 1.
The resulting audio signal for each participant 5 is a composite sum of signals provided by each party forming the union of the conference being monitored. For each audio signal arriving at the conference bridge 7, the following tasks must be performed: a) decompressing the signal; b) calculating the composite sum of all parties being monitored; and c) recompressing the resulting signal.
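As an illustration only, the three tasks above can be sketched as follows for a single listener. The names `decode`, `encode`, `clip16` and `mix_for_listener` are hypothetical, the codec functions are identity stand-ins rather than a real codec, and frames are assumed to be equal-length lists of 16-bit linear samples.

```python
from typing import Dict, List, Set

Samples = List[int]  # one frame of linear 16-bit PCM samples (assumption)


def decode(payload: Samples) -> Samples:
    """Hypothetical stand-in for a codec decoder (identity here)."""
    return list(payload)


def encode(frame: Samples) -> Samples:
    """Hypothetical stand-in for a codec encoder (identity here)."""
    return list(frame)


def clip16(value: int) -> int:
    """Saturate a summed sample to the signed 16-bit range."""
    return max(-32768, min(32767, value))


def mix_for_listener(listener: str,
                     incoming: Dict[str, Samples],
                     monitored: Set[str]) -> Samples:
    """Build the composite signal one listener receives from the bridge."""
    # a) decompress every monitored signal except the listener's own
    decoded = [decode(incoming[src])
               for src in sorted(monitored)
               if src != listener and src in incoming]
    if not decoded:
        return encode([])
    # b) composite sum, sample by sample, with saturation
    mixed = [clip16(sum(column)) for column in zip(*decoded)]
    # c) recompress the resulting signal
    return encode(mixed)
```

Because every listener may monitor a different set of parties, a bridge following this sketch would call `mix_for_listener` once per participant per frame, which is where the processing cost accumulates.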
The significant amount of computational resources necessary to mix and compress lowers the total number of possible participants 5 available on one conference bridge 7 and degrades voice quality.
Due to the packet-based nature of the transmission, jitter buffering is necessary at the conference bridge 7 to align all audio signals before they are mixed, which significantly increases communication delays.
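To make the source of this delay concrete, the following is a minimal sketch of a fixed-depth jitter buffer. The class `JitterBuffer`, the helper `added_delay_ms` and the 20 ms frame duration used in the usage note are assumptions made for illustration; they are not taken from any particular prior art system.

```python
import heapq
from typing import List, Optional, Tuple


class JitterBuffer:
    """Fixed-depth reordering buffer (illustrative only)."""

    def __init__(self, depth: int) -> None:
        self.depth = depth  # number of packets held back before release
        self._heap: List[Tuple[int, bytes]] = []

    def push(self, seq: int, payload: bytes) -> None:
        """Store an arriving packet, keyed by its sequence number."""
        heapq.heappush(self._heap, (seq, payload))

    def pop(self) -> Optional[Tuple[int, bytes]]:
        """Release the oldest packet once more than `depth` are queued."""
        if len(self._heap) > self.depth:
            return heapq.heappop(self._heap)
        return None


def added_delay_ms(depth: int, frame_ms: int) -> int:
    """Delay introduced by holding `depth` frames of `frame_ms` each."""
    return depth * frame_ms
```

Under the assumed 20 ms frame duration, a depth of two packets adds `added_delay_ms(2, 20)`, that is 40 ms, at the bridge, and a similar amount is added again wherever another jitter buffer sits in the path.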
Referring to FIG. 2, in another example, there is shown a simple traditional system with three endpoints 9, in which audio is sent from “endpoint 1” and “endpoint 3” to be received by “endpoint 2”. The three endpoints 9 are connected to a traditional intercom server 11. As shown, “endpoint 1” and “endpoint 3” have to encode their audio before sending it to the server 11. The traditional intercom server 11 receives the audio and needs to perform jitter-reducing calculations to time all channels together.
The intercom server 11 then decodes the audio and mixes it together. The resulting mix is then recompressed and forwarded to “endpoint 2”. “Endpoint 2” then has to perform jitter-reducing calculations and decode the audio before playback.
In addition to the deficiencies mentioned above, the endpoints 9 receiving the pre-mixed signal of all active participants have no means of knowing at any given time the origin of the speech being received (i.e., from which participants), and also have no means of performing signal processing on a per-participant basis, such as volume adjustments for specific endpoints or audio routing to different sound devices. For instance, for particular applications, it could be desirable to route the flight director's speech to a loudspeaker at a high volume while the rest of the participants are heard only through a headset.
It is also known in the art that a peer-to-peer (P2P) topology, in a multi-party voice conversation, will require a large amount of bandwidth, since each party needs to send its unidirectional audio stream to all other participants, and hence each party will receive the audio streams of all other participants. A 3-party conference call would produce six unidirectional audio streams. It will also require that each participant device perform local mixing of all incoming audio streams, which will demand an increasing amount of resources as the conference gets larger. This topology is appropriate when operated over a private Local Area Network (LAN) but clearly becomes inefficient when crossing sub-networks. It does, however, provide capabilities such as selective listening and participation in multiple intercom sessions.
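The growth in stream count can be checked with a short calculation. The sketch below assumes a full-mesh P2P conference and, for comparison, a single conference bridge with one ingress and one egress stream per participant; the function names are hypothetical.

```python
def p2p_streams(participants: int) -> int:
    """Unidirectional streams in a full mesh: each party sends to every other."""
    return participants * (participants - 1)


def bridge_streams(participants: int) -> int:
    """Unidirectional streams with one bridge: one ingress and one egress each."""
    return 2 * participants


def p2p_local_mix_inputs(participants: int) -> int:
    """Incoming streams each P2P participant must decode and mix locally."""
    return participants - 1
```

For three parties both topologies carry six streams, but at ten parties the full mesh carries 90 streams against 20 for the single bridge, and each P2P device must mix nine incoming streams.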
Known to the Applicant are the following U.S. patents and/or patent applications: U.S. Pat. Nos. 6,438,111 B1; 6,671,262 B1; 6,782,413 B1; 6,687,358 B1; 6,717,921 B1; 6,728,221 B1; 6,940,826; 6,956,828; 2005/0068904 A1; 2005/0122389 A1; 2005/0135280 A1; and 2006/0146737 A1.
None of the above-mentioned documents describes or suggests an intercom system that can balance bandwidth requirements against the need to provide the conference or intercom system participants with various intercom features, such as selective listening and multi-conferencing, without degrading voice quality or increasing delay.
Hence, in light of the aforementioned, there is a need for an improved intercom system, which by virtue of its design and components, would be able to overcome some of the above-discussed prior art problems.