Video and audio conferences, or conference calls, may be held involving multiple participants located at the same physical location, such as the same conference room, with other participants located at a remote location. When a video/audio conference includes multiple participants at the same physical locale, the participants' experience is often not optimal.
The “same locale” problem in video/audio conferencing arises from the fact that generally only one actual device is able to be active for the conference call in any given locale. The device will inevitably be further from some participants in the room than others. This results in varied intelligibility of participants based solely on how near or far the participants are from the single active microphone. Further, the experience of those participants in the shared locale itself is also less than ideal since those furthest from the device are often left struggling to hear the single in-room device's speaker output.
Generally, the reason that only one device per locale can be active at a time during a conference call has to do with the way that current Voice over Internet Protocol (VoIP) conferencing software is designed. When a participant connects to a cloud-based or peer-to-peer conference call, the participant is sent an audio mix of all participants except himself. This makes sense, as a participant has no need to hear his own voice echoed back when speaking.
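The per-connection mix described above can be illustrated with a minimal sketch. It assumes each connection contributes one frame of equal length, represented as a list of float samples; the function name `mix_minus` and the participant names are hypothetical and are not taken from any particular conferencing product.

```python
from typing import Dict, List


def mix_minus(streams: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """For each connection, return the sum of every OTHER connection's samples."""
    if not streams:
        return {}
    frame_len = len(next(iter(streams.values())))
    # Sum all streams once, then subtract each connection's own contribution.
    total = [sum(s[i] for s in streams.values()) for i in range(frame_len)]
    return {
        conn: [total[i] - own[i] for i in range(frame_len)]
        for conn, own in streams.items()
    }


# Hypothetical two-sample frame from three connections.
frame = {"alice": [1.0, 0.0], "bob": [0.0, 2.0], "carol": [4.0, 4.0]}
mixes = mix_minus(frame)
```

Note that nothing in this scheme records where a connection is physically located; the only signal excluded from a mix is the listener's own.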
This audio mixing scheme is generally done per-connection and is completely ignorant of each connection's locale or physical location. That is, a cloud-based or peer-to-peer conference call system or method is not aware of the physical location of each connection to the conference call. The audio mixing scheme works well for person-to-person calls and is the current approach employed by cloud-based conferencing services such as Skype, GoToMeeting, FaceTime, WebEx and others. However, on multi-participant calls where some participants share the same physical location, the experience degrades, and the problems associated with multiple participants at the same locale are compounded as more and more participants share the same locale. In other words, the more participants per physical locale, the more the overall experiential quality of the conference call degrades.
Using the conventional VoIP conferencing system, the conference experience might in principle be improved by having multiple participants in the same physical locale use multiple devices to join the conference. That is, every participant would have his/her own microphone and own set of speakers. However, using multiple active devices at the same physical location to hold a conventional VoIP conference call is actually not possible. The first issue is audio feedback. Since each same-locale participant receives an audio mix with only his own microphone absent, the participant will still receive the microphone signals of his colleagues who are connected in the same physical locale. A colleague's audio will come out of the participant's speaker, go back into the colleague's microphone where the signal originated, come out of the participant's speaker again, and so on. This results in the same classic audio feedback problem associated with a Public Address (PA) system when the speaker accidentally points his microphone at the PA speaker: a highly unpleasant, high-pitched squeal ensues.
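The runaway nature of this loop can be shown with a deliberately simplified model. The sketch below assumes that each trip around the microphone-to-speaker loop multiplies the signal amplitude by a single constant, here called `loop_gain`; that parameter and the function name are illustrative assumptions, and a real room would have frequency-dependent gain and delay.

```python
from typing import List


def loop_amplitudes(initial: float, loop_gain: float, trips: int) -> List[float]:
    """Amplitude of a sound after each trip around the mic -> speaker -> mic loop."""
    amps = [initial]
    for _ in range(trips):
        # Each round trip scales the previous amplitude by the loop gain.
        amps.append(amps[-1] * loop_gain)
    return amps


# A loop gain above 1 grows without bound (heard as a squeal);
# a loop gain below 1 dies away.
growing = loop_amplitudes(1.0, 2.0, 3)
decaying = loop_amplitudes(1.0, 0.5, 2)
```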
In the conventional cloud-based VoIP conferencing system, the feedback problem can be eliminated by giving the audio server that produces each connection's audio mix a token identifying which participants share the same physical locale. The conferencing system would use that token to produce locale-based mixes, where the mix sent to any connection in a particular physical locale would omit the microphone signals of all devices sharing that locale. Thus, if participant A and participant B are in the same room, participant A's microphone signal would not emanate from participant B's speaker and vice versa. The feedback problem is thus solved.
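A locale-based mix of this kind might be sketched as follows. The mapping `locale_of`, which stands in for the locale token, and the function name `locale_mix` are hypothetical; the sketch also assumes one equal-length frame of float samples per connection.

```python
from typing import Dict, List


def locale_mix(
    streams: Dict[str, List[float]], locale_of: Dict[str, str]
) -> Dict[str, List[float]]:
    """For each listener, sum only the streams that originate in a different locale."""
    if not streams:
        return {}
    frame_len = len(next(iter(streams.values())))
    mixes: Dict[str, List[float]] = {}
    for listener in streams:
        mix = [0.0] * frame_len
        for speaker, samples in streams.items():
            # Skip every microphone in the listener's own locale.
            # This also skips the listener's own microphone.
            if locale_of[speaker] == locale_of[listener]:
                continue
            for i, value in enumerate(samples):
                mix[i] += value
        mixes[listener] = mix
    return mixes


# Participants A and B share a room; C is remote (hypothetical values).
streams = {"A": [1.0], "B": [2.0], "C": [4.0]}
locale_of = {"A": "room-1", "B": "room-1", "C": "remote"}
mixes = locale_mix(streams, locale_of)
```

In this example A and B each receive only C's signal, while C still receives both in-room microphones.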
However, even with the feedback problem solved, other problems remain. For instance, each participant's audio client connection will have a different latency. Devices generally have different audio hardware and driver buffer sizes, which is one factor contributing to connection latency variations. An even bigger contributor to differing connection latencies is the nature of the Internet itself.
In a packet-based VoIP system, one technique employed to create the illusion of a continuous media stream is the use of adaptive jitter buffering. These buffers dynamically adjust to an ideal size to minimize packet loss with the lowest possible latency based on current network conditions. Since network conditions change often due to congestion, topologies, WiFi loss, etc., each audio client will generally end up with differently sized jitter buffers relative to one another throughout the course of a conference call.
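To make the behavior concrete, the following is a minimal sketch of one way a jitter buffer could adapt its target depth. It is not the algorithm of any specific product: the class name, the `safety` multiplier, and the 20 ms and 200 ms bounds are all assumptions chosen for illustration, and the smoothing step follows the general form of the interarrival jitter estimate used with RTP (a running average updated by one sixteenth of the new deviation).

```python
from typing import Optional


class AdaptiveJitterBuffer:
    """Tracks network jitter and derives a target buffer depth in milliseconds."""

    def __init__(self, min_ms: float = 20.0, max_ms: float = 200.0,
                 safety: float = 4.0) -> None:
        self.min_ms = min_ms      # assumed lower bound on buffer depth
        self.max_ms = max_ms      # assumed upper bound on buffer depth
        self.safety = safety      # assumed multiple of jitter to buffer
        self.jitter_ms = 0.0
        self.prev_transit: Optional[float] = None

    def on_packet(self, send_ms: float, recv_ms: float) -> float:
        """Update the jitter estimate from one packet; return the new target depth."""
        transit = recv_ms - send_ms
        if self.prev_transit is not None:
            deviation = abs(transit - self.prev_transit)
            # Exponentially smoothed estimate of transit-time variation.
            self.jitter_ms += (deviation - self.jitter_ms) / 16.0
        self.prev_transit = transit
        return self.target_ms()

    def target_ms(self) -> float:
        """Buffer depth: a multiple of the jitter, clamped to the allowed range."""
        return max(self.min_ms, min(self.max_ms, self.safety * self.jitter_ms))


# Two clients in the same call: one on a steady link, one on a jittery link.
stable, jittery = AdaptiveJitterBuffer(), AdaptiveJitterBuffer()
for n in range(10):
    send = n * 20.0
    stable.on_packet(send, send + 50.0)
    jittery.on_packet(send, send + (50.0 if n % 2 == 0 else 130.0))
```

After these ten packets the two clients hold different buffer depths, which is the per-client latency difference the paragraph above describes.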
The end result of these differing connection latencies is a poor conference experience on both ends of the call. In the shared physical locale, the participants will experience unsynchronized speaker signals among the in-room devices, which will render the conference call unusable. In the remote locale, the participants will have a similar experience. The microphone signal the remote participants will be receiving from the shared locale will contain a mix of all of the in-room device microphones, which will be similarly desynchronized and jumbled because of the differing latencies.
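As a small worked example of the desynchronization, the sketch below computes how far each in-room device lags behind the earliest one. The device names and latency figures are hypothetical and serve only to illustrate the arithmetic.

```python
from typing import Dict


def playout_skew_ms(latencies_ms: Dict[str, float]) -> Dict[str, float]:
    """Return each device's playout lag behind the earliest device, in ms."""
    earliest = min(latencies_ms.values())
    return {device: latency - earliest for device, latency in latencies_ms.items()}


# Assumed end-to-end latencies for three devices in one room.
room = {"laptop": 60.0, "phone": 140.0, "tablet": 95.0}
skew = playout_skew_ms(room)
```

With these assumed figures the same remote speech would leave the phone's speaker 80 ms after the laptop's, so listeners in the room would hear three offset copies.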
A side effect of the desynchronization in the shared locale is that the in-room devices' Acoustic Echo Cancellation (AEC) software will fail to function for its intended purpose of echo cancellation. AECs are designed to cancel the remote (far-end) signal being played out of a device's speakers from the device's microphone (near-end) signal being sent up to the audio server. The desynchronization of the signals at the shared locale will cause the AEC operation to become ineffective, so that remote participants end up hearing themselves echoed back any time they speak. This severely degrades the participant experience. In some cases, the remote participants may hear their own voices echoed multiple times, depending on the number of devices in the shared locale. Therefore, the conventional VoIP conference system does not permit multiple active devices to be used at the same physical location for holding a conference call.
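The failure mode can be illustrated with a toy model. Real AECs use adaptive filters; the sketch below instead assumes a canceller that subtracts the far-end signal at one fixed, known delay, which is a deliberate simplification. The helper names and the delays of 1 and 4 samples are hypothetical.

```python
from typing import List


def delayed(signal: List[float], delay: int, length: int) -> List[float]:
    """Return `signal` shifted right by `delay` samples, zero-padded to `length`."""
    return [
        signal[n - delay] if 0 <= n - delay < len(signal) else 0.0
        for n in range(length)
    ]


def cancel(mic: List[float], far_end: List[float], assumed_delay: int) -> List[float]:
    """Subtract the far-end reference from the mic at the single delay the canceller expects."""
    reference = delayed(far_end, assumed_delay, len(mic))
    return [m - r for m, r in zip(mic, reference)]


far_end = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # one click from the remote talker
own_echo = delayed(far_end, 1, 6)           # echo from this device's own speaker
neighbor_echo = delayed(far_end, 4, 6)      # same click from a later, neighboring device

# One device in the room: the echo arrives at the expected delay and is removed.
residual_single = cancel(own_echo, far_end, 1)

# Two devices in the room: the neighbor's late copy is not modeled and leaks through.
mic_shared = [a + b for a, b in zip(own_echo, neighbor_echo)]
residual_shared = cancel(mic_shared, far_end, 1)
```

In this toy model each additional unsynchronized device adds another uncancelled copy, consistent with remote participants hearing themselves echoed multiple times.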