An audio conferencing terminal may be used both to pick up audio from, and play audio to, one or more people who are physically distant from, but in the same room as, the terminal. This is in contrast with a telephone handset, in which the microphone and speaker are adjacent to the mouth and ear of a single talker. Some audio conferencing terminals are capable of producing more than one audio stream containing audio from the room in which that terminal resides (commonly known as near-end audio). As an example, the terminal may include or take inputs from several microphones, each of which produces an audio stream. Alternatively, it may perform beamforming or other processing on the outputs of several microphones, to produce multiple audio streams representing the streams that would emanate from virtual microphones or speakers at circumferential locations between the actual physical microphones or speakers. In such cases, it is often necessary to select one or more of the available audio streams that best represent near-end speech, for transmission to one or more remote endpoints. To this end, methods for selecting the best audio stream or streams from among several candidates have been developed.
From the viewpoint of the conference user, the best audio stream is that which most faithfully reproduces speech from an active talker (near-end speech). If there is no active talker in range of the conferencing terminal's microphones, the stream selection should remain fixed; in this case the selected stream may be that which was chosen for the previous active talker, or a fixed “no active talker” selection.
Reliable stream selection enhances the performance of the conferencing terminal by ensuring that the near-end audio is as clear as possible. In systems with spatial processing, such as beamforming, the best stream(s) may provide gain to the acoustic signal from an active talker(s), and attenuation to noise and echo signals. Thus reliable stream selection enhances the signal-to-noise and signal-to-echo ratios of the conferencing terminal, thereby improving the quality of the audio stream sent from the terminal.
In many applications, the audio streams entering the stream selection subsystem are known to have highest power for audio signals arriving from known directions. This is the case when the streams are the outputs of a beamformer, which processes the outputs of an array of microphones having a known geometry to produce streams having maximum energy for audio sources from specific directions. See Jacob Benesty, “On microphone array beamforming from a MIMO Acoustic signal processing perspective” IEEE Trans. Audio, Speech, and Language Processing, vol. 15, No. 3, March 2007 p. 1053; and S. Doclo and M. Moonen, “Design of broadband beamformers robust against gain and phase errors in the microphone array characteristics,” IEEE Trans. Signal Processing, vol. 51, no. 10, pp. 2511-2526, 2003. A similar situation arises when the streams are the outputs of fixed directional microphones. In such cases, stream selection provides information about the location of a talker, who is on or near a line extending from the audio terminal, in the direction of the maximum response of the selected stream. Such localization information is useful in systems which attempt to reproduce the spatial characteristics of sent audio at the playing end. Thus reliable stream selection is an important component within a conferencing system with spatial audio capabilities.
In systems utilizing fixed-geometry microphone arrays, an alternative to selecting one stream from a set of candidate streams is to estimate the direction of arrival of the acoustic signal, which in turn allows formation of an optimum single stream. Direction of arrival may be estimated from time-of-arrival measurements, or via beam-sweeping methods.
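As an illustration of the time-of-arrival approach, the following minimal sketch estimates the delay between two microphone signals by brute-force cross-correlation and converts it to an arrival angle. It is not taken from the cited references; the function names, the two-microphone geometry, the far-field assumption and the speed-of-sound constant are all assumptions made for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def estimate_delay_samples(x, y, max_lag):
    """Return the lag (in samples) at which y best matches a delayed copy of x.

    A positive result means the sound reached microphone y later than x.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(x)):
            m = n + lag
            if 0 <= m < len(y):
                score += x[n] * y[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle_deg(delay_samples, sample_rate_hz, mic_spacing_m):
    """Convert an inter-microphone delay to an angle from broadside.

    Assumes a far-field (plane-wave) source and two microphones.
    """
    path_difference_m = delay_samples / sample_rate_hz * SPEED_OF_SOUND
    # Clamp so that measurement noise cannot push asin() out of its domain.
    ratio = max(-1.0, min(1.0, path_difference_m / mic_spacing_m))
    return math.degrees(math.asin(ratio))


# Hypothetical signals: y is x delayed by two samples.
x = [0.0, 0.0, 1.0, 0.5, -0.3, 0.0, 0.0, 0.0, 0.0, 0.0]
y = [0.0, 0.0, 0.0, 0.0, 1.0, 0.5, -0.3, 0.0, 0.0, 0.0]
delay = estimate_delay_samples(x, y, max_lag=4)
angle = arrival_angle_deg(delay, sample_rate_hz=16000, mic_spacing_m=0.1)
```

The exhaustive lag search shown here costs on the order of the frame length times the number of lags per microphone pair, which hints at why such estimators can be more demanding than a fixed beamformer followed by stream selection.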
Alternatively, the localization and beamforming tasks can be combined in a system which uses a single method to both determine direction-of-arrival, and to combine microphone outputs into a single stream maximizing SNR or other performance metric in this direction.
These techniques tend to require greater computing power than systems utilizing a fixed beamformer followed by a stream selection subsystem. For a conferencing terminal in which computational resources are limited by constraints on space, power or cost, the latter type of system is often a better choice than one of the more adaptive localization techniques.
The task of selecting the best audio stream is made more difficult by the presence of acoustic noise sources, and also by the acoustic signal being played from the speakers (e.g., speech and noise from the remote conference participants, also known as far-end audio). For example, if selection is based only on acoustic signal strength, the selected stream may be the one that most faithfully reproduces noise or the audio from the external speakers, when these are the loudest source in the room. Acoustic echo of near-end audio also makes the stream selection task more difficult: although the correct stream may be selected while a talker is speaking, the selection may shift to acoustic echoes of the speech after the person stops talking.
The basic elements of a stream selection subsystem include a metric or measurement of the performance of each stream, methods for comparing streams based on this metric and selecting for the best one(s), and a control entity. These three elements between them generally provide one or more mechanisms for choosing the best stream for representing near-end speech, while preventing switches to streams that best represent other sounds including background noise, acoustic echo from the speakers, or echoes (reverberation) of the near-end speech.
Solutions using a metric based on signal-to-noise ratio (SNR) can fail to track a talker near a noise source; that is, such solutions can select a stream that does not correspond to a beam pointing at the talker's spatial location, but which nevertheless has a higher SNR. Although in some applications the highest-SNR stream might still be the preferred stream, this does not work for spatial audio. Furthermore, spatial processing, such as beamforming, may change frequency response considerably for audio not arriving from the stream's main point directions, with the result that audio from a direction not in line with the main point direction of the selected spatial processing algorithm has an unrealistic filtered sound.
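The failure mode of an SNR metric can be seen in a small numerical sketch. The power values below are invented for illustration, and the helper names are hypothetical: stream 0 stands for a beam aimed at a talker who sits next to a noise source, and stream 1 for an adjacent beam that picks up less speech but much less noise.

```python
import math


def snr_db(signal_power, noise_power, floor=1e-12):
    """Signal-to-noise ratio in dB; the floor avoids log of zero."""
    return 10.0 * math.log10(max(signal_power, floor) / max(noise_power, floor))


def select_by_snr(signal_powers, noise_powers):
    """Return (index of the highest-SNR stream, list of per-stream SNRs)."""
    snrs = [snr_db(s, n) for s, n in zip(signal_powers, noise_powers)]
    best = max(range(len(snrs)), key=snrs.__getitem__)
    return best, snrs


# Invented example values (linear power units).
speech_power = [1.0, 0.6]   # stream 0 points at the talker
noise_power = [0.5, 0.1]    # the noise source is also in stream 0's beam
best, snrs = select_by_snr(speech_power, noise_power)
```

With these values the off-axis stream 1 wins on SNR even though stream 0 is the one pointing at the talker, so the selected index no longer indicates the talker's direction.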
Solutions using a metric based purely on power may be confused by the presence of noise that is louder on one stream than on another. Without proper control logic, a power-based metric can cause the noisy stream to be selected whenever louder speech is not present. Power-based solutions have therefore necessitated the use of control logic that gates stream measurements and/or comparisons by the presence of near-end speech activity.
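A minimal sketch of such gating is shown below. The speech-activity flag is assumed to come from a separate voice activity detector that is not shown, and the function names are hypothetical.

```python
def frame_power(samples):
    """Mean-square power of one frame of samples."""
    return sum(s * s for s in samples) / len(samples) if samples else 0.0


def select_by_power(frames, current, speech_active):
    """Pick the highest-power stream, but only while near-end speech is present.

    frames: one list of samples per candidate stream.
    current: index of the currently selected stream.
    speech_active: flag from an external (assumed) voice activity detector.
    """
    if not speech_active:
        # Gate closed: hold the previous selection instead of chasing noise.
        return current
    powers = [frame_power(f) for f in frames]
    return max(range(len(powers)), key=powers.__getitem__)


# Invented noise-only frames: stream 1 is noisier than stream 0.
noise_frames = [[0.01, -0.01, 0.01], [0.2, -0.2, 0.2]]
gated = select_by_power(noise_frames, current=0, speech_active=False)
ungated = select_by_power(noise_frames, current=0, speech_active=True)
```

With the gate closed the selection stays on stream 0; if the same frames were treated as speech, the noisier stream 1 would be chosen.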
Selection methods using a level-based threshold ignore information about the relative strengths of different streams over time. A stream with a slight advantage might not exceed the selected stream by the threshold amount, and so may never be selected. This can occur when the signal is weak (as is the case when talkers are far from the audio terminal), or when there is a large overlap between streams in terms of their performance in representing near-end speech. The threshold may be lowered to select such weakly advantageous streams; however, this may result in spurious switches to incorrect streams when the signal is strong.
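The trade-off can be illustrated with a simple threshold rule. This is a generic sketch rather than any particular prior-art implementation; the function name and the dB values are assumptions.

```python
def select_with_level_threshold(levels_db, current, threshold_db):
    """Switch to the strongest stream only if it beats the current one
    by at least threshold_db; otherwise keep the current selection."""
    best = max(range(len(levels_db)), key=levels_db.__getitem__)
    if best != current and levels_db[best] - levels_db[current] >= threshold_db:
        return best
    return current


# Distant talker: stream 1 is consistently better, but only by 1.5 dB.
weak = [-40.0, -38.5]
stays = select_with_level_threshold(weak, current=0, threshold_db=3.0)
switches = select_with_level_threshold(weak, current=0, threshold_db=1.0)
```

With a 3 dB threshold the slightly better stream is never chosen, no matter how long its advantage persists. Lowering the threshold to 1 dB fixes this case, but the same 1 dB margin is then easily exceeded by momentary fluctuations when the signal is strong.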
A similar problem occurs when stream switching is based on a time-based threshold. If the time window is chosen to prevent spurious switches (e.g., due to reverberations or noise fluctuations) at a wide range of signal levels, it will be unnecessarily slow to switch when there is a stronger advantage for the new stream over the old (which may occur in the majority of cases).
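One common form of time-based threshold requires a candidate stream to remain the strongest for a fixed number of consecutive frames before a switch is allowed. The sketch below assumes that form; the class name and the frame count are hypothetical.

```python
class HoldTimeSelector:
    """Switch streams only after a challenger has been strongest for
    hold_frames consecutive frames (an assumed form of time threshold)."""

    def __init__(self, initial, hold_frames):
        self.selected = initial
        self.hold_frames = hold_frames
        self._candidate = initial
        self._count = 0

    def update(self, levels_db):
        best = max(range(len(levels_db)), key=levels_db.__getitem__)
        if best == self.selected:
            # Current stream is strongest again: discard any challenger.
            self._candidate, self._count = best, 0
        elif best == self._candidate:
            self._count += 1
        else:
            # A new challenger appears: start counting from one.
            self._candidate, self._count = best, 1
        if self._candidate != self.selected and self._count >= self.hold_frames:
            self.selected = self._candidate
            self._count = 0
        return self.selected


# Stream 1 is 20 dB stronger, yet the switch still waits three frames.
selector = HoldTimeSelector(initial=0, hold_frames=3)
history = [selector.update([-30.0, -10.0]) for _ in range(3)]
```

Because the counter ignores the size of the margin, a 20 dB advantage is delayed exactly as long as a marginal one.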
The use of the windowing mechanism described in US2002/001389A1 and U.S. Pat. No. 7,130,797 is one way to circumvent the difficulties described above. This mechanism uses a level-based threshold to arrive at instantaneous best-stream decisions, but then saves these instantaneous decisions in a FIFO. It then processes previous decisions over the length of the FIFO to determine the final best-stream output. However, since the instantaneous decisions use a level-based threshold, they lose information about the degree to which one stream was better than another at any instant. As such, they are vulnerable to reverberations and echoes. In U.S. Pat. No. 7,130,797, this difficulty is addressed by adding a second, short-time best-stream estimate. This “direct path” estimate is used to weight the instantaneous measurements that are used to produce a final best-stream estimate. This modification adds complexity to the overall stream selection subsystem, which translates into increased time and cost to implement the solution.
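The general shape of a FIFO-based windowing scheme can be sketched as follows. This is a simplified illustration and not the method of the cited documents: the text above does not say how the stored decisions are combined, so the majority vote used here, along with the class name and parameters, is an assumption.

```python
from collections import Counter, deque


class FifoVoteSelector:
    """Store thresholded per-frame decisions in a FIFO and pick the stream
    holding a strict majority of them (the majority rule is an assumption)."""

    def __init__(self, initial, fifo_length, threshold_db):
        self.selected = initial
        self.threshold_db = threshold_db
        # Pre-fill the FIFO so that the initial selection starts with all votes.
        self._fifo = deque([initial] * fifo_length, maxlen=fifo_length)

    def _instantaneous(self, levels_db):
        """Level-thresholded decision for a single frame."""
        best = max(range(len(levels_db)), key=levels_db.__getitem__)
        if levels_db[best] - levels_db[self.selected] >= self.threshold_db:
            return best
        return self.selected

    def update(self, levels_db):
        self._fifo.append(self._instantaneous(levels_db))
        winner, votes = Counter(self._fifo).most_common(1)[0]
        if votes > len(self._fifo) // 2:
            self.selected = winner
        return self.selected


# A 20 dB margin and a 3.1 dB margin each cast exactly one vote per frame.
strong = FifoVoteSelector(initial=0, fifo_length=5, threshold_db=3.0)
barely = FifoVoteSelector(initial=0, fifo_length=5, threshold_db=3.0)
strong_history = [strong.update([-30.0, -10.0]) for _ in range(3)]
barely_history = [barely.update([-30.0, -26.9]) for _ in range(3)]
```

Both selectors follow the same trajectory, which illustrates the information loss noted above: once a frame's comparison has been reduced to a single index, the FIFO cannot distinguish a decisive advantage from a marginal one.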