In the past, the voices of speakers in a multi-party audio conference system have typically been rendered to the listeners as a monaural audio stream: the voices are essentially overlaid on top of each other and, when headphones are used, are usually perceived by the listener as located “within the head”.
A virtual spatial audio conference system, which is a special form of a multi-party telemeeting as defined by ITU-T recommendation P.1301, “Subjective quality evaluation of audio and audiovisual multiparty telemeetings”, enables a 3D audio rendering of the voices of the participants. That is, the participants' voices are placed at different “virtual” locations in space by means of spatial filters. These filters are derived from head-related impulse responses (HRIRs) or their frequency-domain representations, head-related transfer functions (HRTFs), and/or from binaural room impulse responses (BRIRs) or their frequency-domain representations, binaural room transfer functions (BRTFs). They encode the auditory cues humans use for spatial sound perception, namely the interaural time difference (ITD), the interaural level difference (ILD) and spectral cues, and, in the case of BRIRs, also room acoustic information such as reverberation. The benefit of 3D audio rendering relative to a monaural audio stream of the participants' voices is not only a more natural conference experience but also substantially enhanced speech intelligibility. It has been shown that this psychoacoustic effect, scientifically known as spatial release from masking, can improve speech intelligibility by up to 12-13 dB when a target speaker and competing speakers, typically referred to as maskers, are virtually spatially separated.
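The rendering principle above can be illustrated with a short sketch. For brevity, convolution with measured HRIR filters is replaced here by a crude ITD/ILD approximation (a Woodworth-style interaural delay plus a level difference); all function names and parameter values are illustrative assumptions, not part of any cited system:

```python
import numpy as np

def render_binaural(mono, azimuth_deg, fs=16000):
    # Simplified stand-in for HRIR convolution: an interaural time
    # difference (Woodworth model, head radius ~8.75 cm) plus an
    # interaural level difference of up to ~6 dB at +/-90 degrees.
    az = np.deg2rad(abs(azimuth_deg))
    itd_s = (0.0875 / 343.0) * (az + np.sin(az))        # seconds
    delay = int(round(itd_s * fs))                      # samples
    ild_gain = 10.0 ** (-(abs(azimuth_deg) / 90.0) * 6.0 / 20.0)
    near = mono
    far = np.concatenate([np.zeros(delay), mono])[:len(mono)] * ild_gain
    # Positive azimuth = source on the right, so the right ear is "near".
    left, right = (far, near) if azimuth_deg >= 0 else (near, far)
    return np.stack([left, right])                      # shape (2, n_samples)

def mix_conference(voices, azimuths, fs=16000):
    # Render each participant at a distinct virtual azimuth and sum
    # everything into one binaural (stereo) mix for the listener.
    n = max(len(v) for v in voices)
    mix = np.zeros((2, n))
    for v, az in zip(voices, azimuths):
        mix[:, :len(v)] += render_binaural(v, az, fs)
    return mix
```

In a real system the `near`/`far` lines would be replaced by convolving the mono voice with the left- and right-ear HRIRs (or BRIRs) measured for the chosen direction.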
U.S. Pat. No. 7,391,877 describes a spatial sound processor that virtually distributes speakers over non-equidistant positions along a circle centered at the listener's position. Based on results from psychoacoustic tests on speech identification, the system starts with a relatively small virtual spatial separation for speakers placed in front of the listener. The virtual spatial separation between speakers is then increased as speakers are placed at more lateral positions. For directions at ±90 degrees in azimuth, two virtual speaker locations are proposed, one in the far field and one in the near field. Similar solutions based on either equidistant or non-equidistant speaker positions are described in WO2013/142641 and WO2013/142668.
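A non-equidistant placement of this general kind can be sketched as follows; the starting separation, the growth factor and the left/right alternation scheme below are hypothetical parameters chosen only to illustrate the idea of widening the separation toward lateral positions, not the values of the cited patent:

```python
def nonuniform_azimuths(n_speakers, front_sep=10.0, growth=1.5, max_az=90.0):
    # First speaker straight ahead; further speakers alternate left/right
    # of center with an angular separation that grows toward the sides,
    # capped at max_az degrees of azimuth.
    azimuths = [0.0]
    offset, step = 0.0, front_sep
    while len(azimuths) < n_speakers:
        offset = min(offset + step, max_az)
        for a in (offset, -offset):
            if len(azimuths) < n_speakers:
                azimuths.append(a)
        step *= growth
    return azimuths
```

For five participants this yields 0°, ±10° and ±25°, i.e. a 10° separation in front that grows to 15° toward the sides.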
There have been some attempts to use the information contained in the voice signals themselves to enhance speech intelligibility. These attempts, i.e. the use of voice information to separate the target speaker from the maskers, rely heavily on the amount of spectral overlap that exists between the target speaker and the maskers, i.e. on energetic masking. Ideal time-frequency binary masks have been proposed, for instance in Brungart et al., “Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation”, J. Acoust. Soc. Am., vol. 120, no. 6, 2006, in order to remove the time-frequency regions where the maskers' energy dominates and to preserve only those time-frequency regions where the energy of the target's voice dominates. Such masks are called ideal because access to the clean (original) speech signals of the target speaker and the masker(s) is required. More specifically, a priori knowledge about the target speaker and the masker speakers is required so that those time-frequency regions of the acoustic mixture dominated by the target speaker can be preserved. In practice, however, the target speaker is sometimes not known a priori or is variable. In a virtual spatial audio conference, for instance, each participant can be the target speaker for a certain period of time.
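The ideal time-frequency binary mask can be sketched as follows. The frame length and the non-overlapping rectangular-window transform are simplifying assumptions for illustration (a practical system would use an overlapping windowed STFT); the essential point is the oracle comparison of the clean target and masker spectra:

```python
import numpy as np

def ideal_binary_mask(target, masker, frame=256):
    # "Ideal" because both clean signals are assumed known: keep only
    # those time-frequency cells of the mixture in which the target's
    # magnitude exceeds the masker's, and zero out all other cells.
    n = (min(len(target), len(masker)) // frame) * frame
    T = np.fft.rfft(target[:n].reshape(-1, frame), axis=1)  # target spectrogram
    M = np.fft.rfft(masker[:n].reshape(-1, frame), axis=1)  # masker spectrogram
    X = T + M                                               # mixture (linear mixing)
    mask = np.abs(T) > np.abs(M)                            # binary T-F mask
    return np.fft.irfft(X * mask, n=frame, axis=1).ravel()
```

With tones in disjoint frequency bins the mask recovers the target almost exactly; with two overlapping voices it can only suppress the masker-dominated cells, which is precisely where the dependence on spectral overlap noted above becomes a limitation.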
Thus, there is a need for an improved audio signal processing apparatus and method, in particular an audio signal processing apparatus and method improving speech intelligibility in a virtual spatial audio conference system.