The field of the invention is multi-talker communication systems. Many important communications tasks require listeners to extract information from a target speech signal that is masked by one or more competing talkers. In real-world environments, listeners are generally able to take advantage of the binaural difference cues that occur when competing talkers originate at different locations relative to the listener's head. This so-called “cocktail party” effect allows listeners to perform much better when listening to multiple voices in real-world environments, where the talkers are spatially separated, than when listening over conventional electroacoustic communications systems, where the speech signals are electronically mixed together into a single signal that is presented monaurally or diotically over headphones.
Prior art has recognized that the performance of multi-talker communications systems can be greatly improved when signal-processing techniques are used to reproduce the binaural cues that normally occur when competing talkers are spatially separated in the real world. These spatial audio displays typically use filters designed to reproduce the linear transformations that occur when audio signals propagate from a distant sound source to the listener's left or right ear. These transformations are generally referred to as head-related transfer functions, or HRTFs. If a sound source is processed with digital filters that match the HRTFs of the left and right ears and then presented to the listener through stereo headphones, it will appear to originate from the location relative to the listener's head where the HRTF was measured. Prior research has shown that speech intelligibility in multi-channel speech displays is substantially improved when the competing talkers are processed with HRTF filters for different locations before they are presented to the listener.
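The HRTF filtering described above can be illustrated with a minimal sketch. It assumes each HRTF is available as a short time-domain impulse response (one per ear) and applies it by direct convolution. The function names (`convolve`, `spatialize`, `mix`) and the two-tap impulse responses in the usage lines are hypothetical placeholders chosen for illustration; they are not measured HRTFs and are not taken from any of the systems discussed here.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of a signal with an FIR impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def spatialize(mono, hrir_left, hrir_right):
    """Filter one talker with the left-ear and right-ear impulse responses."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


def mix(channels):
    """Sum several signals sample by sample; shorter ones are treated as zero-padded."""
    n = max(len(c) for c in channels)
    return [sum(c[k] for c in channels if k < len(c)) for k in range(n)]


# Hypothetical toy impulse responses (NOT measured HRTFs): the far ear
# receives the sound two samples later and at half the amplitude.
NEAR_EAR = [1.0]
FAR_EAR = [0.0, 0.0, 0.5]

talker_a = [1.0, 0.0]  # placed to the listener's left
talker_b = [0.0, 1.0]  # placed to the listener's right

a_left, a_right = spatialize(talker_a, NEAR_EAR, FAR_EAR)
b_left, b_right = spatialize(talker_b, FAR_EAR, NEAR_EAR)

# Each ear signal is the sum of every talker's contribution to that ear.
left_ear = mix([a_left, b_left])
right_ear = mix([a_right, b_right])
```

In a practical display the impulse responses would be measured or modeled HRTFs hundreds of samples long, and the convolution would normally be done with an FFT-based routine rather than the nested loop used here for clarity.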
TABLE 1. Summary of locations used to spatially separate talkers in prior art

No.  Study                          # of Talkers  Talker Locations
1)   Cherry (1953)                  2             Non-spatial (left ear only, right ear only)
2)   Triesman (1964)                3             Non-spatial (left ear only, right ear only, both ears)
3)   Moray et al. (1964)            4             Non-spatial (L only; 2/3 L + 1/3 R; 1/3 L + 2/3 R; R only)
4)   Abouchacra et al. (1997)       3             −20, 0, 20 azimuth or −90, 0, 90 azimuth
5)   Spieth et al. (1954)           4             −90, −45, +45, +90 azimuth
6)   Drullman & Bronkhorst (2000)   4             −90, −45, 0, +45, +90
7)   Yost (1996)                    7 (3)         −90, −60, −30, 0, +30, +60, +90 azimuth
8)   Hawley et al. (1999)           7 (2-4)       −90, −60, −30, 0, +30, +60, +90 azimuth
9)   Crispien & Ehrenberg (1995)    4             −90 az, +60 el; −30 az, +20 el; −30 az, −20 el; −90 az, −60 el
10)  Nelson et al. (1998)           8 (2-8)       6: −90, −70, −31, +31, +70, +90
                                                  7: −90, −69, −45, 0, +45, +69, +90
                                                  8: −90, −69, −45, −11, +11, +45, +69, +90 azimuth
11)  Simpson et al. (1998)          8 (2-8)       7: −90, −69, −135, 0, +135, +69, +90
                                                  8: −90, −69, −135, −11, +11, +135, +69, +90 azimuth
12)  Ericson & McKinley (1997)      4             −135, −45, +45, +135 azimuth (w/ head tracking)
13)  Brungart & Simpson (2001)      2             90 degrees azimuth, 1 m; 90 degrees azimuth, 12 cm
Although a number of different systems have demonstrated the advantages of spatial filtering for multi-talker speech perception, very little effort has been made to systematically develop an optimal set of HRTF filters capable of maximizing the number of talkers a listener can simultaneously monitor while minimizing the amount of interference between the competing talkers in the system. Most systems that have used HRTF filters to spatially separate speech channels have placed the competing channels at roughly equally spaced intervals in azimuth in the listener's frontal plane. Table 1 provides examples of the spatial separations used in previous multi-talker speech displays. The first three entries in the table represent early systems that used stereo panning over headphones rather than head-related transfer functions to spatially separate the signals. This method has been shown to be very effective for the segregation of two talkers (where one talker is presented to the left earphone and the other to the right earphone), somewhat effective for the segregation of three talkers (where one talker is presented to the left ear, one talker is presented to the right ear, and one talker is presented to both ears), and only moderately effective for the segregation of four talkers (where one talker is presented to the left ear only, one talker is presented to the right ear only, one talker is presented more loudly in the left ear than in the right ear, and one talker is presented more loudly in the right ear than in the left ear). However, these panning methods have not been shown to be effective in multi-talker listening configurations with more than four talkers.
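The stereo panning configurations in the first three entries of Table 1 can be sketched as fixed left/right gain pairs applied to each talker before summing. The sketch below rests on two assumptions that the table does not settle: the fractions in entry 3 are treated as amplitude gains, and "both ears" in entry 2 is treated as unit gain in each ear. The names `PAN_GAINS` and `pan_mix` are hypothetical.

```python
# (left_gain, right_gain) for each talker, keyed by the number of talkers.
# Assumption: the fractions from Table 1, entry 3, are amplitude gains.
PAN_GAINS = {
    2: [(1.0, 0.0), (0.0, 1.0)],
    3: [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],  # assumption: "both ears" = unit gain
    4: [(1.0, 0.0), (2 / 3, 1 / 3), (1 / 3, 2 / 3), (0.0, 1.0)],
}


def pan_mix(talkers):
    """Mix 2, 3, or 4 mono talker signals into a (left, right) stereo pair.

    Raises KeyError for any other number of talkers, since Table 1 lists
    panning configurations only for these three cases.
    """
    gains = PAN_GAINS[len(talkers)]
    n = max(len(t) for t in talkers)
    left = [0.0] * n
    right = [0.0] * n
    for signal, (gain_l, gain_r) in zip(talkers, gains):
        for k, sample in enumerate(signal):
            left[k] += gain_l * sample
            right[k] += gain_r * sample
    return left, right


# Usage: with four talkers, the second one is weighted toward the left ear.
left, right = pan_mix([[0.0], [3.0], [0.0], [0.0]])
```

Because panning of this kind changes only the relative level at the two ears, it carries no interaural time difference, which is one way it differs from the HRTF-based approaches in the remaining table entries.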
The other entries in the table represent more recent implementations that either used loudspeakers to spatially separate the competing speech signals or used HRTFs that accurately reproduced the interaural time and intensity difference cues that occur when real sound sources are spatially separated around the listener's head. The majority of these implementations (entries 4-8 in Table 1) have used talker locations that were equally spaced in azimuth across the listener's frontal plane. One implementation (entry 9 in Table 1) has spatially separated the speech signals in elevation as well as azimuth, varying from +60 degrees elevation to −60 degrees elevation as the source location moves from left to right. Two implementations (entries 10 and 11 in Table 1) have used a location selection mechanism that selects talker locations in a procedure designed to maximize the difference in source midline distance (SML) between the different talkers in the stimulus.
Recently, a talker configuration has been proposed in which the target and masking talkers are located at different distances (12 cm and 1 m) at the same angle in azimuth (90 degrees) (entry 13 in Table 1). This spatial configuration has been shown to work well in situations with only two competing talkers, but not with more than two competing talkers.
No previous studies have objectively measured speech intelligibility as a function of the placement of the competing talkers. However, recent results have shown that equal spacing in azimuth cannot produce optimal performance in systems with more than five possible talker locations. Tests have also shown that the performance of a multi-talker speech display can be improved by carefully balancing the relative levels of the different speech signals in the stimulus. The present invention consists of optimal HRTF spatial configurations that have been carefully designed to maximize speech intelligibility in multi-talker speech displays, and a method of normalizing the relative levels of the different talkers in a multi-talker speech display that improves overall performance even in conventional multi-talker spatial configurations.