The invention relates to rapidly and intuitively conveying accurate information about the spatial location of a simulated sound source to a listener over headphones through the use of enhanced head-related transfer functions (HRTFs).
HRTFs are digital audio filters that reproduce the direction-dependent changes that occur in the magnitude and phase spectra of the auditory signals reaching the left and right ears when the location of the sound source changes relative to the listener.
Head-related transfer functions (HRTFs) can be a valuable tool for adding realistic spatial attributes to arbitrary sounds presented over stereo headphones. However, in the past, HRTF-based virtual audio displays have rarely been able to reach the same level of localization accuracy that would be expected for listeners attending to real sound sources in the free field.
The present invention provides a novel HRTF enhancement technique that systematically increases the salience of the direction-dependent spectral cues that listeners use to determine the elevations of sound sources. The technique is shown to produce substantial improvements in localization accuracy in the vertical-polar dimension for individualized and non-individualized HRTFs, without negatively impacting performance in the left-right localization dimension.
The present invention produces a sound over headphones that appears to originate from a specific spatial location relative to the listener's head. One example of an application domain where this capability might be useful is in an aircraft cockpit display, where it might be desirable to produce a threat warning tone that appears to originate from the location of the threat relative to the location of the pilot. Since the 1970s, audio researchers have known that the apparent location of a simulated sound can be manipulated by applying a linear transformation known as the Head-Related Transfer Function (HRTF) to the sound prior to its presentation to the listener over headphones. In effect, the HRTF processing technique works by reproducing the interaural differences in time and intensity that listeners use to determine the left-right positions of sound sources and the pinna-based spectral shaping cues that listeners use for determining the up-down and front-back locations of sounds in the free field.
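The linear transformation described above can be sketched as a pair of convolutions, one per ear. This is a minimal illustration only: the function name `render_binaural` and the two impulse responses are hypothetical stand-ins, not measured HRTFs, and the sketch assumes the HRTF is available in its time-domain form (a head-related impulse response for each ear).

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Filter a mono signal with a left/right pair of head-related impulse responses."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=0)  # row 0 = left ear, row 1 = right ear

# Made-up impulse responses (NOT measured data): the right ear receives the
# sound 3 samples later and at half the amplitude, mimicking the interaural
# time and level differences of a source on the listener's left.
hrir_left = np.array([1.0, 0.0, 0.0, 0.0])
hrir_right = np.array([0.0, 0.0, 0.0, 0.5])

click = np.array([1.0, 0.0, 0.0])
stereo = render_binaural(click, hrir_left, hrir_right)
```

A measured HRTF would additionally carry the pinna-based spectral shaping, which this toy pair of delayed impulses does not attempt to model.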
If the HRTF measurement and reproduction techniques are properly implemented, then it may be possible to produce virtual sounds over headphones that are completely indistinguishable from sounds generated by a real loudspeaker at the location where the HRTF measurement was made. Indeed, this level of real-virtual equivalence has been demonstrated in at least two experiments where listeners were unable to reliably distinguish between sequentially presented real and virtual sounds. However, demonstrations of this level of virtual sound fidelity have been limited to carefully controlled laboratory environments where the HRTF has been measured with the same headphone that is used for reproduction and the listener's head has been held completely fixed from the time the HRTF measurement was made to the time the virtual stimulus was presented to the listener.
In practical virtual audio display systems that allow listeners to make exploratory head movements while wearing removable headphones, it has historically been very difficult to achieve a level of localization performance that is comparable to free-field listening. Listeners are generally able to determine the lateral locations of virtual sounds because these left-right determinations are based on interaural time delays (ITDs) and interaural level differences (ILDs) that are relatively robust across a wide range of listening conditions. However, listeners generally have extreme difficulty distinguishing between virtual sound locations that lie within a “cone of confusion.” FIG. 1 shows a cone of confusion 20 where all of the possible source locations are located at the same angle β from the listener's interaural x-y-z axis 22 and thus produce roughly the same ILD and ITD cues. Within this cone-shaped region, localization judgments have to be made solely on the basis of spectral cues generated by the direction-dependent filtering characteristics of the listener's external ear. If these spectral cues are not reproduced exactly by the virtual audio display system, the result can be extremely poor localization performance in elevation. In cases where the stimulus is not on long enough to allow the listener to make exploratory head movements, the result can also be a large number of front-back confusions, as disclosed in Wallach, H. (1940), “The role of head movements and vestibular and visual cues in sound localization,” Journal of Experimental Psychology, 27, 339-368 (this and all other references are herein incorporated by reference).
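The geometry of the cone of confusion can be illustrated with a short calculation. The sketch below assumes a coordinate convention that is not stated in the text: azimuth is measured from straight ahead toward the listener's right, elevation is measured up from the horizontal plane, and the interaural axis points toward the right ear. The function name is hypothetical. Under these assumptions, a source in front, its mirror image behind, and a source overhead can all lie at the same angle β from the interaural axis.

```python
import math

def angle_from_interaural_axis(azimuth_deg, elevation_deg):
    """Angle (degrees) between a source direction and the interaural axis."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    # Component of the unit direction vector along the interaural axis.
    y = math.cos(el) * math.sin(az)
    y = max(-1.0, min(1.0, y))  # guard against rounding just outside [-1, 1]
    return math.degrees(math.acos(y))

beta_front = angle_from_interaural_axis(30.0, 0.0)   # front right, ear level
beta_back = angle_from_interaural_axis(150.0, 0.0)   # mirror image behind
beta_above = angle_from_interaural_axis(90.0, 60.0)  # elevated, to the right
```

All three directions evaluate to approximately 60 degrees, so they fall on the same cone and would be expected to produce roughly the same ITD and ILD cues.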
At least three factors conspire to make it very difficult to produce the level of spectral fidelity required to allow virtual sounds located within a cone of confusion to be localized as accurately as free-field sounds. The first relates to variability in frequency response that occurs across different fittings of the same set of stereo headphones on a listener's head. In most practical headphone designs, the variations in frequency response that occur when a headphone is removed and replaced on a listener's head are comparable in magnitude to the variations in frequency response that occur in the HRTF when a sound source changes location within a cone of confusion. This means that in most applications of spatial audio, free-field equivalent elevation performance can only be achieved in laboratory settings where the headphones are never removed from the listener's head between the time when the HRTF measurement is made and the time the headphones are used to reproduce the simulated spatial sound.
In the controlled laboratory setting used by Kulkarni, A., Isabelle, Colburn, H. (1999), “Sensitivity of human subjects to head-related transfer function phase spectra,” Journal of the Acoustical Society of America, 105(5), 2821-2840, it was possible to place the headphones on the listener's head, use probe microphones inserted in the ears to measure the frequency response of the headphones, create a digital filter to invert that frequency response, and use that digital filter to reproduce virtual sounds without ever removing the headphones. This precise level of headphone correction is unachievable in real-world applications of spatial audio, particularly where display designers must account for the fact that the headphones will be removed and replaced prior to each use of the system. This can introduce a substantial amount of spectral variability into the HRTF.
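The headphone correction step described above (measure the headphone response, then build a filter that inverts it) might be sketched as follows. This is not the procedure of the cited study: the regularized inversion, the function name `headphone_correction`, the constant `reg`, and the two-tap impulse response are all assumptions chosen for illustration.

```python
import numpy as np

def headphone_correction(headphone_ir, n_fft=256, reg=1e-3):
    """Return (H, H_inv): the headphone response and a regularized inverse of it."""
    H = np.fft.rfft(headphone_ir, n_fft)
    # conj(H) / (|H|^2 + reg) approaches 1/H where |H| is large and stays
    # bounded where |H| is near zero; reg is a hypothetical tuning constant.
    H_inv = np.conj(H) / (np.abs(H) ** 2 + reg)
    return H, H_inv

# Made-up two-tap response standing in for a probe-microphone measurement.
measured_ir = np.array([1.0, 0.5])
H, H_inv = headphone_correction(measured_ir)
corrected = H * H_inv  # close to 1 at every frequency bin
```

The correction is only valid for the fitting at which `measured_ir` was captured, which is why removing and replacing the headphones undermines it.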
Another factor that can lead to reduced localization accuracy in practical spatial audio systems is the need to use interpolation to obtain HRTFs for locations where no actual HRTF has been measured. Most studies of auditory localization accuracy with virtual sounds have used fixed impulse responses measured at discrete sound locations to do the virtual synthesis. However, most practical spatial audio systems use some form of real-time head-tracking, which requires the interpolation of HRTFs between measured source locations. A number of different interpolation schemes have been developed for HRTFs, but whenever it becomes necessary to use interpolation techniques to infer information about missing HRTF locations, there is some possibility for a reduction in fidelity in the virtual simulation.
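One simple interpolation scheme, chosen here only for illustration and not identified in the text as the scheme used by any particular system, is a linear blend of the log-magnitude spectra measured at two neighboring azimuths. The function name and the four-bin spectra below are hypothetical. The example also hints at how fidelity can be lost: a single deep notch that shifts in frequency between the two measurements is rendered as two shallower notches at the midpoint.

```python
import numpy as np

def interpolate_hrtf_db(az, az_a, mag_a_db, az_b, mag_b_db):
    """Linearly blend two measured log-magnitude spectra for an azimuth between them."""
    if az_a == az_b or not (az_a <= az <= az_b):
        raise ValueError("az must lie between two distinct measured azimuths")
    w = (az - az_a) / (az_b - az_a)  # 0 at az_a, 1 at az_b
    return (1.0 - w) * np.asarray(mag_a_db, dtype=float) + w * np.asarray(mag_b_db, dtype=float)

# Made-up spectra (dB per frequency bin): a 20 dB notch moves from bin 1 to bin 2.
mag_at_0_deg = [0.0, -20.0, 0.0, 0.0]
mag_at_10_deg = [0.0, 0.0, -20.0, 0.0]

mag_at_5_deg = interpolate_hrtf_db(5.0, 0.0, mag_at_0_deg, 10.0, mag_at_10_deg)
```

At 5 degrees the blend yields two 10 dB notches rather than one 20 dB notch at an intermediate frequency.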
A final factor that has an extremely detrimental impact on localization accuracy in practical spatial audio systems is the requirement to use individualized HRTFs in order to achieve optimum localization accuracy. The physical geometry of the external ear or pinna varies across listeners, and as a direct consequence there are substantial differences in the direction-dependent high-frequency spectral cues that listeners use to localize sounds within a “cone of confusion.” When a listener uses a spatial audio system that is based on HRTFs measured on someone else's ears, substantial increases in localization error can occur.
These complicating factors make it very difficult to produce a virtual audio system with directly measured HRTFs capable of producing a high level of localization performance across a broad range of users. Consequently, a number of researchers have developed various methodologies for “enhancing” the measured HRTFs in order to improve localization performance.
Many of these enhancement methodologies involve “individualization” techniques designed to bridge the gap between the relatively high level of performance typically seen with individualized HRTF rendering and the relatively poor level of performance that is typically seen with non-individualized HRTFs. One of the earliest examples of such a system provided listeners with the ability to manually adjust the gain of the HRTF in different frequency bands to achieve a higher level of spatial fidelity.
While there is evidence that these customization techniques can improve localization performance, they still require some modification of the HRTF to match the characteristics of the individual listener. There are many applications where this approach is not practical, and the designer will need to assume that all users of the system will be listening to the same set of unmodified non-individualized HRTFs. To this point, only a few techniques have been proposed that are designed to improve localization performance on a fixed set of HRTFs for an arbitrary listener.
One approach to solving this problem is to attempt to select the set of non-individualized HRTFs that will produce the best overall localization results across the broadest range of potential users. This approach is described in U.S. Pat. No. 6,188,875 (Moller et al.). It requires the measurement of HRTFs from a large number of listeners and the manual selection of the particular set of HRTFs for which the differences between the gains, in the frequency domain, from one human to another are very low.
Another approach is to actually modify the spectral characteristics of an HRTF in an attempt to obtain better localization performance. Gupta, N., Barreto, A., & Ordonez, C. (2002), “Spectral modification of head-related transfer functions for improved virtual sound spatialization,” Vol. 2, pp. 1953-1956, proposed a technique that modifies the spectrum of the HRTF in an attempt to recreate the effect of increasing the protrusion angle of the listener's ear. This technique essentially increases the gain of the HRTF at low frequencies for sources in the front hemisphere, and decreases the gain of the HRTF at high frequencies for sources in the rear hemisphere. The authors reported substantial reductions in front-back confusions for the localization of non-individualized virtual sounds in the horizontal plane. However, this approach failed to provide the level of precise localization in spatial audio systems provided with the present invention.
Koo, K. & Cha, H. (2008), “Enhancement of 3D Sound using Psychoacoustics,” Vol. 27, pp. 162-166, have recently proposed another method that uses spectral modification to reduce the confusability of two virtual sounds, such as two points located at mirror image locations across the frontal plane that would ordinarily be highly likely to result in a front-back confusion. Their method appears to take the spectral difference between the HRTFs for the two confusable locations and add this difference to the HRTF at the first location, thereby increasing the magnitude of the spectral difference between the HRTFs of the two locations by a factor of two. They did not test localization with this technique, but they do report modest improvements in mean opinion score.
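The difference-doubling operation described above can be sketched in a few lines. The sketch assumes that the spectra are log-magnitude values in decibels, which the text does not specify, and the function name and three-bin spectra are hypothetical. It reflects the description given here, not a verified implementation of the cited method.

```python
import numpy as np

def exaggerate_difference_db(h_first_db, h_second_db):
    """Add the spectral difference (first minus second) back onto the first HRTF."""
    h_first_db = np.asarray(h_first_db, dtype=float)
    h_second_db = np.asarray(h_second_db, dtype=float)
    return h_first_db + (h_first_db - h_second_db)

# Made-up log-magnitude spectra (dB) for two mirror-image locations.
front_db = np.array([0.0, 3.0, -5.0])
back_db = np.array([0.0, 1.0, -9.0])

enhanced_front_db = exaggerate_difference_db(front_db, back_db)
original_gap = front_db - back_db
enhanced_gap = enhanced_front_db - back_db
```

With these values the gap between the two locations grows from [0, 2, 4] dB to [0, 4, 8] dB, which is the factor-of-two increase described in the text.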
These two techniques in the prior art claim to have some success in helping to resolve front-back confusions for sounds located in the horizontal plane. However, neither of these techniques makes any claim to improve elevation localization accuracy for sounds located above and below the horizontal plane. The proposed invention differs from these techniques in that it provides a way to reliably enhance auditory localization accuracy in elevation for sounds located at any desired location, in both azimuth and elevation directions, relative to the listener.
The Head Related Transfer Function (HRTF) Enhancement for Improved Vertical-Polar Localization in Spatial Audio System described herein has numerous advantages over the existing techniques in the prior art for addressing this problem, including faster response time, fewer chances for human interpretation error, and compatibility with existing auditory hardware.