Head-related transfer functions (HRTFs) are digital audio filters that reproduce the direction-dependent changes that occur in the magnitude and phase spectra of an auditory signal reaching the left and right ears when the location of the sound source changes relative to the listener. HRTFs can be a valuable tool for adding realistic spatial attributes to arbitrary sounds presented over stereo headphones. However, conventional HRTF-based virtual audio systems have rarely been able to reach the level of localization accuracy that would be expected for listeners attending to real sound sources in the free field.
Since the 1970s, audio researchers have known that the apparent location of a simulated sound can be manipulated by applying a linear transformation, the HRTF, to the sound prior to its presentation to the listener over headphones. In effect, the HRTF processing technique works by reproducing the interaural differences in time and intensity that listeners use to determine the left-right positions of sound sources and the pinna-based spectral shaping cues that listeners use for determining the up-down and front-back locations of sounds in the free field.
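As a minimal sketch of this linear transformation, assuming the HRTF for one source direction is available as a pair of time-domain impulse responses (one per ear), headphone rendering amounts to convolving the mono source with each response. The names `convolve`, `render_binaural`, `HRIR_LEFT`, and `HRIR_RIGHT` are hypothetical, and the impulse-response values are toy numbers chosen only to show an interaural delay and level difference; they are not measured data.

```python
def convolve(signal, impulse_response):
    """Full-length linear convolution, written out directly for clarity."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Filter one mono signal with a left/right impulse-response pair."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy impulse responses (illustrative assumption, not measured HRTFs):
# the right ear receives the sound 3 samples later and at half the level,
# as it would for a source located to the listener's left.
HRIR_LEFT = [1.0, 0.0, 0.0, 0.0]
HRIR_RIGHT = [0.0, 0.0, 0.0, 0.5]

mono = [1.0, -1.0, 0.5]
left, right = render_binaural(mono, HRIR_LEFT, HRIR_RIGHT)
```

Measured responses also carry the pinna-based spectral shaping, which appears as many non-zero filter taps rather than the single delayed tap used here.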
If the HRTF measurement and reproduction techniques are properly implemented, then it may be possible to produce virtual sounds over headphones that are completely indistinguishable from sounds generated by a real loudspeaker at the location where the HRTF measurement was made. Indeed, this level of real-virtual equivalence has been demonstrated in experiments where listeners were unable to reliably distinguish between sequentially presented real and virtual sounds. However, demonstrations of this level of virtual sound fidelity have been limited to carefully controlled laboratory environments where the HRTF has been measured with the headphone used for reproducing the HRTF, and the listener's head was fixed from the time the HRTF measurement was made to the time the virtual stimulus was presented to the listener.
Virtual audio display systems allow listeners to make exploratory head movements while wearing removable headphones; however, it has historically been very difficult to achieve a level of localization performance that is comparable to free-field listening. Listeners are generally able to determine lateral locations of virtual sounds because these left-right determinations are based on interaural time delays (ITDs) and interaural level differences (ILDs) that are relatively robust across a wide range of listening conditions. However, listeners generally have extreme difficulty distinguishing between virtual sound locations that lie within a so-called “cone-of-confusion.” FIG. 1 illustrates such a conventional cone of confusion 10 where all possible source locations that produce roughly the same ILD and ITD cues are positioned at an angle, β, from an interaural x-y-z axis 12. Within this cone 10, localization judgments have to be made solely on the basis of spectral cues generated by the direction-dependent filtering characteristics of the listener's external ear. If spectral cues are not precisely reproduced by the virtual audio display system, then poor localization performance in elevation may result.
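The geometry of the cone of confusion can be illustrated with a deliberately simplified model. The sketch below assumes a far-field source and two point receivers separated along the interaural axis (taken here as the y axis), so that the ITD depends only on the path-length difference; it ignores head diffraction and the pinna entirely. The constants `EAR_SPACING` and `SPEED_OF_SOUND` and the function names are assumptions introduced for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, nominal value in air (assumption)
EAR_SPACING = 0.18      # m, assumed distance between the two receivers


def direction_on_cone(beta, phi):
    """Unit vector at angle beta from the interaural (y) axis,
    rotated by phi around that axis."""
    return (
        math.sin(beta) * math.cos(phi),
        math.cos(beta),
        math.sin(beta) * math.sin(phi),
    )


def itd_seconds(direction):
    """Far-field ITD: path difference along the interaural axis / c."""
    _, y, _ = direction
    return EAR_SPACING * y / SPEED_OF_SOUND


beta = math.radians(60.0)
# Four positions on the same cone: different rotations about the axis.
itds = [
    itd_seconds(direction_on_cone(beta, math.radians(phi)))
    for phi in (0.0, 90.0, 180.0, 270.0)
]
```

In this model every entry of `itds` is identical because the component along the interaural axis is cos β regardless of the rotation angle, which is why the interaural cues alone cannot separate positions around the cone and the listener must rely on spectral cues.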
There are at least three factors that contribute to the difficulty in producing a level of spectral fidelity sufficient to allow virtual sounds located within the cone of confusion 10 to be localized as accurately as free-field sounds. One such factor relates to the variability in frequency response that occurs across different fittings of the same set of stereo headphones on a listener's head. In most practical headphone designs, the variations in frequency response that occur when headphones are removed and replaced on a listener's head are comparable in magnitude to the variations in frequency response that occur in the HRTF when a sound source changes location within the cone of confusion 10. This means that in most applications of spatial audio, free-field equivalent elevation performance can only be achieved in laboratory settings where the headphones are never removed from the listener's head between the time when the HRTF measurement is made and the time the headphones are used to reproduce the simulated spatial sound.
In a controlled laboratory setting used by KULKARNI, it was possible to place the headphones on the listener's head, use probe microphones inserted into the ears to measure the frequency response of the headphones, create a digital filter to invert that frequency response, and use that digital filter to reproduce virtual sounds without ever removing the headphones (KULKARNI, A. et al., “Sensitivity of human subjects to head-related transfer function phase spectra,” Journal of the Acoustical Society of America, Vol. 105 (1999) 2821-2840, the disclosure of which is incorporated herein by reference, in its entirety). This precise level of headphone correction is unachievable in real-world applications of spatial audio, particularly where display designers must account for the fact that the headphones will be removed and replaced prior to each use of the system.
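The inversion step, and the reason it fails once the headphones are re-seated, can be sketched as follows. This is not the procedure of the cited study; it is a generic regularized per-frequency-bin inversion, and the response values, the `eps` regularization constant, and all names are hypothetical.

```python
def regularized_inverse(response, eps=1e-6):
    """Per-bin inverse filter: conj(H) / (|H|^2 + eps).
    eps keeps the gain bounded where |H| is very small."""
    return [h.conjugate() / (abs(h) ** 2 + eps) for h in response]


# Hypothetical headphone responses at four frequency bins (not real data).
measured = [1.0 + 0.0j, 0.8 - 0.2j, 0.5 + 0.5j, 1.2 + 0.1j]
after_refit = [1.0 + 0.0j, 0.9 - 0.1j, 0.3 + 0.6j, 1.0 + 0.3j]

inverse = regularized_inverse(measured)

# Same fitting as the measurement: the product is close to 1 in every bin.
equalized = [h * g for h, g in zip(measured, inverse)]

# Headphones removed and replaced: the old inverse no longer cancels
# the response, leaving a residual spectral error.
mismatched = [h * g for h, g in zip(after_refit, inverse)]
residual = max(abs(m - 1.0) for m in mismatched)
```

With these made-up numbers the residual after re-fitting is well above the near-zero error of the matched case, mirroring the point that a correction filter is only valid for the fitting on which it was measured.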
Another factor that can lead to reduced localization accuracy in conventional spatial audio systems is the use of interpolation to obtain HRTFs for locations at which no actual HRTF has been measured. Most studies of auditory localization accuracy with virtual sounds have used fixed impulse responses measured at discrete sound locations to perform virtual synthesis. However, most practical spatial audio systems use some form of real-time head-tracking, which requires an interpolation of HRTFs between measured source locations. A number of different interpolation schemes have been developed for HRTFs, but whenever it becomes necessary to use interpolation techniques to infer information about missing HRTF locations, there is some possibility of a reduction in fidelity in the virtual simulation.
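One of the simplest possible schemes, sample-by-sample linear interpolation between the two nearest measured impulse responses, shows how fidelity can be lost. The sketch below is only an illustration of that one scheme, not a description of any particular system; the function name `interpolate_hrir` and the toy impulse responses are assumptions.

```python
def interpolate_hrir(azimuth, az_a, hrir_a, az_b, hrir_b):
    """Linearly blend two equal-length impulse responses measured at
    azimuths az_a and az_b to approximate the response at `azimuth`."""
    weight = (azimuth - az_a) / (az_b - az_a)
    return [(1.0 - weight) * a + weight * b for a, b in zip(hrir_a, hrir_b)]


# Toy responses (not measured data): a single arrival whose delay grows
# from 0 samples at 0 degrees to 2 samples at 30 degrees.
hrir_at_0 = [1.0, 0.0, 0.0]
hrir_at_30 = [0.0, 0.0, 1.0]

estimate_at_15 = interpolate_hrir(15.0, 0.0, hrir_at_0, 30.0, hrir_at_30)
# Result: [0.5, 0.0, 0.5]
```

A plausible response at the intermediate direction would be a single arrival near a delay of one sample, but the blend instead produces two half-amplitude arrivals, which acts as a comb filter and distorts the spectrum. More elaborate schemes reduce this kind of error, but the example shows why any inferred HRTF carries some risk of reduced fidelity.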
Another factor that has a detrimental impact on localization accuracy in conventional spatial audio systems is the need for individualized HRTFs in order to achieve optimum localization accuracy. The physical geometry of the external ear (or pinna) varies between listeners and, as a direct consequence, there are substantial differences in the direction-dependent high-frequency spectral cues that listeners use to localize sounds within the cone-of-confusion 10. When a listener uses a spatial audio system that is based on HRTFs measured from another listener's ears, substantial increases in localization error can occur.
Conventional attempts to overcome these factors have included enhancement methodologies, such as individualization techniques, that are designed to bridge the gap between the relatively high level of performance typically seen with individualized HRTF rendering and the relatively poor level of performance that is typically seen with non-individualized HRTFs. An early example of such a system provided listeners with the ability to manually adjust the gain of the HRTF in different frequency bands to achieve a higher level of spatial fidelity. Further, conventional HRTF enhancement algorithms have focused on improving performance for non-individualized HRTFs and have not been shown to improve performance for individualized HRTFs.
While there is evidence that these customization techniques can improve localization performance, additional modification to the HRTF is necessary to match the characteristics of the individual listener. Still, many applications exist in which this approach is not practical and the designer will need to assume that all users of the system will be listening to the same set of unmodified non-individualized HRTFs. To this point, only a few techniques have been proposed that are designed to improve localization performance on a fixed set of HRTFs for an arbitrary listener.