There is growing interest in improving methods and systems for audio displays which can present audio signals conveying accurate impressions of three-dimensional soundfields. Such audio displays utilize techniques which model the transfer of acoustic energy in a soundfield from one point to another. A frequency-domain form of such models is referred to as an acoustic transfer function (ATF) and may be expressed as a function H(d,.theta.,.phi.,.omega.) of frequency .omega. and relative position (d,.theta.,.phi.) between two points, where (d,.theta.,.phi.) represents the relative position of the two points in polar coordinates. Other coordinate systems may be used.
Throughout the following discussion, more particular mention is made of various frequency-domain transfer functions; however, it should be understood that corresponding time-domain impulse response representations exist which may be expressed as a function of time t and relative position between points, or h(d,.theta.,.phi.,t). The principles and concepts discussed here are applicable to either domain.
An ATF may model the acoustical properties of a test subject. In particular, an ATF which models the acoustical properties of a human torso, head, ear pinna and ear canal is referred to as a head-related transfer function (HRTF). A HRTF describes, with respect to a given individual, the acoustic levels and phases which occur near the ear drum in response to a given soundfield. The HRTF is typically a function of both frequency and relative orientation between the head and the source of the soundfield. A HRTF in the form of a free-field transfer function (FFTF) expresses changes in level and phase relative to the levels and phase which would exist if the test subject were not in the soundfield; therefore, a HRTF in the form of a FFTF may be generalized as a transfer function of the form H(.theta.,.phi.,.omega.). The effects of distance can usually be simulated by amplitude attenuation proportional to the distance. In addition, high-frequency losses can be synthesized by various functions of distance. Throughout this discussion, the term HRTF and the like should be understood to refer to FFTF forms unless a contrary meaning is made clear by explanation or by context.
Many applications comprise acoustic displays utilizing one or more HRTF in attempting to "spatialize" or create a realistic three-dimensional aural impression. Acoustic displays can spatialize a sound by modelling the attenuation and delay of acoustic signals received at each ear as a function of frequency .omega. and apparent direction relative to head orientation (.theta.,.phi.). An impression that an acoustic signal originates from a particular relative direction (.theta.,.phi.) can be created in a binaural display by applying an appropriate HRTF to the acoustic signal, generating one signal for presentation to the left ear and a second signal for presentation to the right ear, each signal changed in a manner that results in the respective signal that would have been received at each ear had the signal actually originated from the desired relative direction.
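The binaural process described above can be sketched in the time domain, where applying a HRTF corresponds to convolving the signal with the impulse response for each ear. The following is a minimal illustration only. The function names `convolve` and `spatialize` are hypothetical, and the three-tap responses are placeholders chosen for readability, not measured data.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


def spatialize(signal, hrir_left, hrir_right):
    """Generate one signal per ear by applying that ear's impulse response."""
    return convolve(signal, hrir_left), convolve(signal, hrir_right)


# Placeholder responses (assumed, not measured): a source toward the
# listener's left reaches the right ear two samples later and at half level.
hrir_left = [1.0, 0.0, 0.0]
hrir_right = [0.0, 0.0, 0.5]

left, right = spatialize([1.0, -1.0], hrir_left, hrir_right)
```

With these placeholder responses the right-ear signal is a delayed, attenuated copy of the left-ear signal; measured responses would additionally shape the spectrum of each signal.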
Empirical evidence has shown that the human auditory system utilizes various cues to identify or "localize" the relative position of a sound source. The relationships between these cues and relative position are referred to here as listener "localization characteristics" and may be used to define HRTF. The differences in the amplitude and the time of arrival of soundwaves at the left and right ears, referred to as the interaural intensity difference (IID) and the interaural time difference (ITD), respectively, provide important cues for localizing the azimuth or horizontal direction of a source. Spectral shaping and attenuation of the soundwave provide important cues used to localize elevation or vertical direction of a source, and to identify whether a source is in front of or behind a listener.
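The two azimuth cues can be illustrated by a sketch which delays and attenuates the signal presented to the ear farther from the source. The function name `apply_itd_iid`, the expression of ITD as a whole number of samples, and the expression of IID in decibels are assumptions made for this illustration. Spectral cues for elevation and front/back discrimination are not modelled.

```python
def apply_itd_iid(signal, itd_samples, iid_db):
    """Return (near_ear, far_ear) signals of equal length.

    The far-ear signal is delayed by itd_samples (the ITD) and
    attenuated by iid_db decibels (the IID).
    """
    gain = 10.0 ** (-iid_db / 20.0)  # decibels to linear amplitude
    near = list(signal) + [0.0] * itd_samples
    far = [0.0] * itd_samples + [gain * x for x in signal]
    return near, far


# Assumed values: a two-sample ITD and a 20 dB IID.
near, far = apply_itd_iid([1.0, 0.5], itd_samples=2, iid_db=20.0)
```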
Although the type of cues used by nearly all listeners is similar, localization characteristics differ. The precise way in which a soundwave is altered varies considerably from one individual to another because of considerable variation in the size and shape of human torsos, heads and ear pinnae. Under ideal situations, the HRTF incorporated into an acoustic display is the personal HRTF of the actual listener because a universal HRTF for all individuals does not exist. Additional information regarding the suitability of shared HRTF may be obtained from Wightman, et al., "Multidimensional Scaling Analysis of Head-Related Transfer Functions," IEEE Workshop on Applications of Sig. Proc. to Audio and Acoust., October 1993.
In many practical systems, however, several HRTF known to work well with a variety of individuals are compiled into a library to achieve a degree of sharing. The most appropriate HRTF is selected for each listener. Additional information may be obtained from Wenzel, et al., "Localization Using Nonindividualized Head-Related Transfer Functions," J. Acoust. Soc. Am., vol. 94, July 1993, pp. 111-123.
The realism of an acoustic display can be enhanced by including ambient effects. One important ambient effect is caused by reflections. In most environments, a soundfield comprises soundwaves arriving at a particular point, say at an ear, along a direct path from the sound source and along paths reflecting off one or more surfaces of walls, floor, ceiling and other objects. A soundwave arriving after reflecting off one surface is referred to as a first-order reflection. The order of the reflection increases by one for each additional reflective surface along the path. The direction of arrival for a reflection is generally not the same as that of the direct-path soundwave and, because the propagation path of a reflected soundwave is longer than a direct-path soundwave, reflections arrive later. In addition, the amplitude and spectral content of a reflection will generally differ because of energy absorbing qualities of the reflective surfaces. The combination of high-order reflections produces the diffuse soundfields associated with reverberation.
A HRTF may be constructed to model ambient effects; however, more flexible displays utilize HRTF which model only the direct-path response and include ambient effects synthetically. The effects of a reflection, for example, may be synthesized by applying a direct-path HRTF of appropriate direction to a delayed and filtered version of the direct-path signal. The appropriate direction, which is the direction of arrival at the ear, may be established by tracing the propagation path of the reflected soundwave. The delay accounts for the reflective path being longer than the direct path. The filtering alters the amplitude and spectrum of the delayed soundwave to account for acoustical properties of reflective surfaces, air absorption, nonuniform source radiation patterns and other propagation effects. Thus, a HRTF is applied to synthesize each reflection included in the acoustic display.
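The synthesis of a single reflection for one ear may be sketched as a delay, a propagation filter and a direct-path HRTF applied in sequence, with the result mixed into the direct-path signal. All names here (`synthesize_reflection`, `mix`, `surface_filter`) and all numeric values are hypothetical; the one-tap responses stand in for real filters only to keep the arithmetic easy to follow.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


def synthesize_reflection(direct, delay_samples, surface_filter, hrir):
    """Delay the direct-path signal, filter it, then apply the HRTF
    (as an impulse response) for the reflection's direction of arrival."""
    delayed = [0.0] * delay_samples + list(direct)
    filtered = convolve(delayed, surface_filter)
    return convolve(filtered, hrir)


def mix(a, b):
    """Sum two signals, zero-padding the shorter one."""
    n = max(len(a), len(b))
    a = list(a) + [0.0] * (n - len(a))
    b = list(b) + [0.0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]


# Assumed values: a unit impulse source, a reflection arriving three
# samples late, a surface absorbing half the amplitude, and an HRTF for
# the reflection's direction that halves the level at this ear.
direct = [1.0]
direct_ear = convolve(direct, [1.0])
reflection_ear = synthesize_reflection(direct, 3, [0.5], [0.5])
ear = mix(direct_ear, reflection_ear)
```

Each additional reflection requires another pass through `synthesize_reflection` with its own delay, filter and direction, which is why the number of HRTF filters grows with the number of reflections.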
In many acoustic displays, HRTF are implemented as digital filters. Considerable computational resources are required to implement accurate HRTF because they are very complex functions of direction and frequency. The implementation cost of a high-quality display with accurate HRTF is roughly proportional to the complexity and number of filters used because the amount of computation required to implement the filters is significant as compared to the amount of computation required to perform all other functions. An efficient implementation of HRTF filters is needed to reduce implementation costs of high-quality acoustic displays. Efficiency is very important for practical displays of complex soundfields which include many reflections. The complexity is essentially doubled in binaural displays and increases further for multiple sources and/or multiple listeners.
The term "filter" and the like as used here refers to devices which perform an operation equivalent to convolving a time-domain signal with an impulse response. Similarly, the term "filtering" and the like as used here refers to processes which apply such a "filter" to a time-domain signal.
One technique used to increase the efficiency of spatializing late-arriving reflections is disclosed in U.S. Pat. No. 4,731,848. According to this technique, direct-path soundwaves and first-order reflections are processed in a manner similar to that discussed above. The diffuse soundwaves produced by higher-order reflections are synthesized by a reverberation network prior to spectral shaping and delays provided by "directionalizers."
Another technique used to increase the efficiency of spatializing early reflections is disclosed in U.S. Pat. No. 4,817,149. According to this technique, three separate processes are used to spatialize the direct-path soundwave, early reflections and late reflections. The direct-path soundwave is spatialized by providing front/back and elevation cues through spectral shaping, and is spatialized in azimuth by including either ITD or IID. The early reflections are spatialized by propagation delays and azimuth cues, either ITD or IID, and are spectrally shaped as a group to provide "focus" or a sense of spaciousness. The late reflections are spatialized in a manner similar to that done for early reflections except that reverberation and randomized azimuth cues are used to synthesize a more diffuse soundfield.
These techniques improve the efficiency of spatializing reflections, but they do not improve the efficiency of spatializing a direct-path soundwave, nor do they provide a way to more efficiently spatialize binaural displays, spatialize multiple sources or present a spatialized display to multiple listeners.
A technique used to more efficiently spatialize an audio signal is implemented in the UltraSound.TM. multimedia sound card by Advanced Gravis Computer Technology Ltd., Burnaby, British Columbia, Canada. According to this technique, an initial process records several prefiltered versions of an audio signal. The prefiltered signals are obtained by applying HRTF representing several positions, say four horizontal positions spaced apart by 90 degrees and one or two positions of specified elevation. Spatialization is accomplished by mixing the prefiltered signals. In effect, spatialization is accomplished by panning between fixed sound sources. The spatialization process is fairly efficient and has an intuitive appeal; however, it does not provide very good spatialization unless a fairly large number of prefiltered signals is used. This is because each of the prefiltered signals includes ITD, and a soundwave appearing to originate from an intermediate point cannot be reasonably approximated by a mix of prefiltered signals unless the signals represent directions fairly close to one another. Limited storage capacity usually restricts the number of prefiltered signals which can be stored. In addition, the technique imposes a rather serious disadvantage in that neither the HRTF nor the audio source can be changed without rerecording the prefiltered signals. This technique is described briefly in Begault, "3-D Sound for Virtual Reality and Multimedia," Academic Press, Inc., 1994, p. 210.
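The limitation caused by ITD in the prefiltered signals can be seen in a small sketch of panning between fixed sources. This is not the sound card's actual implementation; the function name `pan_prefiltered`, the linear mixing weights and the single-ear impulse signals are assumptions made for illustration.

```python
def pan_prefiltered(prefiltered, azimuth_deg, spacing=90):
    """Mix the two prefiltered signals whose directions bracket azimuth_deg.

    prefiltered maps each stored azimuth in degrees (0, 90, 180, 270)
    to an equal-length list of samples for one ear. Linear mixing
    weights are assumed.
    """
    lower = int(azimuth_deg // spacing) * spacing % 360
    upper = (lower + spacing) % 360
    frac = (azimuth_deg % spacing) / spacing
    return [(1.0 - frac) * a + frac * b
            for a, b in zip(prefiltered[lower], prefiltered[upper])]


# Assumed single-ear signals: an impulse whose arrival time depends on
# the stored direction, standing in for the ITD in each prefiltered signal.
prefiltered = {
    0: [1.0, 0.0, 0.0, 0.0],
    90: [0.0, 0.0, 1.0, 0.0],
    180: [1.0, 0.0, 0.0, 0.0],
    270: [0.0, 0.0, 1.0, 0.0],
}

mixed = pan_prefiltered(prefiltered, 45.0)
```

For the intermediate direction of 45 degrees, the mix contains two half-amplitude impulses at the stored arrival times rather than one impulse at an intermediate arrival time, which is why widely spaced prefiltered signals approximate intermediate directions poorly.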
As explained above, accurate HRTF are expensive to implement because they are complex functions of direction and frequency. Research discussed in Martens, "Principal Components Analysis and Resynthesis of Spectral Cues to Perceived Direction," ICMC Proceedings, 1987, pp. 274-281, and in Kistler, et al., "A Model of Head-Related Transfer Functions Based on Principal Components Analysis and Minimum-Phase Reconstruction," J. Acoust. Soc. Am., March 1992, pp. 1637-1647, used principal component analysis to develop the concept that HRTF can be approximated fairly well by a small number of fixed-frequency-response basis functions. In particular, Kistler, et al. showed that as few as five log-magnitude basis functions could reasonably represent a direction-dependent portion of HRTF responses, referred to as directional transfer functions (DTF), for each ear of ten different test subjects. Direction-independent aspects such as ear canal resonance were excluded from the principal component analysis. Phase responses of the HRTF were approximated by ITD which were assumed to be frequency independent.
Kistler, et al. showed that binaural HRTF for a particular individual and specified direction can be approximated by scaling the log-magnitude basis functions with a set of weights, combining the scaled functions to obtain composite log-magnitude response functions representing DTF for each ear, deriving two minimum phase filters from the log-magnitude response functions, adding excluded direction-independent characteristics such as ear canal resonance to derive HRTF representations from the DTF representations, and calculating a delay for ITD to simulate phase response. Unfortunately, these basis functions do not provide for any improvement in implementation efficiency of HRTF. In addition, Kistler, et al. concluded that the principal component weights for the five basis functions were very complex functions of direction and could not be easily modeled.
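The first two steps of this reconstruction, scaling the log-magnitude basis functions and combining them into a composite response, can be sketched as a weighted sum per frequency bin. The sketch assumes the basis functions are expressed in decibels on a common frequency grid; the function names and the two three-bin basis functions are hypothetical and are not taken from the cited work. Derivation of the minimum phase filters, restoration of direction-independent characteristics and calculation of the ITD delay are omitted.

```python
def composite_log_magnitude(basis_functions, weights):
    """Weighted sum of log-magnitude basis functions, bin by bin."""
    n_bins = len(basis_functions[0])
    return [sum(w * b[k] for w, b in zip(weights, basis_functions))
            for k in range(n_bins)]


def to_linear_magnitude(log_magnitude_db):
    """Convert a log-magnitude response in decibels to linear amplitude."""
    return [10.0 ** (v / 20.0) for v in log_magnitude_db]


# Assumed values: two basis functions over three frequency bins and the
# weights for one ear and one direction.
basis = [[0.0, 6.0, -6.0],
         [20.0, 0.0, 0.0]]
weights = [0.5, -1.0]

dtf_db = composite_log_magnitude(basis, weights)
dtf_magnitude = to_linear_magnitude(dtf_db)
```

Because the sum is formed in the log-magnitude domain, the composite response must still be converted to a filter for every new direction, which is consistent with the observation above that these basis functions do not by themselves reduce implementation cost.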
There remains a need for a method to efficiently implement accurate HRTF, particularly for acoustic displays which spatialize multiple sources and/or generate unique displays for multiple listeners.