1. Technical Field
The present invention relates generally to 3D sound systems and, more particularly, to systems and methods for the efficient generation of Head Related Transfer Functions (HRTFs).
2. Related Art
3D sound, or spatial sound, is becoming increasingly common, e.g., in the generation of soundtracks for animated films and computer games. In order to understand 3D sound, it is important to distinguish it from monaural sound, stereo sound, and binaural sound. Monaural sound is sound that is recorded using one microphone. Because only one microphone is used, the listener does not receive any sense of sound positioning when listening to monaural sound.
Stereo sound is recorded with two microphones several feet apart, separated by empty space. When stereo sound is played back to a listener, the recording from one microphone goes in the left ear and the recording from the other microphone goes in the right ear. As a result of how the sound is recorded, i.e., two microphones separated by empty space, the listener often perceives that the sound is coming from a location within the listener's head. This is because humans do not normally hear sounds in the manner they are recorded in stereo audio recording and, therefore, the listener's head is acting as a filter to the incoming sound.
Binaural sound recordings, on the other hand, are more realistic from the human listener's point of view, because they are recorded in a manner that more closely resembles the human acoustic system. Binaural recordings are made with microphones embedded in a model human head. Such recordings yield sound that appears to be external to the listener's head, because the model head filters sound in a manner similar to a real human head.
3D sound takes the binaural approach one step further. 3D sound recordings are made with microphones in the ears of an actual person. These recordings are then compared with the original sounds to compute the person's HRTF. The HRTF is a linear function that is based on the sound source's position and takes into account many cues humans use to localize sounds. The HRTF is then used to develop coefficients for a Finite Impulse Response (FIR) filter pair (one for each ear) for each sound position within a particular sound environment. Thus, to place a sound at a certain position within a given sound environment, the set of FIR filters that corresponds to the position is applied to the incoming sound. This is how 3D or spatial sound is generated.
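The filtering step described above can be illustrated with a minimal sketch, assuming a monaural input and a direct-form FIR filter. The function names `fir_filter` and `spatialize` and the coefficient values are hypothetical placeholders, not measured HRTF data.

```python
def fir_filter(signal, coeffs):
    """Direct-form FIR filter: y[n] = sum over k of coeffs[k] * signal[n - k]."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:  # samples before the start of the signal are treated as zero
                acc += c * signal[n - k]
        out.append(acc)
    return out


def spatialize(mono, left_coeffs, right_coeffs):
    """Apply the FIR filter pair for one sound position, one filter per ear."""
    return fir_filter(mono, left_coeffs), fir_filter(mono, right_coeffs)


# Hypothetical coefficients: the right-ear filter starts one sample later and is
# weaker, a crude stand-in for a source located to the listener's left.
mono = [1.0, 0.0, 0.0, 0.0]
left, right = spatialize(mono, [0.5, 0.25], [0.0, 0.4, 0.2])
```

Because the input here is a unit impulse, each output is simply the corresponding filter's coefficients padded to the input length.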
To fully understand 3D sound generation, a more complete understanding of the HRTF is required. To accurately synthesize a sound source with all the physical cues and source localization that it encompasses, the sound pressure that the source produces at the eardrum must be found. Thus, the impulse response h(t) from the source to the eardrum must be found. Such an impulse response h(t) is referred to as the Head-Related-Impulse-Response (HRIR), the Fourier transform H(f) of which is the HRTF. Once the HRTFs for the left ear and the right ear are known, the 3D sound source can be synthesized accurately.
The HRTF is a complex function of three space coordinate variables and one frequency variable. In spherical coordinates, however, for distances greater than approximately one meter, the source is said to be in the far field. In the far field, HRTF measurements fall off inversely with range. Thus, for HRTF measurements made in the far field, the HRTF is essentially reduced to a function of azimuth, elevation, and frequency.
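The relationship between h(t) and H(f) can be sketched for sampled data with a discrete Fourier transform. The following is a minimal illustration, assuming a short hypothetical HRIR; the function name `dft` and the sample values are not taken from any measurement.

```python
import cmath


def dft(h):
    """Discrete Fourier transform: H[f] = sum over t of h[t] * exp(-2j*pi*f*t/N)."""
    n = len(h)
    return [
        sum(h[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


# Hypothetical HRIR: a unit impulse delayed by one sample.
hrir = [0.0, 1.0, 0.0, 0.0]
hrtf = dft(hrir)

magnitudes = [abs(value) for value in hrtf]       # flat magnitude response
phases = [cmath.phase(value) for value in hrtf]   # the one-sample delay appears only in the phase
```

In this toy case the delay leaves the magnitude unchanged at every frequency bin and shows up only in the phase, which is one reason the phase of the left and right ear responses matters for localization.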
Systems based on HRTFs are able to produce elevation and range effects as well as azimuth effects. Thus, such systems can create the impression of sound being at any desired 3D location within a given sound environment. This is done by filtering the sound source through a pair of filters corresponding to the HRTF pair, i.e., left and right ear HRTFs, for the given location. Therefore, in conventional HRTF systems, tables of filter coefficients are stored corresponding to HRTFs for different locations within the sound environment. The appropriate coefficients are then retrieved and applied to a pair of FIR filters through which an incoming sound is filtered before reaching the listener.
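The table lookup described above might be sketched as follows, assuming the table is keyed by (azimuth, elevation) in degrees and the nearest stored position is used. The table contents, the function names, and the simple distance measure are all hypothetical.

```python
def angular_distance(a, b):
    """Smallest absolute difference between two azimuths, allowing wrap-around at 360 degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def nearest_position(table, azimuth, elevation):
    """Return the stored (azimuth, elevation) key closest to the requested position."""
    return min(
        table,
        key=lambda pos: angular_distance(pos[0], azimuth) + abs(pos[1] - elevation),
    )


# Hypothetical table: position -> (left-ear coefficients, right-ear coefficients).
hrtf_table = {
    (0, 0): ([0.5, 0.2], [0.5, 0.2]),
    (30, 0): ([0.3, 0.1], [0.6, 0.3]),
    (-30, 0): ([0.6, 0.3], [0.3, 0.1]),
    (180, 0): ([0.2, 0.2], [0.2, 0.2]),
}

key = nearest_position(hrtf_table, azimuth=-170, elevation=5)
left_coeffs, right_coeffs = hrtf_table[key]
```

A request for azimuth −170° resolves to the entry stored at 180°, since the two directions are only 10° apart once wrap-around is taken into account.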
Several problems exist with such systems. For example, an infinite number of filter coefficients for an infinite number of HRTFs cannot feasibly be stored in 3D sound systems. Thus, a tradeoff must be made between the quality of the 3D sound and the number of coefficients used, i.e., the size of the FIR filters, as well as the number of HRTFs stored. Another problem relates to how the HRTFs are generated. Typically, the HRTFs will be generated from a sample group of individuals. Thus, a certain number of HRTF measurements will be made for the group. The HRTF measurements for the group will be converted into a certain number of coefficients. For example, raw data for each member of the group may be taken every 10° along the azimuth plane from 180° to −180° and along the elevation plane in 10° increments from 80° to −80°.
This raw data may need to be converted or reduced, however, for a given sound environment in a given 3D sound system. For example, a given 3D sound system may use filter mapping that extends from 180° to −180° using 30° increments in the azimuth plane and from 54° to −36° using 18° increments in the elevation plane. Such a filter mapping may be required, for example, due to the nature of the sound environment or due to system limitations, such as limited memory to store the filter maps.
Therefore, the problem presented is how to take HRTF measurements for y-number of people that result in x-coefficients, convert them into one filter set with z-coefficients, and have the set of z-coefficients be good enough to produce accurate, quality 3D sound for a general population. Present 3D sound systems build the ability to perform such conversions into the system itself by including complex signal processing. In fact, some systems include a separate dedicated DSP for performing the complex signal processing that is required. Unfortunately, not only does this drive up the cost of such systems, but the required signal processing also drives up the computational overhead of the system, resulting in an excessive amount of time to perform the required computations.
To reduce the amount of time and computational overhead required, some systems use data compression techniques. Such techniques, however, are inherently lossy and, therefore, result in poorer sound reproduction. In particular, the phase relationship between left and right ear signals can be greatly affected due to the lossy nature of compression techniques.
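The storage tradeoff can be made concrete by counting the positions in the example measurement grid. The sketch below assumes that azimuths 180° and −180° denote the same direction and uses a hypothetical filter length of 128 taps; neither assumption comes from the text above.

```python
def grid(start, stop, step):
    """Inclusive list of angles from `start` down to `stop` in `step`-degree decrements."""
    return list(range(start, stop - 1, -step))


azimuths = grid(180, -180, 10)    # 37 values, but 180 and -180 coincide
elevations = grid(80, -80, 10)    # 17 values

unique_azimuths = len(azimuths) - 1             # assumption: count 180/-180 once
positions = unique_azimuths * len(elevations)   # measurement positions per subject

TAPS = 128  # hypothetical FIR length per ear
coefficients = positions * 2 * TAPS             # two ears per position
```

Under these assumptions a single subject yields 612 positions and 156,672 coefficients, which illustrates why the size and number of stored filters must be limited.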
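One reason such a conversion is not trivial is that the example target elevations do not coincide with the 10° measurement grid, whereas the 30° azimuth steps do. The sketch below shows one naive way to bridge the gap, assuming linear blending between the two nearest measured elevations. It is an illustration only, not the conversion method of any particular system, and the names `bracket` and `blend` are hypothetical.

```python
def bracket(angle, step=10):
    """Return (lower, upper, weight): the measured angles around `angle` and the upper one's weight."""
    lower = (angle // step) * step
    weight = (angle - lower) / step
    return lower, lower + step, weight


def blend(lower_coeffs, upper_coeffs, weight):
    """Linear blend of two coefficient sets; weight 0.0 gives lower, 1.0 gives upper."""
    return [(1.0 - weight) * a + weight * b for a, b in zip(lower_coeffs, upper_coeffs)]


# Target elevations from the example mapping: 54 down to -36 in 18-degree steps.
target_elevations = list(range(54, -37, -18))
plan = {elevation: bracket(elevation) for elevation in target_elevations}

# Hypothetical two-tap coefficient sets measured at 50 and 60 degrees elevation.
lower, upper, weight = plan[54]
estimate = blend([1.0, 0.0], [0.0, 1.0], weight)
```

Here the 54° target falls between the 50° and 60° measurements with a weight of 0.4 on the upper one. Blending raw coefficients in this way ignores arrival-time differences between measurements, so it should be read as a simplification.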
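To make the y, x, and z quantities concrete, the sketch below shows a deliberately naive baseline: averaging the impulse responses of y subjects sample by sample and truncating the result from x to z coefficients. This is an assumption chosen for illustration, not the conversion performed by present systems or by the invention, and all names and values are hypothetical.

```python
def average_responses(responses):
    """Sample-by-sample mean of equal-length impulse responses, one per subject."""
    count = len(responses)
    return [sum(samples) / count for samples in zip(*responses)]


def truncate(coeffs, z):
    """Keep only the first z coefficients."""
    return coeffs[:z]


# y = 2 hypothetical subjects, x = 4 coefficients each, reduced to z = 2.
subjects = [
    [0.9, 0.4, 0.1, 0.0],
    [0.7, 0.6, 0.3, 0.2],
]
generic = truncate(average_responses(subjects), 2)
```

Even this trivial baseline discards the tail of each response and smears individual differences together, which hints at why producing a small filter set that still works for a general population requires more careful processing.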