Historically, virtual audio displays have focused primarily on controlling the apparent direction of sound sources. This has been achieved by processing the sound with direction-dependent digital filters, called Head-Related Transfer Functions (HRTFs), that reproduce the acoustic transformations that occur when a sound propagates from a distant source to the listener's left and right ears. The resulting processed sounds are presented to the listener over stereo headphones and appear to originate from the direction, relative to the listener's head, that corresponds to the location of the sound source during the HRTF measurement.
Only a few virtual audio display systems have attempted to control the apparent distances of sounds, all with limited success. In part, this is directly related to the lack of salient auditory distance cues in the free field. The binaural and spectral cues that listeners use to determine the directions of sound sources, which are captured by the HRTF and exploited by directional virtual audio displays, provide essentially no information about the distances of sound sources. Only when the sound source is within 1 m of the head are there any significant distance-dependent changes in the anechoic HRTF. Consequently, virtual audio displays are forced to rely on much less robust monaural cues to manipulate the apparent distances of sounds. Two types of monaural distance cues have been used in previous virtual audio displays. The first of these cues is based on intensity. In the free field, the overall level of the sound reaching the listener decreases 6 dB with each doubling in source distance. Listeners rely on this loudness cue to determine relative changes in the distances of sounds, so it is possible to reduce the apparent distance of a sound in an audio display simply by increasing its amplitude. A number of earlier audio displays have used intensity cues to manipulate apparent distance.
While the intensity cue is useful for simulating changes in the relative distance of a sound, it provides little or no information about the absolute distance of the sound unless the listener has substantial a priori knowledge about the intensity of the source. Thus, listeners generally will not be able to identify the distance of a sound source in meters or feet from the intensity cue alone. The intensity cue also requires a wide dynamic range to be effective. Since the source intensity must increase 6 dB each time the distance of the source is decreased by half, 6 dB of dynamic range is required for each factor of 2 change in simulated distance. This is not a problem in quiet listening environments, but in noisy environments like aircraft cockpits, where virtual audio displays are often most valuable, the range of distance manipulation possible with intensity cues is very limited. Distant sounds will be attenuated below the noise floor and become inaudible, and nearby sounds will be uncomfortably loud or will overdrive the headphone system. It has been recognized in the prior art that all distances should be scaled to the range from 10 cm to 10 m from the listener's head in order to make the loudness cue effective in aerospace applications. Even this compressed range of simulated distances spans a factor of 100 in distance and would therefore require a dynamic range of about 40 dB, which would be difficult to achieve in the cockpit of a tactical jet aircraft.
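The 6 dB-per-doubling rule can be checked with a short calculation. The following is a minimal sketch, assuming an idealized point source in the free field; the function names are illustrative and do not come from any prior-art display.

```python
import math

def level_change_db(d_from_m, d_to_m):
    """Free-field level change in dB when a point source moves
    from d_from_m to d_to_m (positive means louder at the ear)."""
    return 20.0 * math.log10(d_from_m / d_to_m)

def required_dynamic_range_db(d_near_m, d_far_m):
    """Dynamic range needed to render every distance between
    d_near_m and d_far_m with the intensity cue alone."""
    return abs(level_change_db(d_near_m, d_far_m))

# Halving the distance raises the level by about 6 dB.
halving_gain_db = level_change_db(2.0, 1.0)

# A simulated range of 10 cm to 10 m is a factor of 100 in distance.
cockpit_range_db = required_dynamic_range_db(0.1, 10.0)
```

Under this assumption, each factor of 10 in simulated distance costs 20 dB of dynamic range, which is why even a compressed distance scale is demanding in a noisy environment.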
The second type of cue that has been used in known audio distance displays is based on reverberation. In a reverberant environment, the direct signal from the source decreases in amplitude 6 dB for each doubling in distance, while the reverberant sound in the room is roughly independent of distance. Consequently, it is possible to determine the distance of a sound source from the ratio of direct energy to reverberant energy in the audio signal. When the source is nearby, the direct-to-reverberant ratio is large, and when the source is distant, the ratio is small. This cue has previously been used to manipulate apparent distance in a virtual audio display. The importance of reverberation in human distance perception has been demonstrated in psychoacoustic experiments, and it is known to provide some information about the absolute distance of a sound. However, it also has serious drawbacks. The dynamic range requirements of the reverberation cue are just as demanding as those of the intensity cue, since the direct sound level changes 6 dB with each doubling in distance and must be audible in order to determine the direct-to-reverberant energy ratio. Reverberation cues are also computationally intensive, since each simulated room reflection requires as much processing power as a single source in an anechoic environment. They require the listener to have some a priori knowledge about the reverberation properties of the listening environment, and may produce inaccurate distance perception when the simulated listening environment does not match the visual surroundings of the listener. Finally, reverberation can decrease the intelligibility of speech and the listener's ability to localize the directions of all types of sounds.
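The relationship between distance and the direct-to-reverberant ratio can be sketched numerically. This is a simplified illustration, assuming the direct sound follows the free-field 6 dB-per-doubling rule and the reverberant level is constant throughout the room. The `critical_distance_m` parameter (the distance at which direct and reverberant energy are equal) is an assumption introduced here, and the function names are hypothetical.

```python
import math

def direct_to_reverberant_db(distance_m, critical_distance_m=1.0):
    """Direct-to-reverberant ratio in dB under the simplified model:
    direct level falls 6 dB per doubling, reverberant level is constant.
    The ratio is 0 dB at the (assumed) critical distance."""
    return 20.0 * math.log10(critical_distance_m / distance_m)

def distance_from_drr_m(drr_db, critical_distance_m=1.0):
    """Invert the model: recover source distance from a given ratio."""
    return critical_distance_m * 10.0 ** (-drr_db / 20.0)

# Nearby source: large (positive) ratio. Distant source: small (negative) ratio.
near_drr_db = direct_to_reverberant_db(0.5)
far_drr_db = direct_to_reverberant_db(8.0)
```

The inversion only works if the listener effectively knows the critical distance of the room, which mirrors the a priori knowledge requirement noted above.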
One type of auditory distance cue that has not been exploited in any previous virtual audio displays is based on the changes that occur in the characteristics of speech when the talker increases the output level of his or her voice. These changes make it possible for a listener to estimate the output level of the talker solely from the acoustic properties of the speech signal. Whispered speech, for example, is easily identified from the lack of voicing and implies a relatively low production level. Shouted speech, which is characterized by a higher fundamental frequency and greater high-frequency energy content than conversational speech, implies a relatively high production level. Since the intensity of the speech signal decreases 6 dB for each doubling in the distance of the talker, a listener should be able to estimate the distance of a live talker in the free field by comparing the apparent production level of speech to the level of the signal heard at the ears.
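The comparison described above amounts to inverting the 6 dB-per-doubling rule. The following is a minimal sketch, assuming the apparent production level is expressed as the level the talker would produce at a reference distance of 1 m in the free field. The function name and the numeric levels are illustrative values chosen for the arithmetic, not measurements.

```python
def estimate_talker_distance_m(production_level_db, received_level_db,
                               reference_distance_m=1.0):
    """Estimate talker distance from the gap between the apparent
    production level (referenced to reference_distance_m) and the
    level actually heard at the ears, assuming free-field spreading."""
    attenuation_db = production_level_db - received_level_db
    return reference_distance_m * 10.0 ** (attenuation_db / 20.0)

# Same level at the ears, three different apparent production levels
# (hypothetical numbers): lower vocal effort implies a closer talker.
received_db = 60.0
low_effort_m = estimate_talker_distance_m(40.0, received_db)
medium_effort_m = estimate_talker_distance_m(60.0, received_db)
high_effort_m = estimate_talker_distance_m(80.0, received_db)
```

With the received level held fixed, every 20 dB increase in apparent production level moves the estimated talker ten times farther away under this model.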
The salience of these voice-based distance cues has been confirmed in perceptual studies, which have shown that listeners can make reasonably accurate judgments about the distances of live talkers. Other studies have shown that whispered speech is perceived to be much closer than conversational speech and conversational speech is perceived to be much closer than shouted speech when all three types of speech are presented at the same listening level.
The present invention relies on the novel concept that virtual synthesis techniques can be used to systematically manipulate the perceived distance of speech signals over a wide range of distances. The present invention demonstrates that the apparent distances of synthesized speech signals can be reliably controlled by varying the vocal effort and loudness of the speech signal presented to the listener, and that these speech-based distance cues are remarkably robust across different talkers, listeners, and utterances. The invention described herein is a virtual audio display that uses manipulations of vocal effort and presentation level to control the apparent distances of synthesized speech signals.