1. Field of the Invention
This invention pertains generally to spatial sound capture and reproduction, and more particularly to methods and systems for capturing and reproducing the dynamic characteristics of three-dimensional spatial sound.
2. Description of Related Art
There are a number of alternative approaches to spatial sound capture and reproduction, and the particular approach used typically depends upon whether the sound sources are natural or computer-generated. An excellent overview of spatial sound technology for recording and reproducing natural sounds can be found in F. Rumsey, Spatial Audio (Focal Press, Oxford, 2001), and a comparable overview of computer-based methods for the generation and real-time “rendering” of virtual sound sources can be found in D. B. Begault, 3-D Sound for Virtual Reality and Multimedia (AP Professional, Boston, 1994). The following is an overview of some of the better known approaches.
Surround sound (e.g. stereo, quadraphonics, Dolby® 5.1, etc.) is by far the most popular approach to recording and reproducing spatial sound. This approach is conceptually simple; namely, put a loudspeaker wherever you want sound to come from, and the sound will come from that location. In practice, however, it is not that simple. It is difficult to make sounds appear to come from locations between the loudspeakers, particularly along the sides. If the same sound comes from more than one speaker, the precedence effect results in the sound appearing to come from the nearest speaker, which is particularly unfortunate for people seated close to a speaker. The best results restrict the listener to staying near a fairly small “sweet spot.” Also, the need for multiple high-quality speakers is inconvenient and expensive and, for use in the home, many people find the use of more than two speakers unacceptable.
There are alternative ways to realize surround sound to lessen its limitations. For example, home theater systems typically provide a two-channel mix that includes psychoacoustic effects to expand the sound stage beyond the space between the two loudspeakers. It is also possible to avoid the need for multiple loudspeakers by transforming the speaker signals to headphone signals, which is the technique used in the so-called Dolby® headphones. However, each of these alternatives also has its own limitations.
Surround sound systems are good for reproducing sounds coming from a distance, but are generally not able to produce the effect of a source that is very close, such as someone whispering in your ear. Finally, making an effective surround-sound recording is a job for a professional sound engineer; the approach is unsuitable for teleconferencing or for an amateur.
Another approach is Ambisonics™. While not widely used, the Ambisonics approach to surround sound solves much of the problem of making the recordings (M. A. Gerzon, “Ambisonics in multichannel broadcasting and video,” Preprint 2034, 74th Convention of the Audio Engineering Society (New York, Oct. 8-12, 1983); subsequently published in J. Aud. Eng. Soc., Vol. 33, No. 11, pp. 859-871 (October, 1985)). It has been described abstractly as a method for approximating an incident sound field by its low-order spherical harmonics (J. S. Bamford and J. Vanderkooy, “Ambisonic sound for us,” Preprint 4138, 99th Convention of the Audio Engineering Society (New York, Oct. 6-9, 1995)). Ambisonic recordings use a special, compact microphone array called a SoundField™ microphone to sense the local pressure plus the pressure differences in three orthogonal directions. The basic Ambisonic approach has been extended to allow recording from more than three directions, providing better angular resolution with a corresponding increase in complexity.
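As a rough illustration of the four signals described above (local pressure plus three orthogonal pressure differences), the following sketch encodes one sample of a distant source into first-order components. It assumes the traditional B-format convention, with azimuth measured counterclockwise from straight ahead and the omnidirectional component W scaled by 1/sqrt(2). The function name `encode_b_format` is hypothetical, and the sketch models an ideal encoder rather than the SoundField microphone itself.

```python
import math

def encode_b_format(sample, azimuth_deg, elevation_deg=0.0):
    """Encode one sample of a far-field source into first-order B-format.

    Assumed convention (not taken from the cited references):
    azimuth is counterclockwise from straight ahead, X points forward,
    Y points left, Z points up, and W carries a 1/sqrt(2) gain.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)                  # local pressure
    x = sample * math.cos(az) * math.cos(el)     # front/back difference
    y = sample * math.sin(az) * math.cos(el)     # left/right difference
    z = sample * math.sin(el)                    # up/down difference
    return w, x, y, z

# A source directly to the left in the horizontal plane.
w, x, y, z = encode_b_format(1.0, 90.0)
```

For a source directly to the left, essentially all of the directional signal appears in Y, while X and Z are (numerically) zero.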
As with other surround-sound methods, Ambisonics uses matrixing methods to drive an array of loudspeakers, and thus has all of the other advantages and disadvantages of multi-speaker systems. In addition, all of the speakers are used in reproducing the local pressure component. As a consequence, when the listener is located in the sweet spot, that component tends to be heard as if it were inside the listener's head, and head motion introduces distracting timbral artifacts (W. G. Gardner, 3-D Audio Using Loudspeakers (Kluwer Academic Publishers, Boston, 1998), p. 18).
Wave-field synthesis is another approach, although not a very practical one. In theory, with enough microphones and enough loudspeakers, it is possible to use sounds captured by microphones on a surrounding surface to reproduce the sound pressure fields that are present throughout the interior of the space where the recording was made (M. M. Boone, “Acoustic rendering with wave field synthesis,” Proc. ACM SIGGRAPH and Eurographics Campfire: Acoustic Rendering for Virtual Environments, Snowbird, Utah, May 26-29, 2001). Although the theoretical requirements are severe (i.e., hundreds of thousands of loudspeakers), systems using arrays of more than 100 loudspeakers have been constructed and are said to be effective. However, this approach is clearly not cost-effective.
Binaural capture is still another approach. It is well known that it is not necessary to have hundreds of channels to capture three-dimensional sound; in fact, two channels are sufficient. Two-channel binaural or “dummy-head” recordings, which are the acoustic analog of stereoscopic reproduction of 3-D images, have long been used to capture spatial sound (J. Sunier, “Binaural overview: Ears where the mikes are. Part I,” Audio, Vol. 73, No. 11, pp. 75-84 (November 1989); J. Sunier, “Binaural overview: Ears where the mikes are. Part II,” Audio, Vol. 73, No. 12, pp. 49-57 (December 1989); K. Genuit, H. W. Gierlich, and U. Künzli, “Improved possibilities of binaural recording and playback techniques,” Preprint 3332, 92nd Convention of the Audio Engineering Society (Vienna, March 1992)). The basic idea is simple. The primary information used by the human brain to perceive the spatial characteristics of sound is carried by the pressure waves that reach the eardrums of the left and right ears. If these pressure waves can be reproduced, the listener should hear the sound exactly as if he or she were present when the original sound was produced.
The pressure waves that reach the eardrums are influenced by several factors, including (a) the sound source, (b) the listening environment, and (c) the reflection, diffraction and scattering of the incident waves by the listener's own body. If a mannequin having exactly the same size, shape, and acoustic properties as the listener is equipped with microphones located in the ear canals where the human eardrums are located, the signals reaching the eardrums can be transmitted or recorded. When the signals are heard through headphones (with suitable compensation to correct for the transfer function from the headphone driver to the eardrums), the sound pressure waveforms are reproduced, and the listener hears the sounds with all the correct spatial properties, just as if he or she were actually present at the location and orientation of the mannequin. The primary problem is to correct for ear-canal resonance. Because the headphone driver is outside the ear canal, the ear-canal resonance appears twice: once in the recording, and once in the reproduction. This has led to the recommendation of using so-called “blocked meatus” recordings, in which the ear canals are blocked and the microphones are flush with the blocked entrance (H. Møller, “Fundamentals of binaural technology,” Applied Acoustics, Vol. 36, No. 5, pp. 171-218 (1992)). With binaural capture, and, in particular, in telephony applications, the room reverberation sounds natural. It is a universal experience with speaker phones that the environment sounds excessively hollow and reverberant, particularly if the person speaking is not close to the microphone. When heard with a binaural pickup, awareness of this distracting reverberation disappears, and the environment sounds natural and clear.
Still, there are problems associated with binaural sound capture and reproduction. The most obvious problems are actually not always important. They include (a) the inevitable mismatch between the size, shape, and acoustic properties of a mannequin and any particular listener, including the effects of hair and clothing, (b) the differences between the eardrum and a microphone as a pressure sensing element, and (c) the influence of non-acoustic factors such as visual or tactile cues on the perceived location of sound sources. In the KEMAR™ mannequin, for example, considerable effort was devoted to using a so-called “Zwislocki coupler” to simulate the effects of the eardrum impedance (M. D. Burkhard and R. M. Sachs, “Anthropometric manikin for auditory research,” J. Acoust. Soc. Am., Vol. 58, pp. 214-222 (1975). KEMAR is manufactured by Knowles Electronics, 1151 Maplewood Drive, Itasca, Ill., 60143). However, it will be appreciated that microphones, good as they can be, are not equivalent to eardrums as transducers.
A much more important limitation is the lack of the dynamic cues that arise from motion of the listener's head. Suppose that a sound source is located to the left of the mannequin. The listener will also hear the sound as coming from the listener's left side. However, suppose that the listener turns to face the source while the sound is active. Because the recording is unaware of the listener's motion, the sound will continue to appear to come from the listener's left side. From the listener's perspective, it is as if the sound source moved around in space to stay on the left side. If there are many sound sources active, when the listener moves, the experience is that the whole acoustic world moves in exact synchrony with the listener. To have a sense of “virtual presence,” that is, of actually being present in the environment where the recording was made, stationary sound sources should remain stationary when the listener moves. Said another way, the spatial locations of virtual auditory sources should be stable and independent of motions of the listener.
There is reason to believe that the effects of listener motion are responsible for another defect of binaural recordings. It is a universal experience when listening to binaural recordings that sounds to the left or right seem to be naturally distant, but sounds that are directly ahead always seem to be much too close. In fact, some listeners experience the sound source as being inside their heads, or even in back. Several reasons have been advanced for this loss of “frontal externalization.” One argument is that we expect to see sound sources that are directly ahead of us, and when the confirming visual cue is absent, we tend to project the location of the source behind us. Indeed, in real-life situations it is frequently difficult to tell whether a source of sound is in front of us or behind us, which is why we turn to look around when we are unsure. However, it is not necessary to turn completely around to resolve front/back ambiguity. Suppose that a sound source is located anywhere in the vertical median plane. Because our bodies are basically symmetrical about this plane, the sounds reaching the two ears will be essentially the same. But suppose that we turn our heads a small amount to the left. If the source were actually in front, the sound would now reach the right ear before reaching the left ear, whereas if the source were in back, the opposite would be the case. This change in the interaural time difference is often sufficient to resolve the front/back ambiguity.
But notice what happens with a standard binaural recording. When the source is directly ahead, we receive the same signal in both the left and the right ears. Because the recording is unaware of the listener's motion, the two signals continue to be the same when we move our heads. Now, if you ask yourself where a sound source could possibly be if the sounds in the two ears remain identical regardless of head motion, the answer is “inside your head.” Dynamic cues are very powerful. Standard binaural recordings do not account for such dynamic cues, which is a major reason for the “frontal collapse.”
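The front/back argument above can be illustrated numerically. The sketch below is a simplification, not a model taken from the cited references: it treats the ears as two points separated by the head diameter, assumes a distant source, and ignores diffraction around the head. The head radius, the speed of sound, and all function names are assumed for illustration. Azimuth and head yaw are measured counterclockwise, so a turn to the left is a positive yaw.

```python
import math

HEAD_RADIUS_M = 0.0875        # assumed typical head radius
SPEED_OF_SOUND_M_S = 343.0    # assumed, air at roughly room temperature

def right_ear_lag_seconds(source_az_deg, head_yaw_deg):
    """Arrival time at the right ear minus arrival time at the left ear.

    Positive means the left ear hears the sound first; negative means
    the right ear hears it first. Two-point, far-field approximation.
    """
    rel = math.radians(source_az_deg - head_yaw_deg)
    return (2.0 * HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * math.sin(rel)

def static_recording_lag_seconds(source_az_deg):
    """A standard binaural recording: the dummy head never turns,
    so the listener's yaw has no effect on the interaural lag."""
    return right_ear_lag_seconds(source_az_deg, 0.0)

TURN_LEFT_DEG = 10.0
front_tracked = right_ear_lag_seconds(0.0, TURN_LEFT_DEG)    # source ahead
back_tracked = right_ear_lag_seconds(180.0, TURN_LEFT_DEG)   # source behind
front_static = static_recording_lag_seconds(0.0)             # yaw ignored
```

Under these assumptions, a 10-degree turn to the left produces lags of equal magnitude (a little under 90 microseconds) and opposite sign for the front and back sources: the right ear leads for the source in front, and the left ear leads for the source in back. For the static recording the lag stays at zero however the listener turns.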
One way to fix these problems is to use a servomechanism to make the dummy head turn when the listener's head turns. Indeed, such a system was implemented by Horbach et al. (U. Horbach, A. Karamustafaoglu, R. Pellegrini, P. Mackensen and G. Theile, “Design and applications of a data-based auralization system for surround sound,” Preprint 4976, 106th Convention of the Audio Engineering Society (Munich, Germany, May 8-11, 1999)). They reported that their system produced extremely natural sound, and virtually eliminated front/back confusions. Although their system was very effective, it is clearly limited to use by only one listener at a time, and it cannot be used at all for recording.
There are also many Virtual-Auditory-Space systems (VAS systems) that use head-tracking methods to achieve the following advantages in rendering computer-generated sounds: (i) stable locations for virtual auditory sources, independent of the listener's head motion; (ii) good frontal externalization; and (iii) little or no front/back confusion. However, VAS systems require: (i) isolated signals for each sound source; (ii) knowledge of the location of each sound source; (iii) as many channels as there are sources; (iv) head-related transfer functions (HRTFs) to spatialize each source separately; and (v) additional signal processing to approximate the effects of room echoes and reverberation.
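The requirements listed above can be made concrete with a minimal sketch of a head-tracked rendering loop. It is illustrative only: `toy_hrir` is a hypothetical stand-in for measured head-related impulse responses (a real system would look up or interpolate measured data), room effects are omitted, and all names are assumptions. Azimuth and yaw are measured counterclockwise, so azimuths between 0 and 180 degrees lie to the listener's left.

```python
def convolve(signal, kernel):
    """Direct-form convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

def add_into(mix, signal):
    """Sample-wise sum of two lists of possibly different length."""
    n = max(len(mix), len(signal))
    return [(mix[i] if i < len(mix) else 0.0) +
            (signal[i] if i < len(signal) else 0.0) for i in range(n)]

def toy_hrir(rel_az_deg):
    """HYPOTHETICAL placeholder for measured HRIRs.

    The ear nearer the source gets the sound undelayed; the far ear
    gets it one sample later at half amplitude.
    Returns (left_impulse_response, right_impulse_response).
    """
    near, far = [1.0, 0.0], [0.0, 0.5]
    if 0.0 < rel_az_deg < 180.0:      # source on the listener's left
        return near, far
    if 180.0 < rel_az_deg < 360.0:    # source on the listener's right
        return far, near
    return near, near                 # source in the median plane

def render_binaural(sources, head_yaw_deg, hrir_lookup):
    """sources: list of (samples, world_azimuth_deg), one per isolated source."""
    left_mix, right_mix = [], []
    for samples, world_az_deg in sources:
        # Head tracking: express the source direction relative to the head.
        rel = (world_az_deg - head_yaw_deg) % 360.0
        h_left, h_right = hrir_lookup(rel)
        left_mix = add_into(left_mix, convolve(samples, h_left))
        right_mix = add_into(right_mix, convolve(samples, h_right))
    return left_mix, right_mix

# One click located to the listener's right, heard before and after
# the listener turns to face it.
before = render_binaural([([1.0], 270.0)], 0.0, toy_hrir)
after = render_binaural([([1.0], 270.0)], 270.0, toy_hrir)
```

Because the relative direction is recomputed from the tracked yaw, the source stays fixed in space: once the listener faces it, both ears receive the same signal. The sketch also shows why each source needs its own isolated signal and known location, since every source is filtered separately before mixing.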
It is possible to apply VAS techniques to recordings intended to be heard through loudspeakers, such as stereo or surround-sound recordings. In this case, the sound sources (the loudspeakers) are isolated, and their number and locations are known. The recordings provide the separate channels and the sound sources are simulated loudspeakers located in a simulated room. The VAS system renders these sound signals just as it would render computer-generated signals. Indeed, there are commercial products (such as the Sony MDR-DS8000 headphones) that apply head tracking to surround-sound recordings in just this way. However, the best that such systems can do is to recreate through headphones the experience of listening to the loudspeakers.
They are not readily applicable to live recordings, and are totally inappropriate for teleconferencing. They inherit all of the many problems of surround-sound and Ambisonic systems, save for the need for multiple loudspeakers.
There are also many methods for recording and reproducing live spatial sound using more than two microphones. However, we know of only one system for capturing live sound that is designed for headphone playback and that responds to dynamic motions of the listener. That system, which we refer to as the McGrath system, is described in U.S. Pat. Nos. 6,021,206 and 6,259,795. The primary difference between these patents is that the first concerns a single listener, while the second concerns multiple listeners. Both of these patents concern the binaural spatialization of recordings made with the SoundField microphone (F. Rumsey, Spatial Audio (Focal Press, Oxford, 2001), pp. 204-205).
The McGrath system has the following characteristics: (i) when the sound is recorded, the orientation of the listener's head is unknown; (ii) the position of the listener's head is measured with a head tracker; (iii) a signal processing procedure is used to convert the multichannel recording to a binaural recording; and (iv) the main goal is to produce virtual sources whose locations do not change when the listener moves his or her head. Note that Ambisonic recording as used in the McGrath system attempts to capture the sound field that would be developed at a listener's location when the listener is absent; it does not capture the sound field at a listener's location when the listener is present. Nor does Ambisonic recording directly capture interaural time differences, interaural level differences, and spectral changes introduced by the head-related transfer function (HRTF) for a spherical head. Thus, the McGrath system must use the recorded signals to reconstruct incoming waves from multiple directions and use HRTFs to spatialize each incoming wave separately. Although the McGrath system can employ an individualized HRTF, the system is complex and the reconstruction still suffers from all of the limitations associated with Ambisonics.
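For orientation only, the following sketch shows one generic way that a first-order Ambisonic signal can be compensated for a tracked head yaw before any HRTF processing. It is not taken from the cited patents and should not be read as their procedure. It assumes the traditional B-format convention (X forward, Y left, azimuth and yaw counterclockwise), and the function name `rotate_b_format_yaw` is hypothetical. The subsequent decoding to incoming waves and the HRTF filtering are omitted.

```python
import math

def rotate_b_format_yaw(w, x, y, z, head_yaw_deg):
    """Re-express first-order components relative to a head turned by
    head_yaw_deg (counterclockwise). W and Z are unaffected by yaw."""
    psi = math.radians(head_yaw_deg)
    x_rot = x * math.cos(psi) + y * math.sin(psi)
    y_rot = -x * math.sin(psi) + y * math.cos(psi)
    return w, x_rot, y_rot, z

# A horizontal source 30 degrees to the left (unit amplitude), and a
# listener who turns 30 degrees to the left to face it.
theta = math.radians(30.0)
w0, x0, y0, z0 = 1.0 / math.sqrt(2.0), math.cos(theta), math.sin(theta), 0.0
w1, x1, y1, z1 = rotate_b_format_yaw(w0, x0, y0, z0, 30.0)
```

After the rotation the source lies straight ahead in head coordinates: the forward component is at full amplitude and the left/right component vanishes, to within rounding.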