The basic idea behind spatial sound is to process a sound source so that it will contain the necessary spatial attributes of a source located at a particular point in a 3D space. The listener will then perceive the sound as if it were coming from the intended location. The resulting audio is commonly referred to as virtual sound since the spatially positioned sounds are synthetically produced. Virtual spatial sound has long been an active research topic and has recently increased in popularity because of the increase in raw digital processing power. The real-time processing that once required special dedicated hardware can now be performed on a commercial computer.
When locating sound sources, listeners unknowingly determine the azimuth, elevation, and range of the source.
To determine the source azimuth (the angle between the listener's forward-facing direction and the sound source), two primary cues are used: the interaural time difference (ITD) and the interaural level difference (ILD). Simply put, this means that sound from sources outside the median plane (not directly in front of the listener) will arrive at one ear before the other (ITD), and the sound pressure level at one ear will be greater than at the other (ILD). FIG. 1a shows an image of a sound source 100 as it propagates towards the listener's ears 102,103. This figure shows the extra distance the sound must travel to reach the left ear (contralateral ear) 102 (hence, the left ear has a longer arrival time). Additionally, the head will naturally reflect and absorb more of the sound wave before it reaches the left ear 102. This is referred to as a head shadow and the result is a diminished sound pressure level at the left ear 102.
The listener's pinna (outer ear) is the primary mechanism for providing elevation cues for a source, as shown in FIGS. 1b & 1c. To determine range, the loudness of the source 100 and the ratio of direct to reverberant energy are used. There are a number of other factors that can be considered, but these are the primary cues that one attempts to reproduce to accurately represent a source at a particular location in space.
Reproducing spatial sound can be done using either loudspeakers or headphones; however, headphones are commonly used since they are easily controlled. A major obstacle to loudspeaker reproduction is the cross-talk that occurs between the left and right loudspeakers. Furthermore, headphone-based reproduction eliminates the need for a sweet-spot. The virtual sound synthesis techniques discussed assume headphone-based reproduction.
The most common approach for rendering virtual spatial sound is through the use of Head Related Impulse Responses (HRIRs) or their frequency-domain equivalents, Head Related Transfer Functions (HRTFs). These transfer functions completely characterize the changes a sound wave undergoes as it travels from the sound source to the listener's inner ear. HRTFs vary with source azimuth, elevation, range and frequency, so a complete collection of measurements is needed if a source is to be placed anywhere in a 3D space.
If the source or listener were to move so that the source position relative to the listener changes, the HRTFs need to be updated to reflect the new source position. In this implementation, a pair of left/right HRTFs is selected from a lookup table based on the listener's head position/rotation and the source position. The left and right ear signals are then synthesized by filtering the audio data with these HRTFs (or, in the time domain, by convolving the audio data with the HRIRs).
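The lookup-and-filter step described above can be illustrated with a minimal Python sketch. It is not the implementation referred to in this description. The names `convolve`, `select_hrir_pair`, `render_binaural` and `TOY_HRIRS` are hypothetical, the table is indexed by azimuth only (measured clockwise from the listener's forward direction, an assumed convention), and the three-tap impulse responses are made-up placeholder values rather than measured HRIRs.

```python
def convolve(signal, impulse_response):
    """Direct-form FIR convolution; output has len(signal) + len(ir) - 1 samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def select_hrir_pair(table, azimuth_deg):
    """Return the (left, right) HRIR pair whose table azimuth is closest on the circle."""
    def circular_distance(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    nearest = min(table, key=lambda az: circular_distance(az, azimuth_deg))
    return table[nearest]


def render_binaural(mono, table, source_az_deg, head_yaw_deg):
    """Filter a mono signal with the HRIR pair for the source direction relative to the head."""
    # Turning the head changes the source direction as seen by the listener.
    relative_az = (source_az_deg - head_yaw_deg) % 360.0
    hrir_left, hrir_right = select_hrir_pair(table, relative_az)
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Placeholder table (NOT measured data): azimuth in degrees -> (left HRIR, right HRIR).
# For a source on the right (90), the left ear gets a delayed, attenuated copy.
TOY_HRIRS = {
    0.0: ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),
    90.0: ([0.0, 0.0, 0.5], [1.0, 0.0, 0.0]),
    180.0: ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),
    270.0: ([1.0, 0.0, 0.0], [0.0, 0.0, 0.5]),
}

left, right = render_binaural([1.0, 0.5], TOY_HRIRS, source_az_deg=90.0, head_yaw_deg=0.0)
```

With the placeholder table, a source at 90 degrees produces a left-ear signal that is two samples late and half the amplitude of the right-ear signal, a crude stand-in for the ITD and ILD cues. A real table would also be indexed by elevation and range and would hold far longer filters.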
HRTFs can synthesize very realistic spatial sound. Unfortunately, since HRTFs capture the effects of the listener's head, pinna (outer ear), and possibly the torso, the resulting functions are very listener dependent. If the HRTF doesn't match the anthropometry of the listener, then it can fail to produce the virtual sounds accurately. A generalized HRTF that can be tuned for any listener continues to be an active research topic.
Another drawback of HRTF synthesis is the amount of computation required. HRTFs are rather short filters and therefore do not capture the acoustics of a room. Introducing room reflections drastically increases the computation since each reflection should be spatialized by filtering the reflection with the appropriate pair of HRTFs.
A less individualized, but more computationally efficient, implementation uses a model-based HRTF. A model strives to capture the primary localization cues as accurately as possible regardless of the listener's anthropometry. Typically, a model can be tuned to the listener's liking. One such model is the spherical head model. This model replaces the listener's head with a sphere that closely matches the listener's head diameter (where the diameter can be changed). The model produces accurate ILD changes caused by head-shadowing. The ITD can then be found from the source-to-listener geometry. While not the ideal case, such models can offer a close approximation and are typically more computationally efficient than measured HRTFs. One major drawback is that since the spherical head model does not include pinnae (outer ears), the elevation cues are not preserved.
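As an illustration of how an ITD can be found from the source-to-listener geometry of a spherical head, the sketch below uses the classical Woodworth far-field approximation, ITD = (a/c)(θ + sin θ), where a is the head radius, c the speed of sound, and θ the lateral angle of the source. This formula is brought in here as an assumption; the text above does not state which geometric formula a given model uses. The function name `woodworth_itd`, the default radius of 0.0875 m, the speed of sound of 343 m/s, and the sign convention (positive when the source is to the listener's right) are likewise assumed for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly room temperature


def woodworth_itd(azimuth_deg, head_radius_m=0.0875):
    """Approximate ITD in seconds for a distant source in the horizontal plane.

    Positive values mean the sound reaches the right ear first (assumed convention).
    """
    azimuth = math.radians(azimuth_deg)
    # A sphere is front-back symmetric, so fold the azimuth into [-90, +90] degrees.
    lateral = math.asin(math.sin(azimuth))
    return (head_radius_m / SPEED_OF_SOUND_M_S) * (lateral + math.sin(lateral))


itd_side = woodworth_itd(90.0)   # source directly to the right, about 0.66 ms
itd_front = woodworth_itd(0.0)   # source straight ahead, no time difference
```

Changing `head_radius_m` corresponds to the adjustable head diameter mentioned above. Because the sphere has no pinnae, the sketch returns the same value for a source at 30 degrees and at 150 degrees, which reflects the front-back ambiguity that accompanies the loss of pinna cues.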
A recent alternative technique is Motion-Tracked Binaural (MTB) sound. As its name suggests, MTB is a generalization of binaural recordings, which offer the most realistic spatial sound reproductions as they capture all of the static localization cues including the room acoustics. This technology was developed at the Center for Image Processing and Integrated Computing (CIPIC) at U.C. Davis. The difference between MTB and other binaural recordings is that MTB captures the entire sound field (in the horizontal plane, 0 degrees elevation), thus preserving the dynamic localization cues. Unlike conventional binaural recordings, which rotate with the listener's head, MTB stabilizes the reproduced sound field as the listener turns his head.
The MTB synthesis technique operates on a total of either 8 or 16 audio channels (for full 360 degree sound reproduction). The channels can either be recorded live using an MTB microphone array, or they can be virtually produced using the measured responses, Room Impulse Responses (RIRs), of the MTB microphone array. The conversion of a stereo audio track to the MTB signals can be done in non-real-time, leaving only a small interpolation operation to be performed in real time between the nearest and next-nearest microphone for each of the listener's ears, as shown in FIG. 1d.
FIG. 1d shows an image of an 8-channel MTB microphone array shown as audio channels 104-111. From this figure it can be seen that the signals for the listener's left and right ears 112,113 are synthesized from the audio channels that surround the ears (the nearest and next-nearest audio channels). For the listener's head position shown, the left ear's nearest audio channel and next nearest audio channel are audio channels 104 and 105, respectively. The right ear's nearest and next nearest audio channels are audio channels 108 and 109, respectively. This technique requires very little real-time processing at the expense of slightly more storage for the additional audio channels.
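The real-time interpolation between the nearest and next-nearest audio channels might be sketched as follows. This is a simplified illustration under stated assumptions, not the MTB algorithm itself: the microphones are assumed to be equally spaced around the circle with channel 0 at 0 degrees, the ears are assumed to sit at 90 degrees to either side of the facing direction, and a plain full-band linear crossfade is used. The function names `mtb_ear_sample` and `mtb_render` are hypothetical.

```python
import math


def mtb_ear_sample(mic_samples, ear_angle_deg):
    """Linearly interpolate one sample between the two microphones surrounding an ear.

    mic_samples holds one sample per channel for the current time instant,
    with channel k assumed to sit at k * (360 / N) degrees.
    """
    n = len(mic_samples)
    spacing_deg = 360.0 / n
    position = (ear_angle_deg % 360.0) / spacing_deg  # fractional channel index
    lower = int(math.floor(position)) % n             # channel just below the ear angle
    upper = (lower + 1) % n                           # channel just above, wrapping around
    frac = position - math.floor(position)            # 0 at lower channel, near 1 at upper
    return (1.0 - frac) * mic_samples[lower] + frac * mic_samples[upper]


def mtb_render(mic_samples, head_yaw_deg):
    """Return (left, right) ear samples for the tracked head yaw."""
    left = mtb_ear_sample(mic_samples, head_yaw_deg - 90.0)
    right = mtb_ear_sample(mic_samples, head_yaw_deg + 90.0)
    return left, right


# Eight channels whose sample value equals the channel index, to make the weights visible.
channels = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
left, right = mtb_render(channels, head_yaw_deg=22.5)
```

With the head turned 22.5 degrees, each ear lies midway between two channels, so each ear signal is an equal mix of its two surrounding channels. Only two multiplications and one addition per ear per sample are needed, which is consistent with the small real-time cost described above.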
What is needed is a system and method for presenting virtual spatial sound that captures realistic spatial acoustic attributes of a sound source and that is computationally efficient. An audiovisual player is needed that will provide for changes in spatial attributes in real time.
Many audio players today allow a user to have a library of audio files stored in memory. Furthermore, these audio files may be organized into playlists which include a list of specific audio files. For example, a playlist entitled “Classical Music” may be created which includes all of a user's classical music audio files. What is needed is a playlist that will take into account spatial attributes of audio files. Furthermore, what is needed is a way to share the playlists.
Some audio players exist that allow audio streams from remote sites to be played. Furthermore, search engines exist that allow for searching of audio and video streams available on the internet. However, opening several application windows for web browsing, identifying audio/video streams, and audio playing can be inconvenient. What is needed is an audiovisual player that provides for this multitude of tasks in a single application window. Still further, what is needed is an audiovisual player that also provides spatial sound in addition to this multitude of tasks.