Humans have only two ears, yet they can locate sounds in three dimensions. The brain, the inner ear, and the external ears work together to infer the location of an audio source. For a person to localize sound in three dimensions, the sound must perceptually arrive from a specific azimuth (θ), elevation (φ), and range (r). Humans estimate the source location from monaural cues derived from one ear and from difference cues obtained by comparing the signals received at both ears, based on both time-of-arrival differences and intensity differences. The primary cues for localizing sounds in the horizontal plane (azimuth) are binaural, based on the interaural level difference (ILD) and interaural time difference (ITD). Cues for localizing sound in the vertical plane (elevation) appear to be primarily monaural, although research has shown that elevation information can be recovered from the ILD alone. The cues for range are generally the least understood and are typically associated with room reverberation, but in the near field there is a pronounced increase in ILD as a source approaches the head from within approximately one meter.
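The azimuth dependence of the ITD described above can be sketched with the classical Woodworth spherical-head approximation, ITD = (a/c)(θ + sin θ). The head radius and speed-of-sound values below are illustrative assumptions, not values taken from this disclosure:

```python
import math

def itd_woodworth(azimuth_deg, head_radius_m=0.0875, c=343.0):
    """Woodworth spherical-head approximation of the interaural time
    difference (ITD) for a far-field source.
    azimuth_deg: 0 = straight ahead, 90 = directly to one side.
    head_radius_m and c (speed of sound, m/s) are illustrative defaults.
    Returns the ITD in seconds."""
    theta = math.radians(azimuth_deg)
    return (head_radius_m / c) * (theta + math.sin(theta))
```

For a source directly to the side (θ = 90°) this yields roughly 0.66 ms, which is consistent with the commonly cited maximum ITD for an average adult head.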
It is well known that the physical effects of the diffraction of sound waves by the human torso, shoulders, head and pinnae modify the spectrum of the sound that reaches the tympanic membrane. These changes are captured by the Head-Related Transfer Function (HRTF), which not only varies in a complex way with azimuth, elevation, range, and frequency, but also varies significantly from person to person. An HRTF is a response that characterizes how an ear receives a sound from a point in space, and a pair of these functions can be used to synthesize a binaural sound that emanates from a source location. The time-domain representation of the HRTF is known as the Head-Related Impulse Response (HRIR), and contains both amplitude and timing information that may be hidden in typical magnitude plots of the HRTF. The effects of the pinna are sometimes isolated and referred to as the Pinna-Related Transfer Function (PRTF).
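Synthesizing a binaural sound from a pair of HRIRs, as described above, amounts to filtering (convolving) the source signal with the left- and right-ear impulse responses. A minimal sketch follows, using direct-form convolution; the toy two-tap HRIRs in the usage note are purely illustrative, as measured HRIRs are typically hundreds of samples long:

```python
def fir_filter(signal, impulse_response):
    """Direct-form FIR convolution (illustrative; practical renderers
    typically use FFT-based fast convolution instead)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out

def binaural_synthesize(mono, hrir_left, hrir_right):
    """Render a mono source at the location for which the HRIR pair
    was measured, producing a (left, right) binaural signal pair."""
    return fir_filter(mono, hrir_left), fir_filter(mono, hrir_right)
```

For example, feeding a unit impulse through `binaural_synthesize([1.0], [0.5, 0.25], [0.25, 0.5])` simply reproduces the HRIR pair itself, since convolution with an impulse is the identity.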
HRTFs are used in certain audio products to reproduce surround sound over stereo headphones; similarly, HRTF processing has been included in computer software to simulate surround-sound playback from loudspeakers. To facilitate such audio processing, efforts have been made to replace measured HRTFs with computational models. Azimuth effects can be produced merely by introducing the proper ITD and ILD, and elevation effects can be created by introducing notches into the monaural spectrum. More sophisticated models provide head, torso, and pinna cues. Such prior efforts, however, are not necessarily optimal for reproducing newer-generation audio content based on advanced spatial cues. The spatial presentation of sound utilizes audio objects, which are audio signals with associated parametric source descriptions of apparent source position (e.g., 3D coordinates), apparent source width, and other parameters. New professional and consumer-level cinema systems (such as the Dolby® Atmos™ system) have been developed to further the concept of hybrid audio authoring, a distribution and playback format that includes both audio beds (channels) and audio objects. Audio beds are audio channels meant to be reproduced at predefined, fixed speaker locations, while audio objects are individual audio elements that may exist for a defined duration in time and also carry spatial information describing, for example, the position, trajectory, velocity, and size of each object. Thus, new spatial audio (also referred to as “adaptive audio”) formats comprise a mix of audio objects and traditional channel-based speaker feeds (beds), along with positional metadata for the audio objects.
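The simplest of the computational models mentioned above, producing azimuth effects from the proper ITD and ILD alone, can be sketched as a delay plus a level reduction at the far ear. The maximum ITD and ILD constants, and the sine-law scaling with azimuth, are illustrative assumptions rather than values from any particular product:

```python
import math

def azimuth_pan(mono, azimuth_deg, fs=48000, max_itd_s=0.00066, max_ild_db=10.0):
    """Toy azimuth model: delay the far-ear signal by an ITD and reduce
    its level by an ILD, both scaled by sin(azimuth). Constants are
    illustrative only.
    azimuth_deg: 0 = front, +90 = full right, -90 = full left.
    Returns (left, right) sample lists of equal length."""
    frac = math.sin(math.radians(azimuth_deg))        # -1 .. +1
    delay = int(round(abs(frac) * max_itd_s * fs))    # ITD in samples
    atten = 10.0 ** (-(abs(frac) * max_ild_db) / 20.0)
    near = list(mono) + [0.0] * delay                 # pad to common length
    far = [0.0] * delay + [atten * s for s in mono]
    # positive azimuth: source on the right, so the right ear is the near ear
    return (far, near) if frac >= 0 else (near, far)
```

Such a model conveys left/right position convincingly but, as the passage notes, cannot by itself provide elevation cues; those require spectral shaping such as monaural notches.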
Virtual rendering of spatial audio over a pair of speakers commonly involves creating a stereo binaural signal that represents the desired sound arriving at the listener's left and right ears, synthesized to simulate a particular audio scene in three-dimensional (3D) space, possibly containing a multitude of sources at different locations. For playback through headphones rather than speakers, binaural processing or rendering can be defined as a set of signal-processing operations aimed at reproducing the intended 3D location of a sound source over headphones by emulating the natural spatial listening cues of human subjects. The typical core components of a binaural renderer are head-related filtering, to reproduce direction-dependent cues, and distance-cue processing, which may involve modeling the influence of a real or virtual listening room or environment. In the consumer realm, audio content is increasingly played back through small mobile devices (e.g., MP3 players, iPods, smartphones) and listened to through headphones or earbuds. Such systems are usually lightweight, compact, and low-powered, and do not possess sufficient processing power to run full HRTF simulation software. Moreover, the sound field provided by headphones and similar close-coupled transducers can severely limit the ability to provide spatial cues for expansive audio content, such as that produced by movies or computer games.
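The core components named above (per-source head-related filtering plus a distance cue) can be sketched for a set of audio objects. The object fields, the hypothetical `hrir_lookup(azimuth, elevation)` function returning an HRIR pair, and the crude 1/r distance gain are all assumptions made for illustration; a real renderer would also model room response:

```python
def fir_filter(signal, impulse_response):
    """Direct-form FIR convolution (illustrative only)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out

def render_objects(objects, hrir_lookup):
    """Minimal binaural-renderer sketch: filter each object with the
    HRIR pair for its direction, apply a crude 1/r distance gain, and
    mix into a (left, right) pair. Each object is a dict with
    'samples', 'azimuth', 'elevation', and 'range' keys (hypothetical
    layout for this sketch)."""
    left, right = [], []
    for obj in objects:
        hl, hr = hrir_lookup(obj["azimuth"], obj["elevation"])
        gain = 1.0 / max(obj["range"], 1.0)   # crude distance attenuation
        l = [gain * s for s in fir_filter(obj["samples"], hl)]
        r = [gain * s for s in fir_filter(obj["samples"], hr)]
        n = max(len(left), len(l))            # zero-pad, then mix
        left += [0.0] * (n - len(left)); l += [0.0] * (n - len(l))
        right += [0.0] * (n - len(right)); r += [0.0] * (n - len(r))
        left = [a + b for a, b in zip(left, l)]
        right = [a + b for a, b in zip(right, r)]
    return left, right
```

Even this stripped-down loop makes the cost problem concrete: every object requires two long convolutions per output block, which is exactly the load that low-powered mobile devices struggle to sustain.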
What is needed is a system that can provide spatial audio over headphones and other playback methods on consumer devices, such as low-power mobile devices.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions.