Spatial audio reproduction has interested audio engineers and the consumer electronics industry for several decades. Spatial sound reproduction requires a two-channel or multi-channel electro-acoustic system (e.g., loudspeakers, headphones) which must be configured according to the context of the application (e.g., concert performance, motion picture theater, domestic hi-fi installation, computer display, individual head-mounted display), further described in Jot, Jean-Marc, “Real-time Spatial Processing of Sounds for Music, Multimedia and Interactive Human-Computer Interfaces,” IRCAM, 1 Place Igor-Stravinsky, 1997, (hereinafter “Jot, 1997”), incorporated herein by reference.
The development of audio recording and reproduction techniques for the motion picture and home video entertainment industry has resulted in the standardization of various multi-channel “surround sound” recording formats (most notably the 5.1 and 7.1 formats). Various audio recording formats have been developed for encoding three-dimensional audio cues in a recording. These 3-D audio formats include Ambisonics and discrete multi-channel audio formats comprising elevated loudspeaker channels, such as the NHK 22.2 format.
A backward-compatible downmix is included in the soundtrack data stream of various multi-channel digital audio formats, such as DTS-ES and DTS-HD from DTS, Inc. of Calabasas, Calif., and can be decoded by legacy decoders and reproduced on existing playback equipment. The soundtrack data stream also includes a data stream extension that carries additional audio channels; these channels are ignored by legacy decoders but can be used by non-legacy decoders. For example, a DTS-HD decoder can recover these additional channels, subtract their contribution from the backward-compatible downmix, and render them in a target spatial audio format different from the backward-compatible format, which can include elevated loudspeaker positions. In DTS-HD, the contribution of the additional channels to the backward-compatible mix and to the target spatial audio format is described by a set of mixing coefficients (e.g., one for each loudspeaker channel). The target spatial audio formats for which the soundtrack is intended are specified at the encoding stage.
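The residual-decoding step described above can be sketched in a few lines. This is a hypothetical illustration, not the DTS-HD algorithm itself: the function name, data layout, and coefficients below are made up for exposition, assuming each additional channel contributes to the downmix through one gain per downmix channel.

```python
def subtract_contribution(downmix, extra, mix_coeffs):
    """Remove each additional channel's weighted contribution from a downmix.

    downmix:    list of downmix channel signals (each a list of samples)
    extra:      list of additional (extension) channel signals
    mix_coeffs: mix_coeffs[i][j] = gain of extra channel i into downmix channel j

    Returns the downmix with the additional channels' contributions removed,
    so the extras can be re-rendered separately in a target format.
    """
    out = [list(ch) for ch in downmix]  # copy; leave the input intact
    for i, src in enumerate(extra):
        for j, gain in enumerate(mix_coeffs[i]):
            for n, sample in enumerate(src):
                out[j][n] -= gain * sample
    return out
```

A non-legacy decoder would then feed the recovered extras through its own renderer for the target loudspeaker layout, while a legacy decoder simply plays the unmodified downmix.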
This approach allows for the encoding of a multi-channel audio soundtrack in the form of a data stream compatible with legacy surround sound decoders and one or more alternative target spatial audio formats also selected during the encoding/production stage. These alternative target formats may include formats suitable for the improved reproduction of three-dimensional audio cues. However, one limitation of this scheme is that encoding the same soundtrack for another target spatial audio format requires returning to the production facility in order to record and encode a new version of the soundtrack that is mixed for the new format.
Object-based audio scene coding offers a general solution for soundtrack encoding independent from the target spatial audio format. An example of an object-based audio scene coding system is the MPEG-4 Advanced Audio Binary Format for Scenes (AABIFS). In this approach, each of the source signals is transmitted individually, along with a render cue data stream. This data stream carries time-varying values of the parameters of a spatial audio scene rendering system. This set of parameters may be provided in the form of a format-independent audio scene description, such that the soundtrack may be rendered in any target spatial audio format by designing the rendering system according to this format. Each source signal, in combination with its associated render cues, defines an “audio object.” This approach enables the renderer to implement the most accurate spatial audio synthesis technique available to render each audio object in any target spatial audio format selected at the reproduction end. Object-based audio scene coding systems also allow for interactive modifications of the rendered audio scene at the decoding stage, including remixing, music re-interpretation (e.g., karaoke), or virtual navigation in the scene (e.g., video gaming).
The need for low-bit-rate transmission or storage of multi-channel audio signals has motivated the development of new frequency-domain Spatial Audio Coding (SAC) techniques, including Binaural Cue Coding (BCC) and MPEG-Surround. In an exemplary SAC technique, an M-channel audio signal is encoded in the form of a downmix audio signal accompanied by a spatial cue data stream that describes the inter-channel relationships present in the original M-channel signal (inter-channel correlation and level differences) in the time-frequency domain. Because the downmix signal comprises fewer than M audio channels and the spatial cue data rate is small compared to the audio signal data rate, this coding approach reduces the data rate significantly. Additionally, the downmix format may be chosen to facilitate backward compatibility with legacy equipment.
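The SAC principle above can be illustrated with a deliberately simplified sketch. Real SAC/BCC systems operate per time-frequency tile and also carry inter-channel correlation cues; the toy version below (all names illustrative) keeps only one level cue per channel per block, to show why the cue data rate is small compared to the audio data rate.

```python
def sac_encode(channels):
    """Downmix M channels to mono and keep per-channel relative level cues.

    channels: list of M channel signals (lists of samples, equal length).
    Returns (downmix, cues) where cues[m] is channel m's share of total energy.
    """
    n = len(channels[0])
    downmix = [sum(ch[i] for ch in channels) for i in range(n)]
    energies = [sum(s * s for s in ch) for ch in channels]
    total = sum(energies) or 1.0
    cues = [e / total for e in energies]  # M scalars vs. M full signals
    return downmix, cues

def sac_decode(downmix, cues):
    """Redistribute the downmix across M channels using the level cues."""
    gains = [c ** 0.5 for c in cues]      # amplitude gain from energy share
    norm = sum(gains) or 1.0
    return [[g / norm * s for s in downmix] for g in gains]
```

The cue stream here is M floating-point values per block, negligible next to the samples themselves, which is the data-rate argument made in the text.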
In a variant of this approach, called Spatial Audio Scene Coding (SASC) as described in U.S. Patent Application No. 2007/0269063, the time-frequency spatial cue data transmitted to the decoder are format independent. This enables spatial reproduction in any target spatial audio format, while retaining the ability to carry a backward-compatible downmix signal in the encoded soundtrack data stream. However, in this approach, the encoded soundtrack data does not define separable audio objects. In most recordings, multiple sound sources located at different positions in the sound scene are concurrent in the time-frequency domain. In this case, the spatial audio decoder is not able to separate their contributions in the downmix audio signal. As a result, the spatial fidelity of the audio reproduction may be compromised by spatial localization errors.
MPEG Spatial Audio Object Coding (SAOC) is similar to MPEG-Surround in that the encoded soundtrack data stream includes a backward-compatible downmix audio signal along with a time-frequency cue data stream. SAOC is a multiple object coding technique designed to transmit a number M of audio objects in a mono or two-channel downmix audio signal. The SAOC cue data stream transmitted along with the SAOC downmix signal includes time-frequency object mix cues that describe, in each frequency sub-band, the mixing coefficient applied to each object input signal in each channel of the mono or two-channel downmix signal. Additionally, the SAOC cue data stream includes frequency domain object separation cues that allow the audio objects to be post-processed individually at the decoder side. The object post-processing functions provided in the SAOC decoder mimic the capabilities of an object-based spatial audio scene rendering system and support multiple target spatial audio formats.
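The object mix cues described above can be sketched as a per-sub-band mixing matrix. This is an illustrative toy, not the SAOC bitstream syntax: here an object "signal" is simply a list of sub-band values, and the cue layout is an assumption made for clarity.

```python
def saoc_style_downmix(objects, mix_cues, num_channels=2):
    """Form a mono or two-channel downmix from M object signals.

    objects:     list of M object signals, each a list of sub-band values
    mix_cues:    mix_cues[m][b][c] = gain of object m, sub-band b,
                 into downmix channel c (the time-frequency object mix cues)
    num_channels: 1 (mono) or 2 (stereo) downmix channels
    """
    num_bands = len(objects[0])
    down = [[0.0] * num_bands for _ in range(num_channels)]
    for m, obj in enumerate(objects):
        for b, val in enumerate(obj):
            for c in range(num_channels):
                down[c][b] += mix_cues[m][b][c] * val
    return down
```

A decoder knowing the same cues can approximately invert this mixing per sub-band, which is what enables the per-object post-processing mentioned above, within the separation limits discussed below.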
SAOC provides a method for low-bit-rate transmission and computationally efficient spatial audio rendering of multiple audio object signals along with an object-based and format-independent three-dimensional audio scene description. However, the legacy compatibility of a SAOC encoded stream is limited to two-channel stereo reproduction of the SAOC audio downmix signal, and is therefore not suitable for extending existing multi-channel surround-sound coding formats. Furthermore, it should be noted that the SAOC downmix signal is not perceptually representative of the rendered audio scene if the rendering operations applied in the SAOC decoder on the audio object signals include certain types of post-processing effects, such as artificial reverberation (because these effects would be audible in the rendered scene but are not simultaneously incorporated in the downmix signal, which contains the unprocessed object signals).
Additionally, SAOC suffers from the same limitation as the SAC and SASC techniques: the SAOC decoder cannot fully separate in the downmix signal the audio object signals that are concurrent in the time-frequency domain. For example, extensive amplification or attenuation of an object by the SAOC decoder typically yields an unacceptable decrease in the audio quality of the rendered scene.
A spatially encoded soundtrack may be produced by two complementary approaches: (a) recording an existing sound scene with a coincident or closely-spaced microphone system (placed essentially at or near the virtual position of the listener within the scene) or (b) synthesizing a virtual sound scene.
The first approach, which uses traditional 3D binaural audio recording, arguably creates as close to the ‘you are there’ experience as possible through the use of ‘dummy head’ microphones. In this case, a sound scene is captured live, generally using an acoustic mannequin with microphones placed at the ears. Binaural reproduction, where the recorded audio is replayed at the ears over headphones, is then used to recreate the original spatial perception. One of the limitations of traditional dummy head recordings is that they can only capture live events and only from the dummy's perspective and head orientation.
With the second approach, digital signal processing (DSP) techniques have been developed to emulate binaural listening by sampling a selection of head related transfer functions (HRTFs) around a dummy head (or a human head with probe microphones inserted into the ear canal) and interpolating those measurements to approximate an HRTF that would have been measured for any location in-between. The most common technique is to convert all measured ipsilateral and contralateral HRTFs to minimum phase and to perform a linear interpolation between them to derive an HRTF pair. The HRTF pair, combined with an appropriate interaural time delay (ITD), represents the HRTFs for the desired synthetic location. This interpolation is generally performed in the time domain, typically as a linear combination of time-domain filters. The interpolation may also include frequency-domain analysis (e.g., analysis performed on one or more frequency subbands), followed by a linear interpolation between or among the frequency-domain analysis outputs. Time-domain analysis may be more computationally efficient, whereas frequency-domain analysis may be more accurate. In some embodiments, the interpolation may include a combination of time-domain and frequency-domain analysis, such as time-frequency analysis. Distance cues may be simulated by reducing the gain of the source in relation to the emulated distance.
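The time-domain variant of this interpolation can be sketched as follows. The filter taps and integer-sample ITD below are placeholder values, assuming minimum-phase head related impulse responses (HRIRs) measured at two neighboring directions; a production renderer would use measured HRIR databases and fractional-delay ITDs.

```python
def interpolate_hrir(hrir_a, hrir_b, frac):
    """Linearly cross-fade two minimum-phase HRIRs (0 <= frac <= 1).

    With both filters reduced to minimum phase, a tap-by-tap linear
    combination approximates the HRIR at an intermediate direction.
    """
    return [(1.0 - frac) * a + frac * b for a, b in zip(hrir_a, hrir_b)]

def apply_itd(signal, delay_samples):
    """Delay the contralateral ear's signal by an integer ITD (toy model).

    The ITD is applied separately from the minimum-phase filters, as the
    phase information was removed before interpolation.
    """
    if delay_samples == 0:
        return list(signal)
    return [0.0] * delay_samples + signal[:-delay_samples]
```

Per source, the renderer would convolve the input with the interpolated ipsilateral and contralateral HRIRs and delay the contralateral path by the ITD for the desired synthetic location.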
This approach has been used for emulating sound sources in the far-field, where interaural HRTF differences change negligibly with distance. However, as the source gets closer to the head (the “near-field”), the size of the head becomes significant relative to the distance of the sound source. The location of this transition varies with frequency, but by convention a source more than about 1 meter from the head is considered to be in the far-field. As the sound source moves further into the listener's near-field, interaural HRTF changes become significant, especially at lower frequencies.
Some HRTF-based rendering engines use a database of far-field HRTF measurements, all of which are measured at a constant radial distance from the listener. As a result, it is difficult to accurately emulate the changing frequency-dependent HRTF cues of a sound source that is much closer to the head than the measurement distance of the far-field HRTF database.
Many modern 3D audio spatialization products ignore the near-field, as the complexities of modeling near-field HRTFs have traditionally been too costly and near-field acoustic events have not been common in typical interactive audio simulations. However, with the advent of virtual reality (VR) and augmented reality (AR) applications, virtual objects often occur close to the user's head, and more accurate audio simulation of such objects and events has become a necessity.
Previously known HRTF-based 3D audio synthesis models make use of a single set of HRTF pairs (i.e., ipsilateral and contralateral) that are measured at a fixed distance around a listener. These measurements usually take place in the far-field, where the HRTF does not change significantly with increasing distance. As a result, sound sources that are farther away can be emulated by filtering the source through an appropriate pair of far-field HRTF filters and scaling the resulting signal according to frequency-independent gains that emulate energy loss with distance (e.g., the inverse-square law).
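The frequency-independent distance scaling mentioned above amounts to a simple gain law. The sketch below is an assumption-laden toy: it clamps sources inside the (hypothetical) 1-meter measurement radius to unity gain rather than boosting them, a common practical choice, and it models amplitude falling as 1/r, consistent with the inverse-square law for energy.

```python
def far_field_distance_gain(distance_m, ref_m=1.0):
    """Frequency-independent gain emulating energy loss with distance.

    ref_m is the (assumed) radius at which the far-field HRTFs were
    measured. Amplitude falls as ref_m / distance beyond that radius
    (inverse-square law in energy); inside it, the gain is clamped to 1.0,
    since a single far-field HRTF set cannot represent near-field cues.
    """
    return ref_m / max(distance_m, ref_m)
```

Doubling the distance thus halves the amplitude (a 6 dB drop), while the HRTF pair itself stays fixed for a given angle of incidence, which is exactly the limitation the near-field discussion below addresses.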
However, as sounds get closer and closer to the head, at the same angle of incidence, the HRTF frequency response can change significantly relative to each ear and can no longer be effectively emulated with far-field measurements. This scenario, emulating the sound of objects as they get closer to the head, is of particular interest for newer applications such as virtual reality, where closer examination of and interaction with objects and avatars will become more prevalent.
Transmission of full 3D objects (e.g., audio and position metadata) has been used to enable head tracking and interaction, but such an approach requires multiple audio buffers per source and grows greatly in complexity as more sources are used. This approach may also require dynamic source management. Such methods cannot be easily integrated into existing audio formats. Multichannel mixes have a fixed overhead for a fixed number of channels, but typically require high channel counts to achieve sufficient spatial resolution. Existing scene encodings such as matrix encoding or Ambisonics have lower channel counts, but do not include a mechanism to indicate the desired depth or distance of the audio signals from the listener.