1. Technical Field
The present invention relates to the processing of audio signals and, more particularly, to the encoding and reproduction of three-dimensional audio soundtracks.
2. Description of the Related Art
Spatial audio reproduction has interested audio engineers and the consumer electronics industry for several decades. Spatial sound reproduction requires a two-channel or multi-channel electro-acoustic system (loudspeakers or headphones) which must be configured according to the context of application (e.g. concert performance, motion picture theater, domestic hi-fi installation, computer display, individual head-mounted display), further described in Jot, Jean-Marc, “Real-time Spatial Processing of Sounds for Music, Multimedia and Interactive Human-Computer Interfaces,” IRCAM, 1 place Igor-Stravinsky 1997, [hereinafter (Jot, 1997)], herein incorporated by reference. In association with this audio playback system configuration, a suitable technique or format must be defined to encode directional localization cues in a multi-channel audio signal for transmission or storage.
A spatially encoded soundtrack may be produced by two complementary approaches:
(a) Recording an existing sound scene with a coincident or closely-spaced microphone system (placed essentially at or near the virtual position of the listener within the scene). This can be, e.g., a stereo microphone pair, a dummy head, or a Soundfield microphone. Such a sound pickup technique can simultaneously encode, with varying degrees of fidelity, the spatial auditory cues associated with each of the sound sources present in the recorded scene, as captured from a given position.
(b) Synthesizing a virtual sound scene. In this approach, the localization of each sound source and the room effect are artificially reconstructed by use of a signal processing system, which receives individual source signals and provides a parameter interface for describing the virtual sound scene. An example of such a system is a professional studio mixing console or digital audio workstation (DAW). The control parameters may include the position, orientation and directivity of each source, along with an acoustic characterization of the virtual room or space. An example of this approach is the post-processing of a multi-track recording using a mixing console and signal processing modules such as artificial reverberators as illustrated in FIG. 1A.
The development of audio recording and reproduction techniques for the motion picture and home video entertainment industry has resulted in the standardization of multi-channel “surround sound” recording formats (most notably the 5.1 and 7.1 formats). Surround sound formats presuppose that audio channel signals should be fed respectively to loudspeakers arranged in the horizontal plane around the listener in a prescribed geometrical layout, such as the “5.1” standard layout shown in FIG. 1B (where LF, CF, RF, RS, LS and SW respectively denote the left-front, center-front, right-front, right-surround, left-surround and subwoofer loudspeakers). This assumption intrinsically limits the ability to reliably and accurately encode and reproduce three-dimensional audio cues of natural sound fields, including the proximity of sound sources and their elevation above the horizontal plane, and the sense of immersion in the spatially diffuse components of the sound field such as room reverberation.
Various audio recording formats have been developed for encoding three-dimensional audio cues in a recording. These 3-D audio formats include Ambisonics and discrete multi-channel audio formats comprising elevated loudspeaker channels, such as the NHK 22.2 format illustrated in FIG. 1C. However, these spatial audio formats are incompatible with legacy consumer surround sound playback equipment: they require different loudspeaker layout geometries and different audio decoding technology. Incompatibility with legacy equipment and installations is a critical obstacle to the successful deployment of existing 3-D audio formats.
Multi-channel Audio Coding Formats
Various multi-channel digital audio formats, such as DTS-ES and DTS-HD from DTS, Inc. of Calabasas, Calif., address these problems by including in the soundtrack data stream a backward-compatible downmix that can be decoded by legacy decoders and reproduced on existing playback equipment, and a data stream extension, ignored by legacy decoders, that carries additional audio channels. A DTS-HD decoder can recover these additional channels, subtract their contribution from the backward-compatible downmix, and render them in a target spatial audio format different from the backward-compatible format, which can include elevated loudspeaker positions. In DTS-HD, the contributions of the additional channels to the backward-compatible mix and to the target spatial audio format are described by a set of mixing coefficients (one for each loudspeaker channel). The target spatial audio formats for which the soundtrack is intended must be specified at the encoding stage.
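The downmix-plus-extension idea described above can be sketched roughly as follows. This is a minimal illustration only, assuming a single additional channel, lossless transmission, and plain per-sample arithmetic; the function names `encode_downmix` and `decode_extended` and all coefficient values are hypothetical and do not represent the DTS-HD bitstream or decoder.

```python
def encode_downmix(base, extra, coeffs):
    """Mix one additional channel into each backward-compatible channel.

    base:   dict mapping channel name -> list of samples
    extra:  list of samples of the additional channel
    coeffs: dict mapping channel name -> mixing coefficient (assumed values)
    """
    return {
        name: [s + coeffs.get(name, 0.0) * e for s, e in zip(samples, extra)]
        for name, samples in base.items()
    }


def decode_extended(downmix, extra, coeffs):
    """Subtract the additional channel's contribution from the downmix.

    A legacy decoder would simply play `downmix` and ignore `extra`.
    """
    return {
        name: [d - coeffs.get(name, 0.0) * e for d, e in zip(samples, extra)]
        for name, samples in downmix.items()
    }


# Hypothetical two-channel base mix and one additional (e.g. elevated) channel.
base = {"L": [1.0, 2.0], "R": [3.0, 4.0]}
extra = [4.0, 8.0]
coeffs = {"L": 0.5, "R": 0.25}

downmix = encode_downmix(base, extra, coeffs)          # what legacy decoders play
recovered = decode_extended(downmix, extra, coeffs)    # what an extended decoder renders
```

In this sketch the extended decoder recovers the base channels only because it receives both the additional channel and the same coefficients used at the encoder, which is why the target formats must be fixed at the encoding stage.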
This approach allows for the encoding of a multi-channel audio soundtrack in the form of a data stream compatible with legacy surround sound decoders and one or several alternative target spatial audio formats also selected during the encoding/production stage. These alternative target formats may include formats suitable for the improved reproduction of three-dimensional audio cues. However, one limitation of this scheme is that encoding the same soundtrack for another target spatial audio format requires returning to the production facility in order to record and encode a new version of the soundtrack that is mixed for the new format.
Object-Based Audio Scene Coding
Object-based audio scene coding offers a general solution for soundtrack encoding independent from the target spatial audio format. An example of an object-based audio scene coding system is the MPEG-4 Advanced Audio Binary Format for Scenes (AABIFS). In this approach, each of the source signals is transmitted individually, along with a render cue data stream. This data stream carries time-varying values of the parameters of a spatial audio scene rendering system such as the one depicted in FIG. 1A. This set of parameters may be provided in the form of a format-independent audio scene description, such that the soundtrack may be rendered in any target spatial audio format by designing the rendering system according to this format. Each source signal, in combination with its associated render cues, defines an “audio object”. A significant advantage of this approach is that the renderer can implement the most accurate spatial audio synthesis technique available to render each audio object in any target spatial audio format selected at the reproduction end. Another advantage of object-based audio scene coding systems is that they allow for interactive modifications of the rendered audio scene at the decoding stage, including remixing, music re-interpretation (e.g. karaoke), or virtual navigation in the scene (e.g. gaming).
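To make the notion of a format-independent "audio object" more concrete, the sketch below pairs a mono source signal with a single render cue (an azimuth angle) and renders it to whichever horizontal loudspeaker layout is chosen at playback. It is an illustration under stated assumptions, not the MPEG-4 scene description: the class `AudioObject`, the functions `pan_gains` and `render`, and the use of simple pairwise constant-power panning are all hypothetical choices, and the layout is assumed to contain at least two loudspeakers at distinct angles.

```python
import math


class AudioObject:
    """A mono source signal plus one render cue (assumed: azimuth in degrees)."""

    def __init__(self, samples, azimuth_deg):
        self.samples = samples
        self.azimuth_deg = azimuth_deg


def pan_gains(azimuth_deg, speaker_azimuths_deg):
    """Constant-power gains for the two loudspeakers adjacent to the source."""
    # Visit loudspeakers in order of increasing angle around the circle.
    order = sorted(range(len(speaker_azimuths_deg)),
                   key=lambda i: speaker_azimuths_deg[i] % 360.0)
    az = azimuth_deg % 360.0
    gains = [0.0] * len(speaker_azimuths_deg)
    for k, i in enumerate(order):
        j = order[(k + 1) % len(order)]           # next loudspeaker, wrapping
        start = speaker_azimuths_deg[i] % 360.0
        span = (speaker_azimuths_deg[j] - speaker_azimuths_deg[i]) % 360.0 or 360.0
        offset = (az - start) % 360.0
        if offset <= span:                        # source lies between i and j
            frac = offset / span
            gains[i] = math.cos(frac * math.pi / 2)
            gains[j] = math.sin(frac * math.pi / 2)
            break
    return gains


def render(obj, speaker_azimuths_deg):
    """Return one list of samples per loudspeaker of the target layout."""
    gains = pan_gains(obj.azimuth_deg, speaker_azimuths_deg)
    return [[g * s for s in obj.samples] for g in gains]


# The same object rendered to two different target layouts at playback time.
voice = AudioObject(samples=[1.0, 0.5], azimuth_deg=0.0)
stereo_feeds = render(voice, [30.0, 330.0])
quad_feeds = render(voice, [45.0, 135.0, 225.0, 315.0])
```

The encoded object never refers to a loudspeaker layout; only the `render` call does, which is the property that makes the representation format independent.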
While object-based audio scene coding enables format-independent soundtrack encoding and reproduction, this approach presents three major limitations: (1) it is not compatible with legacy consumer surround sound systems; (2) it typically requires a computationally expensive decoding and rendering system; and (3) it requires a high transmission or storage data rate for carrying the multiple source signals separately.
Multi-Channel Spatial Audio Coding
The need for low-bit-rate transmission or storage of multi-channel audio signals has motivated the development of new frequency-domain Spatial Audio Coding (SAC) techniques, including Binaural Cue Coding (BCC) and MPEG-Surround. In an exemplary SAC technique, illustrated in FIG. 1D, an M-channel audio signal is encoded in the form of a downmix audio signal accompanied by a spatial cue data stream that describes, in the time-frequency domain, the inter-channel relationships present in the original M-channel signal (inter-channel correlation and level differences). Because the downmix signal comprises fewer than M audio channels and the spatial cue data rate is small compared to the audio signal data rate, this coding approach yields a significant overall data rate reduction. Additionally, the downmix format may be chosen to facilitate backward compatibility with legacy equipment.
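The cue extraction step described above can be illustrated with a small sketch for the simplest case, M = 2 with a mono downmix. It is a simplification under stated assumptions: a real SAC encoder computes these quantities per time-frequency tile after a filter bank, whereas this sketch treats one block of time-domain samples as a stand-in for a single tile. The function name `spatial_cues` and the small constant `eps` are hypothetical.

```python
import math


def spatial_cues(left, right):
    """Return (downmix, level_difference_db, correlation) for one block.

    The block stands in for one time-frequency tile of a two-channel signal.
    """
    eps = 1e-12  # guards against division by zero on silent blocks
    energy_l = sum(x * x for x in left)
    energy_r = sum(x * x for x in right)
    cross = sum(l * r for l, r in zip(left, right))

    # Mono downmix: the only audio actually transmitted.
    downmix = [0.5 * (l + r) for l, r in zip(left, right)]

    # Inter-channel level difference, in decibels.
    level_difference_db = 10.0 * math.log10((energy_l + eps) / (energy_r + eps))

    # Inter-channel correlation, normalized to the range [-1, 1].
    correlation = cross / math.sqrt((energy_l + eps) * (energy_r + eps))

    return downmix, level_difference_db, correlation


# Right channel is a half-amplitude copy of the left channel.
downmix, ild_db, icc = spatial_cues([1.0, -1.0, 1.0, -1.0],
                                    [0.5, -0.5, 0.5, -0.5])
```

Only two scalar cues per tile accompany the single downmix channel, which is where the data rate reduction comes from.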
In a variant of this approach, called Spatial Audio Scene Coding (SASC) as described in U.S. Patent Application No. 2007/0269063, the time-frequency spatial cue data transmitted to the decoder are format independent. This enables spatial reproduction in any target spatial audio format, while retaining the ability to carry a backward-compatible downmix signal in the encoded soundtrack data stream. However, in this approach, the encoded soundtrack data does not define separable audio objects. In most recordings, multiple sound sources located at different positions in the sound scene are concurrent in the time-frequency domain. In this case, the spatial audio decoder is not able to separate their contributions in the downmix audio signal. As a result, the spatial fidelity of the audio reproduction may be compromised by spatial localization errors.
Spatial Audio Object Coding
MPEG Spatial Audio Object Coding (SAOC) is similar to MPEG-Surround in that the encoded soundtrack data stream includes a backward-compatible downmix audio signal along with a time-frequency cue data stream. SAOC is a multiple object coding technique designed to transmit a number M of audio objects in a mono or two-channel downmix audio signal. The SAOC cue data stream transmitted along with the SAOC downmix signal includes time-frequency object mix cues that describe, in each frequency sub-band, the mixing coefficient applied to each object input signal in each channel of the mono or two-channel downmix signal. Additionally, the SAOC cue data stream includes frequency-domain object separation cues which allow the audio objects to be post-processed individually at the decoder side. The object post-processing functions provided in the SAOC decoder mimic the capabilities of an object-based spatial audio scene rendering system and support multiple target spatial audio formats.
SAOC provides a method for low-bit-rate transmission and computationally efficient spatial audio rendering of multiple audio object signals along with an object-based and format-independent three-dimensional audio scene description. However, the legacy compatibility of an SAOC-encoded stream is limited to two-channel stereo reproduction of the SAOC audio downmix signal, and SAOC is therefore not suitable for extending existing multi-channel surround-sound coding formats. Furthermore, it should be noted that the SAOC downmix signal is not perceptually representative of the rendered audio scene if the rendering operations applied in the SAOC decoder on the audio object signals include certain types of post-processing effects, such as artificial reverberation (because these effects would be audible in the rendered scene but are not simultaneously incorporated in the downmix signal, which contains the unprocessed object signals).
Additionally, SAOC suffers from the same limitation as the SAC and SASC techniques: the SAOC decoder cannot fully separate in the downmix signal the audio object signals that are concurrent in the time-frequency domain. For example, extensive amplification or attenuation of an object by the SAOC decoder typically yields an unacceptable decrease in the audio quality of the rendered scene.
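The separation limitation noted above can be demonstrated with a toy example. The sketch below is not the SAOC decoder: it assumes a mono downmix, treats one block of samples as a single time-frequency tile, and replaces the standardized separation cues with a single energy-ratio gain per object. The names `energy`, `mono_downmix`, and `estimate_object` are hypothetical.

```python
def energy(samples):
    """Energy of one block, standing in for one time-frequency tile."""
    return sum(x * x for x in samples)


def mono_downmix(objects):
    """Sum all object signals into a single downmix channel."""
    return [sum(column) for column in zip(*objects)]


def estimate_object(downmix, object_energies, index):
    """Estimate one object by scaling the downmix with an energy-ratio gain.

    The gain is the only side information used, so the estimate is always a
    scaled copy of the downmix.
    """
    total = sum(object_energies)
    gain = (object_energies[index] / total) ** 0.5 if total > 0 else 0.0
    return [gain * x for x in downmix]


# Case 1: only object A is active in the tile, so it is recovered exactly.
obj_a = [1.0, -1.0]
silent = [0.0, 0.0]
alone = estimate_object(mono_downmix([obj_a, silent]),
                        [energy(obj_a), energy(silent)], 0)

# Case 2: objects A and B overlap in the same tile. The estimate of A is a
# scaled copy of the mixture and therefore contains leakage from B.
obj_b = [1.0, 1.0]
overlapped = estimate_object(mono_downmix([obj_a, obj_b]),
                             [energy(obj_a), energy(obj_b)], 0)
```

Strongly amplifying or attenuating the overlapped estimate would also amplify or attenuate the leaked portion of the other object, which is consistent with the loss of quality described above.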
In view of the ever-increasing interest in and utilization of spatial audio reproduction in entertainment and communication, there is a need in the art for an improved three-dimensional audio soundtrack encoding method and associated spatial audio scene reproduction technique.