Capture of audio content, often in conjunction with video, has become increasingly common as dedicated recording devices have become more portable and affordable and as recording capabilities have become more pervasive in everyday devices such as smartphones. The quality of video capture has consistently increased and has outpaced the quality of audio capture. Video capture on modern mobile devices is typically high-resolution and DSP-intensive, but accompanying audio content is generally captured in mono with low fidelity and little additional processing.
In order to capture spatial cues, many existing audio recording techniques employ at least two microphones. As a general rule, recording a 360-degree horizontal surround audio scene requires at least 3 audio channels, whereas recording a three-dimensional audio scene requires at least 4 audio channels. While multichannel audio capture is used for immersive audio recording, the more pervasive consumer audio delivery technologies and distribution frameworks currently available are limited to transmitting two-channel audio. In standard two-channel stereo reproduction, the stored or transmitted left and right audio channels are intended to be directly played back respectively on left and right loudspeakers or headphones.
For playback of immersive audio recordings, it may be necessary to render the recorded spatial audio information in a variety of playback configurations. These playback configurations include headphones, frontal sound-bar loudspeakers, frontal discrete loudspeaker pairs, 5.1 horizontal surround loudspeaker arrays, and three-dimensional loudspeaker arrays comprising height channels. Irrespective of the playback configuration, it is desirable to reproduce for the listener a spatial audio scene that is a substantially accurate representation of the captured audio scene. Additionally, it is advantageous to provide an audio storage or transmission format that is agnostic to the particular playback configuration.
One such configuration-agnostic format is the B-format. The B-format includes the following signals: (1) W—a pressure signal corresponding to the output of an omnidirectional microphone; (2) X—front-to-back directional information corresponding to the output of a forward-pointing “figure-of-eight” microphone; (3) Y—side-to-side directional information corresponding to the output of a leftward-pointing “figure-of-eight” microphone; and (4) Z—up-to-down directional information corresponding to the output of an upward-pointing “figure-of-eight” microphone.
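As an illustration of the four signals listed above, the following minimal sketch encodes a single sample of a point source into B-format. It assumes a convention the text does not specify: azimuth measured counterclockwise from the front (positive toward the left), elevation positive upward, and W scaled by 1/√2, as in one commonly used B-format convention. The function name `encode_b_format` is hypothetical.

```python
import math

def encode_b_format(sample, azimuth_deg, elevation_deg=0.0):
    """Encode one source sample into (W, X, Y, Z).

    Assumed convention (not taken from the text): azimuth is positive
    toward the left, elevation is positive upward, W carries 1/sqrt(2) gain.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)                  # omnidirectional pressure
    x = sample * math.cos(az) * math.cos(el)     # front-back figure-of-eight
    y = sample * math.sin(az) * math.cos(el)     # left-right figure-of-eight
    z = sample * math.sin(el)                    # up-down figure-of-eight
    return w, x, y, z

# A unit-amplitude source directly to the left of the array.
w, x, y, z = encode_b_format(1.0, azimuth_deg=90.0)
```

Under these assumptions, a source directly to the left contributes only to W and Y, while a source directly overhead contributes only to W and Z.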
A B-format audio signal may be spatially decoded for immersive audio playback on headphones or flexible loudspeaker configurations. A B-format signal can be obtained directly or derived from standard near-coincident microphone arrangements, which include omnidirectional and/or bi-directional microphones, or uni-directional microphones. In particular, the 4-channel A-format is obtained from a tetrahedral arrangement of cardioid microphones and may be converted to the B-format via a 4×4 linear matrix. Additionally, the 4-channel B-format may be converted to a two-channel Ambisonic UHJ format that is compatible with standard 2-channel stereo reproduction. However, the two-channel Ambisonic UHJ format is not sufficient to enable faithful three-dimensional immersive audio or horizontal surround reproduction.
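The A-format to B-format conversion mentioned above can be sketched as a sum-and-difference matrix. The capsule ordering (front-left-up, front-right-down, back-left-down, back-right-up) and the 0.5 scaling are assumptions chosen for illustration; practical converters typically also apply frequency-dependent correction filters, which are omitted here. The names `A_TO_B` and `a_to_b` are hypothetical.

```python
# Assumed capsule order: [FLU, FRD, BLD, BRU]
# (front-left-up, front-right-down, back-left-down, back-right-up).
A_TO_B = [
    [0.5,  0.5,  0.5,  0.5],   # W: sum of all capsules
    [0.5,  0.5, -0.5, -0.5],   # X: front minus back
    [0.5, -0.5,  0.5, -0.5],   # Y: left minus right
    [0.5, -0.5, -0.5,  0.5],   # Z: up minus down
]

def a_to_b(a_samples):
    """Apply the 4x4 matrix to one set of four A-format samples."""
    return [sum(c * s for c, s in zip(row, a_samples)) for row in A_TO_B]

# Identical signals on all four capsules carry no directional information,
# so only the W component is non-zero.
b = a_to_b([1.0, 1.0, 1.0, 1.0])
```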
Other approaches have been proposed for encoding a plurality of audio channels representing a surround or immersive sound scene into a reduced-data format for storage and/or distribution that can subsequently be decoded to enable a faithful reproduction of the original audio scene. One such approach is time-domain phase-amplitude matrix encoding/decoding. The encoder in this approach linearly combines the input channels with specified amplitude and phase relationships into a smaller set of coded channels. The decoder combines the encoded channels with specified amplitudes and phases to attempt to recover the original channels. However, as a consequence of the intermediate channel-count reduction, there can be a loss in spatial localization fidelity of the reproduced audio scene compared to the original audio scene.
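The loss of localization fidelity can be seen in a minimal sketch of a four-channel to two-channel matrix encoder with a passive decoder. The coefficients below are illustrative assumptions rather than those of any particular system, and the phase relationship on the surround channel is simplified to a polarity inversion instead of a 90-degree phase shift. The function names are hypothetical.

```python
import math

G = math.sqrt(0.5)  # about -3 dB, an assumed mixing gain

def matrix_encode(left, right, center, surround):
    """Fold four input samples into two coded samples (Lt, Rt)."""
    lt = left + G * center + G * surround
    rt = right + G * center - G * surround   # surround in opposite polarity
    return lt, rt

def passive_decode(lt, rt):
    """Recover four channel estimates from Lt/Rt by sums and differences."""
    left = lt
    right = rt
    center = G * (lt + rt)
    surround = G * (lt - rt)
    return left, right, center, surround

# A source present only in the center channel.
lt, rt = matrix_encode(0.0, 0.0, 1.0, 0.0)
left, right, center, surround = passive_decode(lt, rt)
```

In this sketch the center-only input is recovered at full level in the decoded center channel, but it also leaks into the decoded left and right channels at about 3 dB below that level, which is the kind of crosstalk that degrades localization.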
An approach for improving the spatial localization fidelity of the reproduced audio scene is frequency-domain phase-amplitude matrix decoding, which decomposes the matrix-encoded two-channel audio signal into a time-frequency representation. This approach then separately spatializes the respective time-frequency components. The time-frequency decomposition provides a high-resolution representation of the input audio signals where individual sources are represented more discretely than in the time domain. As a result, this approach can improve the spatial fidelity of the subsequently decoded signal, when compared to time-domain matrix decoding.
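The benefit of treating time-frequency components separately can be sketched with a single analysis frame. The example below is a simplified illustration, not a complete decoder: it uses a naive DFT of one frame instead of a full short-time transform, and it estimates a per-bin panning index from channel magnitudes only. The panning index definition, the energy floor, and the function names are assumptions made for this sketch.

```python
import cmath
import math

def dft(frame):
    """Naive DFT returning bins 0..N/2 of a real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]

def per_bin_pan(left, right, floor=1e-6):
    """Return {bin: pan} with pan in [-1, 1] (-1 = left, +1 = right).

    Bins whose combined magnitude is below `floor` are skipped.
    """
    pans = {}
    for k, (bl, br) in enumerate(zip(dft(left), dft(right))):
        ml, mr = abs(bl), abs(br)
        if ml + mr > floor:
            pans[k] = (mr - ml) / (mr + ml)
    return pans

# Two sources that overlap in time but occupy different frequency bins:
# source A (bin 3) is hard left, source B (bin 5) is slightly right.
N = 32
tone_a = [math.cos(2 * math.pi * 3 * t / N) for t in range(N)]
tone_b = [math.cos(2 * math.pi * 5 * t / N) for t in range(N)]
left = [1.0 * a + 0.6 * b for a, b in zip(tone_a, tone_b)]
right = [0.0 * a + 0.8 * b for a, b in zip(tone_a, tone_b)]
pans = per_bin_pan(left, right)
```

Although the two sources are mixed in every time-domain sample, each occupies its own frequency bin here, so a distinct panning estimate is obtained for each and the sources could be spatialized independently.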
Another approach to data reduction for multichannel audio representation is spatial audio coding. In this approach the input channels are combined into a reduced-channel format (potentially even mono) and some side information about the spatial characteristics of the audio scene is also included. The parameters in the side information can be used to spatially decode the reduced-channel format into a multichannel signal that faithfully approximates the original audio scene.
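A minimal sketch of this idea for a two-channel input follows. It assumes a single side-information parameter per frame, namely the fraction of the frame energy carried by the left channel; practical spatial audio coders use richer parameter sets computed per frequency band. The function names and the choice of parameter are hypothetical.

```python
import math

def spatial_encode(left, right):
    """Downmix a stereo frame to mono plus one spatial parameter.

    The parameter is the left channel's share of the frame energy
    (an assumed, deliberately simple form of side information).
    """
    mono = [(l + r) / 2.0 for l, r in zip(left, right)]
    e_left = sum(l * l for l in left)
    e_right = sum(r * r for r in right)
    total = e_left + e_right
    left_fraction = e_left / total if total > 0.0 else 0.5
    return mono, left_fraction

def spatial_decode(mono, left_fraction):
    """Redistribute the mono frame according to the transmitted parameter."""
    g_left = math.sqrt(2.0 * left_fraction)
    g_right = math.sqrt(2.0 * (1.0 - left_fraction))
    return [g_left * m for m in mono], [g_right * m for m in mono]

# A source panned toward the left with gains 0.8 (left) and 0.6 (right).
source = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
left_in = [0.8 * s for s in source]
right_in = [0.6 * s for s in source]
mono, left_fraction = spatial_encode(left_in, right_in)
left_out, right_out = spatial_decode(mono, left_fraction)
```

In this sketch the decoded channels reproduce the original left-to-right level ratio; absolute level is not preserved because no overall gain parameter is transmitted.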
The phase-amplitude matrix encoding and spatial audio coding methods described above are often concerned with encoding multichannel audio tracks created in recording studios. Moreover, they are sometimes subject to a requirement that the reduced-channel encoded audio signal be a viable listening alternative to the fully decoded version, so that direct playback is an option and a custom decoder is not required.
Sound field coding is an endeavor similar to spatial audio coding that focuses on capturing and encoding a “live” audio scene and reproducing that audio scene accurately over a playback system. Existing approaches to sound field coding depend on specific microphone configurations to capture directional sources accurately. Moreover, they rely on various analysis techniques to appropriately treat directional and diffuse sources. However, the microphone configurations required for sound field coding are often impractical for consumer devices. Modern consumer devices typically have significant design constraints imposed on the number and positions of microphones, which can result in configurations that are mismatched with the requirements of current sound field encoding methods. The sound field analysis methods are also often computationally intensive and lack the scalability needed to support lower-complexity realizations.