A typical approach to stereo and surround audio transmission is loudspeaker-channel-based. In this approach, the stereo, horizontal surround, or 3D surround content is produced, encoded, and transmitted as a group of individual channels to be decoded and reproduced at the receiver end. A straightforward method is to encode each of the channels individually, for example using MPEG Advanced Audio Coding (AAC), which is a common approach in commercial systems. More recently, bit-rate-efficient multi-channel audio coding systems have emerged, such as MPEG Surround and the system in MPEG-H Part 3: 3D Audio. They employ methods to combine the audio channels into a smaller number of audio channels for transmission. Alongside the smaller number of audio channels, dynamic spatial metadata is transmitted, which carries the information needed to re-synthesize a multi-channel audio signal having a close perceptual resemblance to the original multi-channel signal. Such audio coding can be referred to as parametric multi-channel audio coding.
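As a non-limiting illustration of the parametric principle, the following minimal sketch combines two channels of one frequency band into a single downmix channel and one metadata value, and re-synthesizes two channels from them. It is not the MPEG Surround or MPEG-H algorithm. The function names `encode_band` and `decode_band`, the use of an inter-channel level difference as the only parameter, and the plain averaging downmix are all assumptions made for illustration.

```python
import math

def encode_band(left, right):
    """Downmix one band of a stereo frame and compute a level-difference parameter.

    left, right: equal-length lists of samples for one frequency band and frame.
    Returns (downmix, ild_db). Hypothetical helper, for illustration only.
    """
    downmix = [0.5 * (l + r) for l, r in zip(left, right)]
    e_left = sum(l * l for l in left)
    e_right = sum(r * r for r in right)
    eps = 1e-12  # avoids division by zero and log of zero for silent bands
    ild_db = 10.0 * math.log10((e_left + eps) / (e_right + eps))
    return downmix, ild_db

def decode_band(downmix, ild_db):
    """Re-synthesize two channels whose energy ratio matches the transmitted parameter."""
    ratio = 10.0 ** (ild_db / 10.0)  # left/right energy ratio
    g_left = math.sqrt(2.0 * ratio / (1.0 + ratio))
    g_right = math.sqrt(2.0 / (1.0 + ratio))
    return [g_left * s for s in downmix], [g_right * s for s in downmix]

if __name__ == "__main__":
    mono, ild = encode_band([1.0, -1.0, 1.0], [0.5, -0.5, 0.5])
    out_left, out_right = decode_band(mono, ild)
    print(round(ild, 2), out_left, out_right)
```

In this sketch only one audio channel and one parameter per band are transmitted instead of two audio channels. The decoded channels reproduce the original level relationship but not the original waveforms, which reflects the perceptual rather than waveform-exact aim of parametric coding.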
Some parametric spatial audio coding systems, such as MPEG-H Part 3: 3D Audio, also provide an option to transmit audio objects, which are audio channels with a potentially dynamically changing location. The audio objects can be reproduced, for example, using amplitude panning techniques at the receiver end. The aforementioned techniques can be considered well suited for professional multi-channel audio productions.
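As a non-limiting illustration of amplitude panning, the following minimal sketch computes the gains for an audio object positioned between two loudspeakers using the stereophonic tangent law. The function name `tangent_law_gains`, the symmetric two-loudspeaker layout, and the sign convention (positive angles towards the left loudspeaker) are assumptions made for illustration; practical renderers typically use more general methods for arbitrary loudspeaker layouts.

```python
import math

def tangent_law_gains(source_deg, speaker_deg=30.0):
    """Return (g_left, g_right) for a source between loudspeakers at +/- speaker_deg.

    Uses the tangent law: (gL - gR) / (gL + gR) = tan(source) / tan(speaker).
    Gains are normalized so that gL^2 + gR^2 = 1 (constant energy).
    """
    if abs(source_deg) > speaker_deg:
        raise ValueError("source must lie between the two loudspeakers")
    ratio = math.tan(math.radians(source_deg)) / math.tan(math.radians(speaker_deg))
    g_left = 1.0 + ratio
    g_right = 1.0 - ratio
    norm = math.hypot(g_left, g_right)
    return g_left / norm, g_right / norm

if __name__ == "__main__":
    # An object moving from the center towards the left loudspeaker.
    for angle in (0.0, 10.0, 20.0, 30.0):
        print(angle, tangent_law_gains(angle))
```

A dynamically moving object is reproduced by recomputing the gains as its transmitted location changes and applying them to the object signal.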
The use case of virtual reality (VR) audio (a definition that here includes array-captured spatial audio and augmented reality audio) is typically fundamentally different. Specifically, the audio content is typically retrieved fully or partly from an array of microphones integrated into the presence capture device, such as a spherical multi-lens camera, or from an array near the camera. The audio capture techniques in this context differ from classical recording techniques. For example, in a manner similar to radar or radio communication, array signal processing techniques can be applied to audio signals to detect information about the sound scene that has perceptual significance. This includes the direction(s) of the arriving sounds (sometimes coinciding with the directions of the sources in the scene) and the ratios between the directional energy and other kinds of sound energy, such as background ambience, reverberation, noise, or similar. Such or similar parameters we refer to as dynamic spatial audio capture (SPAC) metadata. Several methods of array signal processing are known for estimating SPAC metadata. In contrast to classical loudspeaker-channel-based systems, in this case the direction can be any spatial direction, and it need not correspond to any particular loudspeaker setup. A digital signal processing (DSP) system can be implemented to use this metadata and the microphone signals to synthesize the spatial sound in a perceptually accurate manner for any surround or 3D surround setup, or for headphones by applying binaural processing techniques. Several high-quality options exist for DSP systems to perform such rendering. We refer to such a process as SPAC rendering. It is to be noted that the SPAC metadata estimation, the SPAC rendering, and the efficient multi-channel audio coding are always performed in frequency bands, because human spatial hearing is known to decode the spatial image based on spatial information in frequency bands.
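As a non-limiting illustration of what SPAC metadata for one frequency band may contain, the following minimal sketch stores a direction and an energy ratio and uses the ratio to split the band energy into a directional part and a non-directional part. The class name `BandMetadata`, its field names, and the choice of a direct-to-total energy ratio in the range 0 to 1 are assumptions made for illustration and do not describe a defined format.

```python
from dataclasses import dataclass

@dataclass
class BandMetadata:
    """Hypothetical SPAC metadata for one frequency band in one time frame."""
    azimuth_deg: float      # direction of the arriving sound, horizontal angle
    elevation_deg: float    # direction of the arriving sound, vertical angle
    direct_to_total: float  # directional energy divided by total energy, 0..1

def split_band_energy(band_energy, meta):
    """Split the band energy into (directional, non-directional) parts."""
    if not 0.0 <= meta.direct_to_total <= 1.0:
        raise ValueError("direct_to_total must lie in [0, 1]")
    directional = meta.direct_to_total * band_energy
    ambient = band_energy - directional  # ambience, reverberation, noise
    return directional, ambient

if __name__ == "__main__":
    meta = BandMetadata(azimuth_deg=45.0, elevation_deg=10.0, direct_to_total=0.75)
    print(split_band_energy(2.0, meta))
```

A renderer following this sketch would place the directional part at the stored direction, for example with the nearest HRTF pair for headphones, and reproduce the remaining part as spatially diffuse sound.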
A traditional and straightforward approach for SPAC audio transmission would be to perform the SPAC rendering to produce a 3D surround mix, and to apply multi-channel audio coding techniques to transmit the audio. However, this approach is not optimal. Firstly, for headphone binaural rendering, applying an intermediate loudspeaker layout inevitably means using amplitude panning techniques, because the sources do not in general coincide with the directions of the loudspeakers. With headphone binaural use, which is the main use case of VR audio, the decoding need not be restricted in such a way: a sound can be decoded at any direction using a high-resolution set of head-related transfer functions (HRTFs). Amplitude-panned sources are perceived as less point-like, and often also as spectrally imbalanced, when compared to direct HRTF rendering. Secondly, achieving sufficient reproduction in 3D using the intermediate loudspeaker representation requires transmitting a high number of audio channels. Modern multi-channel audio coding techniques mitigate this effect by combining the audio channels; however, applying such methods at a minimum adds layers of unnecessary audio processing steps, which at least reduces the computational efficiency and potentially also the audio fidelity.
The Nokia VR Audio format, for which the methods described herein are relevant, is defined specifically for VR use. The SPAC metadata itself is transmitted alongside a set of audio channels obtained from microphone signals. The SPAC decoding takes place at the receiver end for the given setup, whether loudspeakers or headphones. Thus, the audio can be decoded as point-like sources at any direction, and the computational overhead is minimal. Furthermore, the format is defined to support various microphone-array types supporting different levels of spatial analysis. For example, with some array processing techniques one can accurately analyse a single prominent spectrally overlapping source, while other techniques can detect two or more, which can provide a perceptual benefit in complex sound scenes. Thus, the VR audio format is defined to be flexible with respect to the number of simultaneously analysed directions. This feature of Nokia's VR audio format is the most relevant for the methods described herein. For completeness, the VR audio format also provides support for transmission of other signal types, such as audio-object signals and loudspeaker signals, as additional tracks with separate audio-channel-based spatial metadata.
The present methods focus on reducing or limiting the number of transmitted audio channels in the context of VR audio transmission. As a key feature, the present methods take advantage of the aforementioned flexible definition of the spatial audio capture (SPAC) metadata in the Nokia VR audio format. As an overview, the present methods allow additional audio channel(s), such as audio object signals, to be mixed into the SPAC signals in such a way that the number of channels is not increased. Nevertheless, the processing is formulated such that the spatial fidelity is well preserved. This property is obtained by taking advantage of the flexible definition of the number of simultaneous SPAC directions. The added signals add layers to the SPAC metadata as simultaneous directions that are potentially different from the original existing SPAC directions. As a result, the merged SPAC stream contains both the original microphone-captured audio signals and the mixed-in audio signals, the spatial metadata is expanded to cover both, and the merged stream can be decoded at the receiver side with high spatial fidelity.
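As a non-limiting illustration of the merging principle described above, the following minimal sketch mixes the energy of one audio object into one SPAC frequency band and appends the object direction as an additional simultaneous direction, so that the number of audio channels is unchanged. It is not the processing defined by the present methods. The names `Direction`, `SpacBand`, and `mix_in_object`, the energy-ratio bookkeeping, and the assumption that the object signal is uncorrelated with the captured signal (so that energies add) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Direction:
    azimuth_deg: float
    elevation_deg: float
    energy_ratio: float  # this direction's share of the band energy, 0..1

@dataclass
class SpacBand:
    energy: float                # total energy of the band in the audio channel(s)
    directions: List[Direction]  # simultaneous analysed directions

def mix_in_object(band, obj_energy, obj_azimuth_deg, obj_elevation_deg):
    """Return a merged band: same channel count, one more direction layer.

    Assumes the object is uncorrelated with the captured audio, so the
    merged band energy is the sum of the two energies.
    """
    total = band.energy + obj_energy
    if total <= 0.0:
        return SpacBand(0.0, list(band.directions))
    scale = band.energy / total  # existing ratios now refer to a larger total
    merged = [
        Direction(d.azimuth_deg, d.elevation_deg, d.energy_ratio * scale)
        for d in band.directions
    ]
    merged.append(Direction(obj_azimuth_deg, obj_elevation_deg, obj_energy / total))
    return SpacBand(total, merged)

if __name__ == "__main__":
    captured = SpacBand(energy=3.0, directions=[Direction(20.0, 0.0, 0.5)])
    print(mix_in_object(captured, obj_energy=1.0,
                        obj_azimuth_deg=-60.0, obj_elevation_deg=0.0))
```

In this sketch the energy of each original direction and of the non-directional remainder is preserved in absolute terms, while the object contributes a new direction that may differ from the captured ones.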
It is to be noted here that an existing technical alternative for merging the SPAC stream with other streams, for example an audio object, would be to process and add the audio-object signal to the microphone-array signals in such a way that it resembles a plane wave arriving at the array from the specified direction of the object. However, it is well known in the field of array signal processing that having simultaneous spectrally overlapping sources in the sound scene makes the spatial analysis less reliable, which typically affects the spatial precision of the decoded sound. As another alternative, the object signals could also be transmitted as additional audio tracks and rendered at the receiver end. This solution yields better reproduction quality, but also a higher number of transmitted channels, i.e., a higher bit rate and a higher computational load at the decoder.
Thus, there is a need to develop solutions which enable a high-quality rendering process without the significantly higher computational load, storage, and transmission capacity requirements found in the prior art.
The following gives the background for a use case in which SPAC and audio objects are used simultaneously. Capturing audio signals from multiple sources, and mixing those audio signals when the sources are moving in the spatial field, requires significant effort. For example, capturing and mixing an audio signal source, such as a speaker or artist within an audio environment such as a theatre or lecture hall, so that it can be presented to a listener and produce an effective audio atmosphere requires significant investment in equipment and training.
A commonly implemented system would be for a professional producer to utilize an external or close microphone, for example a lavalier microphone worn by the user or a microphone attached to a boom pole, to capture audio signals close to the speaker or other sources, and then to manually mix this captured audio signal with a suitable spatial (or environmental or audio field) audio signal such that the produced sound comes from an intended direction. As would be expected, manually positioning a sound source within the spatial audio field requires significant time and effort.
Modern array signal processing techniques have emerged that enable, instead of manual recording, an automated recording of spatial scenes and perceptually accurate reproduction using loudspeakers or headphones. However, in such recording it is often necessary to enhance the audio signals. For example, the audio signals may be enhanced to clarify information or to improve intelligibility. Thus, in a news broadcast, the end user may wish to hear the audio from the news reporter more clearly than any background ‘noise’.