The present invention relates to audio processing and, in particular, to an apparatus and method for generating a merged audio data stream.
Audio processing and, in particular, spatial audio coding, is becoming increasingly important. Traditional spatial sound recording aims at capturing a sound field such that, at the reproduction side, a listener perceives the sound image as it was at the recording location. Different approaches to spatial sound recording and reproduction are known from the state of the art, which may be based on channel-based, object-based or parametric representations.
Channel-based representations represent the sound scene by means of N discrete audio signals meant to be played back by N loudspeakers arranged in a known setup, e.g. a 5.1 surround sound setup. The approach for spatial sound recording usually employs spaced, omnidirectional microphones, for example, in AB stereophony, or coincident directional microphones, for example, in intensity stereophony. Alternatively, more sophisticated microphones, such as a B-format microphone, may be employed, for example, in Ambisonics, see:    [1] Michael A. Gerzon. Ambisonics in multichannel broadcasting and video. J. Audio Eng. Soc, 33(11):859-871, 1985.
The desired loudspeaker signals for the known setup are derived directly from the recorded microphone signals and are then transmitted or stored discretely. A more efficient representation is obtained by applying audio coding to the discrete signals, which in some cases codes the information of different channels jointly for increased efficiency, for example in MPEG-Surround for 5.1, see:    [21] J. Herre, K. Kjörling, J. Breebaart, C. Faller, S. Disch, H. Purnhagen, J. Koppens, J. Hilpert, J. Roden, W. Oomen, K. Linzmeier, K. S. Chong: “MPEG Surround—The ISO/MPEG Standard for Efficient and Compatible Multichannel Audio Coding”, 122nd AES Convention, Vienna, Austria, 2007, Preprint 7084.
A major drawback of these techniques is that the sound scene cannot be modified once the loudspeaker signals have been computed.
Object-based representations are, for example, used in Spatial Audio Object Coding (SAOC), see    [25] Jeroen Breebaart, Jonas Engdegård, Cornelia Falch, Oliver Hellmuth, Johannes Hilpert, Andreas Hoelzer, Jeroen Koppens, Werner Oomen, Barbara Resch, Erik Schuijers, and Leonid Terentiev. Spatial audio object coding (SAOC): the upcoming MPEG standard on parametric object based audio coding. In Audio Engineering Society Convention 124, May 2008.
Object-based representations represent the sound scene with N discrete audio objects. This representation gives high flexibility at the reproduction side, since the sound scene can be manipulated by changing, e.g., the position and loudness of each object. While this representation may be readily available from, e.g., a multitrack recording, it is very difficult to obtain from a complex sound scene recorded with a few microphones (see, for example, [21]). In fact, the talkers (or other sound-emitting objects) first have to be localized and then extracted from the mixture, which might cause artifacts.
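The flexibility of an object-based representation can be illustrated with a minimal Python sketch. It is not taken from SAOC or any cited work: the `AudioObject` layout, the `pan` parameter and the constant-power stereo panning law are all assumptions chosen for brevity. Each object keeps its own signal, position and loudness until rendering, so either parameter can still be changed at the reproduction side.

```python
import math
from dataclasses import dataclass


@dataclass
class AudioObject:
    """Hypothetical audio object: a mono signal plus rendering metadata."""
    samples: list        # mono signal of the object
    pan: float           # position: -1.0 = full left, +1.0 = full right
    gain: float = 1.0    # loudness as a linear gain


def render_stereo(objects):
    """Mix all objects to two channels using constant-power panning."""
    length = max(len(o.samples) for o in objects)
    left = [0.0] * length
    right = [0.0] * length
    for o in objects:
        # Map pan in [-1, 1] to an angle in [0, pi/2].
        theta = (o.pan + 1.0) * math.pi / 4.0
        gain_left, gain_right = math.cos(theta), math.sin(theta)
        for n, s in enumerate(o.samples):
            left[n] += o.gain * gain_left * s
            right[n] += o.gain * gain_right * s
    return left, right
```

Changing `pan` or `gain` of one object and calling `render_stereo` again yields a modified scene without touching the other objects, which is not possible once discrete loudspeaker signals have been computed.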
Parametric representations often employ spatial microphones to determine one or more audio downmix signals together with spatial side information describing the spatial sound. An example is Directional Audio Coding (DirAC), as discussed in    [29] Ville Pulkki. Spatial sound reproduction with directional audio coding. J. Audio Eng. Soc, 55(6):503-516, June 2007.
The term “spatial microphone” refers to any apparatus for the acquisition of spatial sound capable of retrieving direction of arrival of sound (e.g. combination of directional microphones, microphone arrays, etc.).
The term “non-spatial microphone” refers to any apparatus that is not adapted for retrieving direction of arrival of sound, such as a single omnidirectional or directive microphone.
Another example is proposed in:    [4] C. Faller. Microphone front-ends for spatial audio coders. In Proc. of the AES 125th International Convention, San Francisco, October 2008.
In DirAC, the spatial cue information comprises the direction of arrival (DOA) of sound and the diffuseness of the sound field, both computed in a time-frequency domain. For sound reproduction, the audio playback signals can be derived from this parametric description. These techniques offer great flexibility at the reproduction side for three reasons: an arbitrary loudspeaker setup can be employed; the representation is flexible and compact, comprising only a mono downmix audio signal and side information; and the sound scene can easily be modified, for example, by acoustic zooming, directional filtering or scene merging.
However, these techniques are still limited in that the recorded spatial image is relative to the spatial microphone used. Therefore, the acoustic viewpoint cannot be varied and the listening position within the sound scene cannot be changed.
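As an illustration of such a parametric description, the following Python sketch stores one time-frequency bin as a mono downmix coefficient plus DOA and diffuseness, and renders it to an arbitrary loudspeaker layout. The `ParametricBin` layout, the function names and the rendering rule (direct part to the nearest loudspeaker, diffuse part spread equally) are simplifying assumptions, not the method of the cited DirAC publications.

```python
import math
from dataclasses import dataclass


@dataclass
class ParametricBin:
    """One time-frequency bin of a DirAC-style stream (hypothetical layout)."""
    downmix: complex      # mono downmix coefficient for this bin
    doa_azimuth: float    # direction of arrival in radians (2D only, for brevity)
    diffuseness: float    # 0.0 = fully directional, 1.0 = fully diffuse


def split_direct_diffuse(b):
    """Split the downmix into direct and diffuse parts.

    The two energies add up to the energy of the downmix coefficient.
    """
    psi = min(max(b.diffuseness, 0.0), 1.0)
    return math.sqrt(1.0 - psi) * b.downmix, math.sqrt(psi) * b.downmix


def angular_distance(a, b):
    """Smallest absolute angle between two azimuths, in radians."""
    d = a - b
    return abs(math.atan2(math.sin(d), math.cos(d)))


def render_bin(b, speaker_azimuths):
    """Render one bin to an arbitrary horizontal loudspeaker layout.

    Assumption: the direct part is sent to the nearest loudspeaker and the
    diffuse part is spread equally over all loudspeakers. A practical
    renderer would pan between loudspeakers and decorrelate the diffuse part.
    """
    direct, diffuse = split_direct_diffuse(b)
    n = len(speaker_azimuths)
    nearest = min(
        range(n),
        key=lambda i: angular_distance(b.doa_azimuth, speaker_azimuths[i]),
    )
    out = [diffuse / math.sqrt(n)] * n
    out[nearest] += direct
    return out
```

The same `ParametricBin` can be passed to `render_bin` with a two-loudspeaker or a four-loudspeaker list of azimuths, which reflects why the representation is independent of the playback setup.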
A virtual microphone approach is presented in    [22] Giovanni Del Galdo, Oliver Thiergart, Tobias Weller, and E. A. P. Habets. Generating virtual microphone signals using geometrical information gathered by distributed arrays. In Third Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA '11), Edinburgh, United Kingdom, May 2011.
which allows computing the output signals of an arbitrary spatial microphone virtually placed at will (i.e., with arbitrary position and orientation) in the environment. The flexibility characterizing the virtual microphone (VM) approach allows the sound scene to be virtually captured at will in a postprocessing step, but no sound field representation is made available that could be used to transmit, store and/or modify the sound scene efficiently. Moreover, only one source per time-frequency bin is assumed to be active, and therefore the approach cannot correctly describe the sound scene if two or more sources are active in the same time-frequency bin. Furthermore, if the VM is applied at the receiver side, all the microphone signals need to be sent over the channel, which makes the representation inefficient, whereas if the VM is applied at the transmitter side, the sound scene cannot be further manipulated and the model loses flexibility and becomes limited to a certain loudspeaker setup. Finally, it does not consider a manipulation of the sound scene based on parametric information.
In    [24] Emmanuel Gallo and Nicolas Tsingos. Extracting and re-rendering structured auditory scenes from field recordings. In AES 30th International Conference on Intelligent Audio Environments, 2007,
the sound source position estimation is based on the pairwise time difference of arrival measured by means of distributed microphones. Furthermore, the receiver is dependent on the recording and necessitates all microphone signals for the synthesis (e.g., the generation of the loudspeaker signals).
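For illustration, a pairwise time difference of arrival can be estimated by locating the peak of the cross-correlation between two microphone signals. The Python sketch below is a generic textbook approach under stated assumptions, not the procedure of [24]; the function names and the speed-of-sound constant are hypothetical choices.

```python
SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def estimate_tdoa_samples(x, y, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive result means that signal y is a delayed copy of signal x.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(x)):
            m = n + lag
            if 0 <= m < len(y):
                score += x[n] * y[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def lag_to_path_difference(lag, sample_rate):
    """Convert a lag in samples to a path-length difference in meters."""
    return lag / sample_rate * SPEED_OF_SOUND
```

Each microphone pair yields one such path-length difference, and several pairs together constrain the source position. This also shows why the approach needs the raw signals of all microphones wherever the estimation takes place.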
The method presented in    [28] Svein Berge. Device and method for converting spatial audio signal. U.S. patent application Ser. No. 10/547,151,
uses, similarly to DirAC, direction of arrival as a parameter, thus limiting the representation to a specific point of view of the sound scene. Moreover, it does not propose the possibility to transmit or store the sound scene representation, since the analysis and synthesis both need to be applied at the same side of the communication system.
Another example is videoconferencing, in which parties that are recorded in different environments need to be played back in a unique sound scene. A Multipoint Control Unit (MCU) has to make sure that a unique sound scene is played back.
In    [22] G. Del Galdo, F. Kuech, M. Kallinger, and R. Schultz-Amling. Efficient merging of multiple audio streams for spatial sound reproduction in directional audio coding. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2009), 2009.
and in    [23] US 20110216908: Apparatus for Merging Spatial Audio Streams
the idea of combining two or more parametric representations of a sound scene has been proposed.
However, it would be highly beneficial if concepts were provided to create a unique sound scene from two or more sound scene representations in an efficient way that is flexible enough to modify the sound scene.