The invention relates to a method and apparatus for generating a binaural audio signal and in particular, but not exclusively, to generation of a binaural audio signal from a mono downmix signal.
In the last decade there has been a trend towards multi-channel audio and specifically towards spatial audio extending beyond conventional stereo signals. For example, traditional stereo recordings only comprise two channels whereas modern advanced audio systems typically use five or six channels, as in the popular 5.1 surround sound systems. This provides for a more involved listening experience where the user may be surrounded by sound sources.
Various techniques and standards have been developed for communication of such multi-channel signals. For example, six discrete channels representing a 5.1 surround system may be transmitted in accordance with standards such as the Advanced Audio Coding (AAC) or Dolby Digital standards.
However, in order to provide backwards compatibility, it is known to downmix the higher number of channels to a lower number. Specifically, a 5.1 surround sound signal is frequently downmixed to a stereo signal, allowing the stereo signal to be reproduced by legacy (stereo) decoders and the 5.1 signal by surround sound decoders.
One example is the MPEG2 backwards compatible coding method. A multi-channel signal is downmixed into a stereo signal. Additional signals are encoded in the ancillary data portion allowing an MPEG2 multi-channel decoder to generate a representation of the multi-channel signal. An MPEG1 decoder will disregard the ancillary data and thus only decode the stereo downmix.
There are several parameters which may be used to describe the spatial properties of audio signals. One such parameter is the inter-channel cross-correlation, such as the cross-correlation between the left channel and the right channel for stereo signals. Another parameter is the power ratio of the channels. In so-called (parametric) spatial audio (en)coders, these and other parameters are extracted from the original audio signal in order to produce an audio signal having a reduced number of channels, for example only a single channel, plus a set of parameters describing the spatial properties of the original audio signal. In so-called (parametric) spatial audio decoders, the spatial properties as described by the transmitted spatial parameters are re-instated.
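As an illustration of the two parameters named above, the following minimal sketch computes a power ratio and a normalized cross-correlation for one block of a stereo signal, together with a single-channel downmix. The function names, the expression of the power ratio in decibels, the small regularization constant and the equal-weight downmix are assumptions made for the example; practical parametric coders typically estimate such parameters per time/frequency tile rather than over the full-band signal.

```python
import math


def spatial_parameters(left, right):
    """Return (power ratio in dB, normalized cross-correlation) of two channels.

    Both channels are plain sequences of samples of equal length.
    """
    if len(left) != len(right):
        raise ValueError("channels must have equal length")
    eps = 1e-12  # assumed regularization to avoid division by zero on silence
    p_left = sum(x * x for x in left)
    p_right = sum(x * x for x in right)
    cross = sum(l * r for l, r in zip(left, right))
    power_ratio_db = 10.0 * math.log10((p_left + eps) / (p_right + eps))
    correlation = cross / math.sqrt((p_left + eps) * (p_right + eps))
    return power_ratio_db, correlation


def mono_downmix(left, right):
    """Equal-weight single-channel downmix (weights are an assumption)."""
    return [0.5 * (l + r) for l, r in zip(left, right)]


if __name__ == "__main__":
    left = [1.0, 0.0, -1.0, 0.0]
    right = [0.5, 0.0, -0.5, 0.0]
    print(spatial_parameters(left, right))
    print(mono_downmix(left, right))
```

In this sketch the downmix together with the two parameters stands in for the reduced-channel signal plus spatial parameters described above; a decoder would use the parameters to re-instate the level difference and correlation between the reconstructed channels.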
3D sound source positioning is currently gaining interest, especially in the mobile domain. Music playback and sound effects in mobile games can add significant value to the consumer experience when positioned in 3D, effectively creating an ‘out-of-head’ 3D effect. Specifically, it is known to record and reproduce binaural audio signals which contain specific directional information to which the human ear is sensitive. Binaural recordings are typically made using two microphones mounted in a dummy human head, so that the recorded sound corresponds to the sound captured by the human ear and includes any influences due to the shape of the head and the ears. Binaural recordings differ from stereo (that is, stereophonic) recordings in that the reproduction of a binaural recording is generally intended for a headset or headphones, whereas a stereo recording is generally made for reproduction by loudspeakers. While a binaural recording allows a reproduction of all spatial information using only two channels, a stereo recording would not provide the same spatial perception.
Regular dual channel (stereophonic) or multiple channel (e.g. 5.1) recordings may be transformed into binaural recordings by convolving each regular signal with a set of perceptual transfer functions. Such perceptual transfer functions model the influence of the human head, and possibly other objects, on the signal. A well-known type of spatial perceptual transfer function is the so-called Head-Related Transfer Function (HRTF). An alternative type of spatial perceptual transfer function, which also takes into account reflections caused by the walls, ceiling and floor of a room, is the Binaural Room Impulse Response (BRIR).
Typically, 3D positioning algorithms employ HRTFs (or BRIRs), which describe the transfer from a certain sound source position to the eardrums by means of an impulse response. 3D sound source positioning can be applied to multi-channel signals by means of HRTFs thereby allowing a binaural signal to provide spatial sound information to a user for example using a pair of headphones.
A conventional binaural synthesis algorithm is outlined in FIG. 1. A set of input channels is filtered by a set of HRTFs. Each input signal is split in two signals (a left ‘L’, and a right ‘R’ component); each of these signals is subsequently filtered by an HRTF corresponding to the desired sound source position. All left-ear signals are subsequently summed to generate the left binaural output signal, and the right-ear signals are summed to generate the right binaural output signal.
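The filter-and-sum structure described above can be sketched as follows. This is a minimal time-domain illustration, assuming that each HRTF is available as a finite impulse response and that each input channel has one left-ear and one right-ear response; the function names and the direct-form convolution are choices made for the example, not details taken from FIG. 1.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def binaural_synthesis(channels, hrirs):
    """Filter every channel with its (left, right) impulse responses and sum.

    channels: list of sample lists, one per input channel.
    hrirs:    list of (left_ir, right_ir) pairs, one pair per input channel.
    Returns (left_output, right_output).
    """
    if not channels or len(channels) != len(hrirs):
        raise ValueError("need one (left, right) response pair per channel")
    length = max(
        len(c) + max(len(h_left), len(h_right)) - 1
        for c, (h_left, h_right) in zip(channels, hrirs)
    )
    left_out = [0.0] * length
    right_out = [0.0] * length
    for c, (h_left, h_right) in zip(channels, hrirs):
        # Each input contributes one filtered signal to each ear.
        for n, v in enumerate(convolve(c, h_left)):
            left_out[n] += v
        for n, v in enumerate(convolve(c, h_right)):
            right_out[n] += v
    return left_out, right_out


if __name__ == "__main__":
    channels = [[1.0, 0.0], [0.0, 1.0]]
    hrirs = [([1.0, 0.5], [0.25]), ([0.5], [1.0, 0.5])]
    print(binaural_synthesis(channels, hrirs))
```

The sketch makes the cost visible: every input channel requires two convolutions, so the computational load grows with both the number of channels and the length of the impulse responses.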
Decoder systems are known that can receive a surround sound encoded signal and generate a surround sound experience from a binaural signal. For example, headphone systems are known which allow a surround sound signal to be converted to a surround sound binaural signal for providing a surround sound experience to the user of the headphones.
FIG. 2 illustrates a system wherein an MPEG surround decoder receives a stereo signal with spatial parametric data. The input bit stream is de-multiplexed by a demultiplexer (201) resulting in spatial parameters and a downmix bit stream. The latter bit stream is decoded using a conventional mono or stereo decoder (203). The decoded downmix is decoded by a spatial decoder (205), which generates a multi-channel output based on the transmitted spatial parameters. Finally, the multi-channel output is processed by a binaural synthesis stage (207) (similar to that of FIG. 1) resulting in a binaural output signal providing a surround sound experience to the user.
However, such an approach is complex, requires substantial computational resources, and may further reduce audio quality and introduce audible artifacts.
In order to overcome some of these disadvantages, it has been proposed to combine a parametric multi-channel audio decoder with a binaural synthesis algorithm, such that a multi-channel signal can be rendered on headphones without first generating the multi-channel signal from the transmitted downmix signal and then downmixing that multi-channel signal using HRTF filters.
In such decoders, the upmix spatial parameters for recreating the multi-channel signal are combined with the HRTF filters in order to generate combined parameters which can directly be applied to the downmix signal to generate the binaural signal. In order to do so, the HRTF filters are parameterized.
An example of such a decoder is illustrated in FIG. 3 and further described in Breebaart, J. “Analysis and synthesis of binaural parameters for efficient 3D audio rendering in MPEG Surround”, Proc. ICME, Beijing, China (2007) and Breebaart, J., Faller, C. “Spatial audio processing: MPEG Surround and other applications”, Wiley & Sons, New York (2007).
An input bitstream containing spatial parameters and a downmix signal is received by a demultiplexer 301. The downmix signal is decoded by a conventional decoder 303 resulting in a mono or stereo downmix.
Additionally, HRTF data are converted to the parameter domain by means of an HRTF parameter extraction unit 305. The resulting HRTF parameters are combined with the spatial parameters in a conversion unit 307 to generate combined parameters referred to as binaural parameters. These parameters describe the combined effect of the spatial parameters and the HRTF processing.
The spatial decoder synthesizes the binaural output signal by modifying the decoded downmix signal dependent on the binaural parameters. Specifically, the downmix signal is transferred to a transform or filter bank domain by a transform unit 309 (or the conventional decoder 303 may directly provide the decoded downmix signal as a transform signal). The transform unit 309 can specifically comprise a QMF filter bank to generate QMF subbands. The subband downmix signal is fed to a matrix unit 311 which performs a 2×2 matrix operation in each subband.
If the transmitted downmix is a stereo signal the two input signals to the matrix unit 311 are the two stereo signals. If the transmitted downmix is a mono signal one of the input signals to the matrix unit 311 is the mono signal and the other signal is a decorrelated signal (similar to conventional upmixing of a mono signal to a stereo signal).
For both the mono and stereo downmixes, the matrix unit 311 performs the operation:
$$
\begin{bmatrix} y_{L_B}^{n,k} \\ y_{R_B}^{n,k} \end{bmatrix}
=
\begin{bmatrix} h_{11}^{n,k} & h_{12}^{n,k} \\ h_{21}^{n,k} & h_{22}^{n,k} \end{bmatrix}
\begin{bmatrix} y_{L_0}^{n,k} \\ y_{R_0}^{n,k} \end{bmatrix},
$$

where $k$ is the sub-band index number, $n$ the slot (transform interval) index number, $h_{ij}^{n,k}$ the matrix elements for sub-band $k$, $y_{L_0}^{n,k}, y_{R_0}^{n,k}$ the two input signals for sub-band $k$, and $y_{L_B}^{n,k}, y_{R_B}^{n,k}$ the binaural output signal samples.
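The per-sub-band 2×2 matrix operation can be sketched as below. The nested-list data layout (indexed first by slot n, then by sub-band k), the function name and the use of Python complex numbers for the sub-band samples are assumptions made for the illustration; how the matrix elements are derived from the binaural parameters is not shown. For a mono downmix, the second input would be the decorrelated signal.

```python
def apply_binaural_matrix(h, y_l0, y_r0):
    """Apply one 2x2 matrix per slot n and sub-band k.

    h[n][k]    : ((h11, h12), (h21, h22)) for slot n, sub-band k.
    y_l0[n][k] : first input sample (left downmix, or mono downmix).
    y_r0[n][k] : second input sample (right downmix, or decorrelated signal).
    Returns (y_lb, y_rb), the binaural output samples in the same layout.
    """
    y_lb, y_rb = [], []
    for h_n, l_n, r_n in zip(h, y_l0, y_r0):
        row_l, row_r = [], []
        for ((h11, h12), (h21, h22)), l, r in zip(h_n, l_n, r_n):
            row_l.append(h11 * l + h12 * r)  # left binaural output sample
            row_r.append(h21 * l + h22 * r)  # right binaural output sample
        y_lb.append(row_l)
        y_rb.append(row_r)
    return y_lb, y_rb


if __name__ == "__main__":
    # One slot, two sub-bands; values are arbitrary illustration data.
    h = [[((1, 0), (0, 1)), ((0.5, 0.5), (0.5, -0.5))]]
    y_l0 = [[1 + 1j, 2.0]]
    y_r0 = [[3.0, 4.0]]
    print(apply_binaural_matrix(h, y_l0, y_r0))
```

Each output sample costs two multiplications and one addition per ear, regardless of how many audio channels the spatial parameters describe.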
The matrix unit 311 feeds the binaural output signal samples to an inverse transform unit 313 which transforms the signal back to the time domain. The resulting time domain binaural signal can then be fed to headphones to provide a surround sound experience.
The described approach has a number of advantages:
The HRTF processing can be performed in the transform domain which in many cases can reduce the number of transforms as the same transform domain may be used for decoding the downmix signal.
The complexity of the processing is very low (it uses only multiplication by 2×2 matrices) and is virtually independent of the number of simultaneous audio channels.
It can be applied to both mono and stereo downmixes; HRTFs are represented in a very compact manner and hence can be transmitted and stored very efficiently.
However, the approach also has some disadvantages. Specifically, the approach is only suitable for HRTFs having relatively short impulse responses (generally shorter than the transform interval), as longer impulse responses cannot be represented by the parameterised subband HRTF values. Thus, the approach is not usable for audio environments having long echoes or reverberations. Specifically, the approach typically does not work with echoic HRTFs or Binaural Room Impulse Responses (BRIRs), which can be long and thus very hard to model correctly with the parametric approach.
Hence, an improved system for generating a binaural audio signal would be advantageous and in particular a system allowing increased flexibility, improved performance, facilitated implementation, reduced resource usage and/or improved applicability to different audio environments would be advantageous.