1. Field of the Invention
The invention relates to methods (sometimes referred to as headphone virtualization methods) and systems for generating a binaural signal in response to a multi-channel audio input signal, by applying a binaural room impulse response (BRIR) to each channel of a set of channels (e.g., to all channels) of the input signal. In some embodiments, at least one feedback delay network (FDN) applies a late reverberation portion of a downmix BRIR to a downmix of the channels.
2. Background of the Invention
Headphone virtualization (or binaural rendering) is a technology that aims to deliver a surround sound experience or immersive sound field using standard stereo headphones.
Early headphone virtualizers applied a head-related transfer function (HRTF) to convey spatial information in binaural rendering. An HRTF is a set of direction- and distance-dependent filter pairs that characterize how sound transmits from a specific point in space (sound source location) to both ears of a listener in an anechoic environment. Essential spatial cues, such as the interaural time difference (ITD), interaural level difference (ILD), head shadowing effect, and spectral peaks and notches due to shoulder and pinna reflections, can be perceived in the rendered HRTF-filtered binaural content. Due to the constraint of human head size, HRTFs do not provide sufficient or robust cues regarding source distance beyond roughly one meter. As a result, virtualizers based solely on an HRTF usually do not achieve good externalization or perceived distance.
Most of the acoustic events in our daily life happen in reverberant environments where, in addition to the direct path (from source to ear) modeled by the HRTF, audio signals also reach a listener's ears through various reflection paths. Reflections have a profound impact on auditory perception of distance, room size, and other attributes of the space. To convey this information in binaural rendering, a virtualizer needs to apply the room reverberation in addition to the cues in the direct path HRTF. A binaural room impulse response (BRIR) characterizes the transformation of audio signals from a specific point in space to the listener's ears in a specific acoustic environment. In theory, BRIRs include all acoustic cues regarding spatial perception.
FIG. 1 is a block diagram of one type of conventional headphone virtualizer which is configured to apply a binaural room impulse response (BRIR) to each full frequency range channel (X1, . . . , XN) of a multi-channel audio input signal. Each of channels X1, . . . , XN, is a speaker channel corresponding to a different source direction relative to an assumed listener (i.e., the direction of a direct path from an assumed position of a corresponding speaker to the assumed listener position), and each such channel is convolved by the BRIR for the corresponding source direction. The acoustical pathway from each channel needs to be simulated for each ear. Therefore, in the remainder of this document, the term BRIR will refer to either one impulse response, or a pair of impulse responses associated with the left and right ears. Thus, subsystem 2 is configured to convolve channel X1 with BRIR1 (the BRIR for the corresponding source direction), subsystem 4 is configured to convolve channel XN with BRIRN (the BRIR for the corresponding source direction), and so on. The output of each BRIR subsystem (each of subsystems 2, . . . , 4) is a time-domain signal including a left channel and a right channel. The left channel outputs of the BRIR subsystems are mixed in addition element 6, and the right channel outputs of the BRIR subsystems are mixed in addition element 8. The output of element 6 is the left channel, L, of the binaural audio signal output from the virtualizer, and the output of element 8 is the right channel, R, of the binaural audio signal output from the virtualizer.
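The signal flow of FIG. 1 can be illustrated by the following minimal Python sketch. It is not taken from any actual virtualizer implementation: the function name virtualize, the use of NumPy, and the representation of each BRIR as a (left, right) pair of time-domain arrays are assumptions made purely for illustration. Each channel is convolved with its own BRIR pair (as in subsystems 2, . . . , 4), and the left and right results are summed (as in addition elements 6 and 8).

```python
import numpy as np

def virtualize(channels, brirs):
    """Hypothetical sketch of the FIG. 1 signal flow.

    channels: list of 1-D arrays, one per speaker channel X1..XN.
    brirs:    list of (left_ir, right_ir) pairs, one pair per channel.
    Returns the binaural output (L, R) as two 1-D arrays.
    """
    # Full convolution of a length-a signal with a length-b filter has a+b-1 samples.
    out_len = max(
        len(x) + max(len(h_left), len(h_right)) - 1
        for x, (h_left, h_right) in zip(channels, brirs)
    )
    left = np.zeros(out_len)
    right = np.zeros(out_len)
    for x, (h_left, h_right) in zip(channels, brirs):
        y_left = np.convolve(x, h_left)    # BRIR subsystem, left-ear response
        y_right = np.convolve(x, h_right)  # BRIR subsystem, right-ear response
        left[:len(y_left)] += y_left       # addition element 6
        right[:len(y_right)] += y_right    # addition element 8
    return left, right
```

Direct time-domain convolution is used here only for clarity; a practical system would more likely use a fast (e.g., FFT-based or filterbank-domain) convolution.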
The multi-channel audio input signal may also include a low frequency effects (LFE) or subwoofer channel, identified in FIG. 1 as the “LFE” channel. In a conventional manner, the LFE channel is not convolved with a BRIR, but is instead attenuated in gain stage 5 of FIG. 1 (e.g., by −3 dB or more) and the output of gain stage 5 is mixed equally (by elements 6 and 8) into each channel of the virtualizer's binaural output signal. An additional delay stage may be needed in the LFE path in order to time-align the output of stage 5 with the outputs of the BRIR subsystems (2, . . . , 4). Alternatively, the LFE channel may simply be ignored (i.e., not asserted to or processed by the virtualizer). For example, the FIG. 2 embodiment of the invention (to be described below) simply ignores any LFE channel of the multi-channel audio input signal processed thereby. Many consumer headphones are not capable of accurately reproducing an LFE channel.
In some conventional virtualizers, the input signal undergoes time domain-to-frequency domain transformation into the QMF (quadrature mirror filter) domain, to generate channels of QMF domain frequency components. These frequency components undergo filtering (e.g., in QMF-domain implementations of subsystems 2, . . . , 4 of FIG. 1) in the QMF domain and the resulting frequency components are typically then transformed back into the time domain (e.g., in a final stage of each of subsystems 2, . . . , 4 of FIG. 1) so that the virtualizer's audio output is a time-domain signal (e.g., time-domain binaural signal).
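The LFE path described above can be sketched as follows. This is an illustrative sketch only: the function name mix_lfe, the default 3 dB attenuation, and the delay_samples parameter are assumptions chosen for the example, not values prescribed by any particular virtualizer. An attenuation of A dB corresponds to a linear gain of 10^(−A/20).

```python
import numpy as np

def mix_lfe(left, right, lfe, atten_db=3.0, delay_samples=0):
    """Hypothetical sketch: attenuate the LFE channel (gain stage 5),
    optionally delay it for time alignment, and mix it equally into
    the left and right outputs (elements 6 and 8)."""
    gain = 10.0 ** (-atten_db / 20.0)  # dB attenuation -> linear gain
    delayed = np.concatenate([np.zeros(delay_samples), lfe]) * gain
    n = max(len(left), len(right), len(delayed))
    out_left = np.zeros(n)
    out_right = np.zeros(n)
    out_left[:len(left)] += left
    out_right[:len(right)] += right
    out_left[:len(delayed)] += delayed   # same LFE contribution to each ear
    out_right[:len(delayed)] += delayed
    return out_left, out_right
```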
In general, each full frequency range channel of a multi-channel audio signal input to a headphone virtualizer is assumed to be indicative of audio content emitted from a sound source at a known location relative to the listener's ears. The headphone virtualizer is configured to apply a binaural room impulse response (BRIR) to each such channel of the input signal. Each BRIR can be decomposed into two portions: direct response and reflections. The direct response is the HRTF which corresponds to direction of arrival (DOA) of the sound source, adjusted with proper gain and delay due to distance (between sound source and listener), and optionally augmented with parallax effects for small distances.
The remaining portion of the BRIR models the reflections. Early reflections are usually primary or secondary reflections and have relatively sparse temporal distribution. The micro structure (e.g., ITD and ILD) of each primary or secondary reflection is important. For later reflections (sound reflected from more than two surfaces before being incident at the listener), the echo density increases with increasing number of reflections, and the micro attributes of individual reflections become hard to observe. For increasingly later reflections, the macro structure (e.g., the reverberation decay rate, interaural coherence, and spectral distribution of the overall reverberation) becomes more important. Because of this, the reflections can be further segmented into two parts: early reflections and late reverberations.
The delay of the direct response is the source distance from the listener divided by the speed of sound, and its level is (in absence of walls or large surfaces close to the source location) inversely proportional to the source distance. On the other hand, the delay and level of the late reverberations are generally insensitive to the source location. Due to practical considerations, virtualizers may choose to time-align the direct responses from sources with different distances, and/or compress their dynamic range. However, the temporal and level relationship among the direct response, early reflections, and late reverberation within a BRIR should be maintained.
The effective length of a typical BRIR extends to hundreds of milliseconds or longer in most acoustic environments. Direct application of BRIRs requires convolution with a filter of thousands of taps, which is computationally expensive. In addition, without parameterization, it would require a large memory space to store BRIRs for different source positions in order to achieve sufficient spatial resolution. Last but not least, sound source locations may change over time, and/or the position and orientation of the listener may vary over time. Accurate simulation of such movement requires time-varying BRIRs. Proper interpolation and application of such time-varying filters can be challenging if the impulse responses of these filters have many taps.
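As a worked illustration of the delay and level relationships stated above, the following sketch computes the direct-response delay and relative gain for a given source distance. The speed of sound (343 m/s), the 48 kHz sample rate, the 1 m reference distance, and the function name are all assumptions made for the example.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly room temperature

def direct_delay_and_gain(distance_m, sample_rate_hz=48000, ref_distance_m=1.0):
    """Hypothetical helper: delay and relative level of the direct response."""
    delay_s = distance_m / SPEED_OF_SOUND_M_S       # delay = distance / speed of sound
    delay_samples = int(round(delay_s * sample_rate_hz))
    gain = ref_distance_m / distance_m              # level inversely proportional to distance
    return delay_s, delay_samples, gain
```

Under these assumptions, a source at 3.43 m gives a delay of about 10 ms (480 samples at 48 kHz) and a level of about 0.29 relative to a source at 1 m.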
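The cost of direct convolution can be estimated with a short calculation. The sketch below is illustrative only; the function name and the example figures (a 300 ms BRIR, a 48 kHz sample rate, five channels) are assumptions, and the estimate applies to naive time-domain convolution rather than to fast convolution methods.

```python
def brir_cost(duration_s, sample_rate_hz, num_channels):
    """Hypothetical estimate for naive time-domain BRIR convolution."""
    taps = int(round(duration_s * sample_rate_hz))  # filter length in samples
    # One multiply-accumulate per tap per output sample,
    # for each channel and for each of the two ears.
    macs_per_second = taps * sample_rate_hz * num_channels * 2
    return taps, macs_per_second
```

For these example figures, brir_cost(0.3, 48000, 5) gives 14400 taps per impulse response and 6,912,000,000 multiply-accumulate operations per second.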
A filter having the well-known filter structure known as a feedback delay network (FDN) can be used to implement a spatial reverberator which is configured to apply simulated reverberation to one or more channels of a multi-channel audio input signal. The structure of an FDN is simple. It comprises several reverb tanks (e.g., the reverb tank comprising gain element g1 and delay line z−n1, in the FDN of FIG. 4), each reverb tank having a delay and gain. In a typical implementation of an FDN, the outputs from all the reverb tanks are mixed by a unitary feedback matrix and the outputs of the matrix are fed back to and summed with the inputs to the reverb tanks. Gain adjustments may be made to the reverb tank outputs, and the reverb tank outputs (or gain adjusted versions of them) can be suitably remixed for multi-channel or binaural playback. Natural sounding reverberation can be generated and applied by an FDN with compact computational and memory footprints. FDNs have therefore been used in virtualizers to supplement the direct response produced by the HRTF.
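The FDN structure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the FDN of FIG. 4 or of any commercial product: the function name fdn_reverb, the sample-by-sample loop, and the choice to return the raw reverb tank outputs (leaving any output gain adjustment and remixing to the caller) are all choices made for the example.

```python
import numpy as np

def fdn_reverb(x, delays, gains, feedback, num_out):
    """Hypothetical FDN sketch.

    x:        1-D mono input signal.
    delays:   delay length in samples for each reverb tank.
    gains:    gain for each reverb tank (magnitude below 1 for a decaying tail).
    feedback: square unitary matrix mixing the tank outputs.
    num_out:  number of output samples to produce.
    Returns an array of shape (num_out, number of tanks) of tank outputs.
    """
    n_tanks = len(delays)
    buffers = [np.zeros(d) for d in delays]  # one circular delay line per tank
    idx = [0] * n_tanks
    y = np.zeros((num_out, n_tanks))
    for n in range(num_out):
        xn = x[n] if n < len(x) else 0.0
        # Each tank output is its delayed input scaled by the tank gain.
        tank_out = np.array([gains[i] * buffers[i][idx[i]] for i in range(n_tanks)])
        y[n] = tank_out
        # Mix the tank outputs with the feedback matrix and sum with the input.
        fb = feedback @ tank_out
        for i in range(n_tanks):
            buffers[i][idx[i]] = xn + fb[i]
            idx[i] = (idx[i] + 1) % delays[i]
    return y

# Example usage with an assumed 2x2 normalized Hadamard (unitary) matrix.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
impulse_response = fdn_reverb(np.array([1.0]), [3, 5], [0.5, 0.5], H2, 2000)
```

The delay lengths and gains in the usage example are arbitrary; in practice they would be chosen to obtain a desired reverberation decay time and echo density.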
For example, the commercially available Dolby Mobile headphone virtualizer includes a reverberator having FDN-based structure which is operable to apply reverb to each channel of a five-channel audio signal (having left-front, right-front, center, left-surround, and right-surround channels) and to filter each reverbed channel using a different filter pair of a set of five head related transfer function (“HRTF”) filter pairs. The Dolby Mobile headphone virtualizer is also operable in response to a two-channel audio input signal, to generate a two-channel “reverbed” binaural audio output (a two-channel virtual surround sound output to which reverb has been applied). When the reverbed binaural output is rendered and reproduced by a pair of headphones, it is perceived at the listener's eardrums as HRTF-filtered, reverbed sound from five loudspeakers at left front, right front, center, left rear (surround), and right rear (surround) positions. The virtualizer upmixes a downmixed two-channel audio input (without using any spatial cue parameter received with the audio input) to generate five upmixed audio channels, applies reverb to the upmixed channels, and downmixes the five reverbed channel signals to generate the two-channel reverbed output of the virtualizer. The reverb for each upmixed channel is filtered in a different pair of HRTF filters.
In a virtualizer, an FDN can be configured to achieve certain reverberation decay time and echo density. However, the FDN lacks the flexibility to simulate the micro structure of the early reflections. Further, in conventional virtualizers the tuning and configuration of FDNs has mostly been heuristic.
Headphone virtualizers which do not simulate all reflection paths (early and late) cannot achieve effective externalization. The inventors have recognized that virtualizers which employ FDNs that try to simulate all reflection paths (early and late) usually have no more than limited success in simulating both early reflections and late reverberation and applying both to an audio signal. The inventors have also recognized that virtualizers which employ FDNs but lack the capability to properly control spatial acoustic attributes such as reverb decay time, interaural coherence, and direct-to-late ratio might achieve a degree of externalization, but at the price of introducing excess timbral distortion and reverberation.