1. Field of the Invention
The invention relates to methods (sometimes referred to as headphone virtualization methods) and systems for generating a binaural audio signal in response to a multi-channel audio input signal, by applying a binaural room impulse response (BRIR) to each channel of a set of channels (e.g., to all channels) of the input signal, and to methods and systems for designing BRIRs for use in such methods and systems.
2. Background of the Invention
Headphone virtualization (or binaural rendering) is a technology that aims to deliver a surround sound experience or immersive sound field using standard stereo headphones.
A method for generating a binaural signal in response to a multi-channel audio input signal (or in response to a set of channels of such a signal) is sometimes referred to herein as a “headphone virtualization” method, and a system configured to perform such a method is sometimes referred to herein as a “headphone virtualizer” (or “headphone virtualization system” or “binaural virtualizer”).
Recently, the number of people enjoying music, movies, and games using headphones has grown dramatically. Portable devices offer a convenient and popular alternative to experiencing entertainment in cinemas and home theaters, and headphones (including earbuds) are the primary listening means. Unfortunately, traditional headphone listening typically provides only a limited audio experience relative to that provided by other traditional presentation systems. The limitations can be attributed to significant acoustic path differences between naturally occurring soundfields and those produced by headphones. Audio content in the form of either original stereo material or multi-channel audio downmixes is perceived as significantly ellipsoidal in nature when presented in a traditional manner over headphones (the sound is perceived as emanating from locations “in-the-head” and to the immediate left and right of the ears). Most listeners have little if any sensation of front-back depth, let alone elevation. On the other hand, a traditional presentation over loudspeakers is perceived in nearly all cases as “out-of-head” (well-externalized).
A primary goal of headphone virtualizers is to impart a sense of natural space to stereo and multi-channel audio programs delivered over headphones. Ideally, soundfields produced over headphones are sufficiently realistic and convincing that headphone users lose awareness that they are wearing headphones at all. The sense of space can be created by convolving appropriately-designed binaural room impulse responses (BRIRs) with each audio channel or object in the program. The processing can be applied either by the content creator or by a consumer playback device. The BRIR typically represents the impulse response of the electro-acoustic system from loudspeakers, in a given room, to the entrance of the ear canal.
Early headphone virtualizers applied a head-related transfer function (HRTF) to convey spatial information in binaural rendering. An HRTF is a direction- and distance-dependent filter pair that characterizes how sound transmits from a specific point in space (sound source location) to both ears of a listener in an anechoic environment. Essential spatial cues such as the interaural time difference (ITD), interaural level difference (ILD), head shadowing effect, and spectral peaks and notches due to shoulder and pinna reflections, can be perceived in the rendered HRTF-filtered binaural content. Due to the constraint of human head size, the HRTFs do not provide sufficient or robust cues regarding source distance beyond roughly one meter. As a result, virtualizers based solely on HRTFs usually do not achieve good externalization or perceived distance.
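The lateralization cues mentioned above can be illustrated with the classic Woodworth spherical-head approximation for the interaural time difference, ITD = (r/c)(sin θ + θ). This is a well-known textbook model, not part of the source; the function name and default head radius are illustrative assumptions:

```python
import math

def woodworth_itd(azimuth_deg, head_radius_m=0.0875, c=343.0):
    """Approximate interaural time difference (ITD) in seconds using the
    classic Woodworth spherical-head formula ITD = (r/c) * (sin(theta) + theta),
    valid for source azimuths within +/-90 degrees of the median plane."""
    theta = math.radians(azimuth_deg)
    return (head_radius_m / c) * (math.sin(theta) + theta)

# A source straight ahead yields no ITD; a source 90 degrees to one side
# yields the maximum ITD, roughly 0.66 ms for an average-size head.
print(woodworth_itd(0.0))   # 0.0
print(woodworth_itd(90.0))  # ~0.00066 s
```

Note that the model depends only on head geometry, which is consistent with the point made above: head-size-derived cues saturate and carry little information about source distance beyond roughly one meter.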
Most of the acoustic events in our daily life happen in reverberant environments where, in addition to the direct path (from source to ear) modeled by HRTFs, audio signals also reach a listener's ears through various reflection paths. Reflections have a profound impact on auditory perception, conveying distance, room size, and other attributes of the space. To convey this information in binaural rendering, a virtualizer needs to apply the room reverberation in addition to the cues in the direct path HRTF. A binaural room impulse response (BRIR) characterizes the transformation of audio signals from a specific point in space to the listener's ears in a specific acoustic environment. In theory, BRIRs derived from room response measurements include all acoustic cues regarding spatial perception.
FIG. 1 is a block diagram of a system (20) including a headphone virtualization system of a type configured to apply a binaural room impulse response (BRIR) to each full frequency range channel (X1, . . . , XN) of a multi-channel audio input signal. The headphone virtualization system (sometimes referred to as a virtualizer) can be configured to apply a conventionally determined binaural room impulse response, BRIRi, to each channel Xi.
Each of channels X1, . . . , XN, (which may be stationary speaker channels or moving object channels) corresponds to a specific source direction (azimuth and elevation) and distance relative to an assumed listener (i.e., the direction of a direct path from an assumed position of a corresponding speaker to the assumed listener position and the distance along the direct path between the assumed listener and speaker positions), and each such channel is convolved with the BRIR for the corresponding source direction and distance. Thus, subsystem 2 is configured to convolve channel X1 with BRIR1 (the BRIR for the corresponding source direction and distance), subsystem 4 is configured to convolve channel XN with BRIRN (the BRIR for the corresponding source direction and distance), and so on. The output of each BRIR subsystem (each of subsystems 2, . . . , 4) is a time-domain binaural audio signal including a left channel and a right channel.
The multi-channel audio input signal may also include a low frequency effects (LFE) or subwoofer channel, identified in FIG. 1 as the “LFE” channel. In a conventional manner, the LFE channel is not convolved with a BRIR, but is instead attenuated in gain stage 5 of FIG. 1 (e.g., by −3 dB or more) and the output of gain stage 5 is mixed equally (by elements 6 and 8) into each channel of the virtualizer's binaural output signal. An additional delay stage may be needed in the LFE path in order to time-align the output of stage 5 with the outputs of the BRIR subsystems (2, . . . , 4). Alternatively, the LFE channel may simply be ignored (i.e., not asserted to or processed by the virtualizer). Many consumer headphones are not capable of accurately reproducing an LFE channel.
The left channel outputs of the BRIR subsystems are mixed (with the output of stage 5) in addition element 6, and the right channel outputs of the BRIR subsystems are mixed (with the output of stage 5) in addition element 8. The output of element 6 is the left channel, L, of the binaural audio signal output from the virtualizer, and the output of element 8 is the right channel, R, of the binaural audio signal output from the virtualizer.
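The FIG. 1 signal flow described above (per-channel BRIR convolution, LFE attenuation, and summation into left and right outputs) can be sketched as follows. The function name `virtualize` and its arguments are hypothetical illustration; BRIRs are represented simply as (left, right) pairs of FIR tap lists:

```python
def convolve(x, h):
    """Direct-form FIR convolution; output has len(x) + len(h) - 1 samples."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def virtualize(channels, brirs, lfe=None, lfe_gain_db=-3.0):
    """Sketch of the FIG. 1 flow: each full-range channel is convolved with
    its (left, right) BRIR pair (subsystems 2, ..., 4), the results are summed
    per ear (elements 6 and 8), and an attenuated LFE channel, if present,
    is mixed equally into both outputs (gain stage 5)."""
    n = max(len(ch) + max(len(hl), len(hr)) - 1
            for ch, (hl, hr) in zip(channels, brirs))
    left = [0.0] * n
    right = [0.0] * n
    for ch, (h_l, h_r) in zip(channels, brirs):
        for k, v in enumerate(convolve(ch, h_l)):
            left[k] += v
        for k, v in enumerate(convolve(ch, h_r)):
            right[k] += v
    if lfe is not None:
        g = 10.0 ** (lfe_gain_db / 20.0)  # e.g. -3 dB attenuation
        for k, v in enumerate(lfe):
            if k < n:
                left[k] += g * v
                right[k] += g * v
    return left, right
```

A real implementation would use fast (FFT-based or filterbank-domain) convolution rather than this O(N·M) direct form, and would include the LFE time-alignment delay mentioned above.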
System 20 may be a decoder which is coupled to receive an encoded audio program, and which includes a subsystem (not shown in FIG. 1) coupled and configured to decode the program including by recovering the N full frequency range channels (X1, . . . , XN) and the LFE channel therefrom and to provide them to elements 2, . . . , 4, and 5 of the virtualizer (which comprises elements 2, . . . , 4, 5, 6, and 8, coupled as shown). The decoder may include additional subsystems, some of which perform functions not related to the virtualization function performed by the virtualization system, and some of which may perform functions related to the virtualization function. For example, the latter functions may include extraction of metadata from the encoded program, and provision of the metadata to a virtualization control subsystem which employs the metadata to control elements of the virtualizer system.
In some conventional virtualizers, the input signal undergoes time domain-to-frequency domain transformation into the QMF (quadrature mirror filter) domain, to generate channels of QMF domain frequency components. These frequency components undergo filtering (e.g., in QMF-domain implementations of subsystems 2, . . . , 4 of FIG. 1) in the QMF domain and the resulting frequency components are typically then transformed back into the time domain (e.g., in a final stage of each of subsystems 2, . . . , 4 of FIG. 1) so that the virtualizer's audio output is a time-domain signal (e.g., time-domain binaural audio signal).
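The analyze-process-resynthesize pattern described above can be demonstrated with a toy two-band Haar filterbank. This is an assumed simplification for illustration only; actual virtualizers use a multi-band (e.g., hybrid complex QMF) filterbank, not this two-band real-valued one:

```python
import math

S = 1.0 / math.sqrt(2.0)  # normalization for the Haar filter pair

def analysis(x):
    """Two-band Haar analysis: split x (even length) into critically
    downsampled low-band and high-band subsequences."""
    assert len(x) % 2 == 0
    low = [S * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    high = [S * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return low, high

def synthesis(low, high):
    """Inverse transform: perfectly reconstructs the analysis input, mirroring
    the synthesis filterbank stage that returns processed bands to the time
    domain."""
    x = []
    for l, h in zip(low, high):
        x.append(S * (l + h))
        x.append(S * (l - h))
    return x
```

In a virtualizer, per-band filtering (the QMF-domain implementations of subsystems 2, . . . , 4) would be applied between the analysis and synthesis steps.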
In general, each full frequency range channel of a multi-channel audio signal input to a headphone virtualizer is assumed to be indicative of audio content emitted from a sound source at a known location relative to the listener's ears. The headphone virtualizer is configured to apply a binaural room impulse response (BRIR) to each such channel of the input signal.
The BRIR can be separated into three overlapping regions. The first region, which the inventors refer to as the direct response, represents the impulse response from a point in anechoic space to the entrance of the ear canal. This response, typically of 5 ms duration or less, is more commonly referred to as the Head-Related Transfer Function (HRTF). The second region, referred to as early reflections, contains sound reflections from objects that are closest to the sound source and the listener (e.g. floor, room walls, furniture). The last region, called the late response, comprises a mixture of higher-order reflections with different intensities and from a variety of directions. This region is often described by stochastic parameters such as the peak density, modal density, and energy-decay time (T60) due to its complex structure.
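The three-region partition can be sketched with simple time windows. The 5 ms direct-response duration follows the text; the 80 ms early/late boundary is an assumed illustrative value, and in practice the regions overlap rather than abutting cleanly:

```python
def split_brir(brir, fs, direct_ms=5.0, early_ms=80.0):
    """Partition a BRIR (list of samples at rate fs) into direct response,
    early reflections, and late response using non-overlapping time windows.
    The 80 ms early/late boundary is an assumed illustrative value; real
    regions overlap and depend on the room."""
    d = int(fs * direct_ms / 1000.0)   # end of direct-response window
    e = int(fs * early_ms / 1000.0)    # end of early-reflections window
    return brir[:d], brir[d:e], brir[e:]
```

Such a split is a common first step when the direct, early, and late portions of a BRIR are to be synthesized or applied by different processing structures.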
Early reflections are usually primary or secondary reflections and have relatively sparse temporal distribution. The micro structure (e.g., ITD and ILD) of each primary or secondary reflection is important. For later reflections (sound reflected from more than two surfaces before being incident at the listener), the echo density increases with increasing number of reflections, and the micro attributes of individual reflections become hard to observe. For increasingly later reflections, the macro structure (e.g., the reverberation decay rate, interaural coherence, and spectral distribution of the overall reverberation) becomes more important.
The human auditory system has evolved to respond to perceptual cues conveyed in all three regions. The first region (direct response) mostly determines the perceived direction of a sound source. This phenomenon is referred to as the law of the first wavefront. The second region (early reflections) has a modest effect on the perceived direction of a source, but a stronger influence on the perceived timbre and distance of the source. The third region (late response) influences the perceived environment in which the source is located. For this reason, achieving an optimal virtualizer design requires careful study of the effects of all three regions on BRIR performance.
One approach to BRIR design is to derive all or part of each BRIR to be applied by a virtualizer from either physical room and head measurements or room and head model simulations. Typically a room or room model having very desirable acoustical properties is selected, with the aim that the headphone virtualizer replicate the compelling listening experience of the actual room. Under the assumption that the room model accurately embodies acoustical characteristics of the selected listening room, this approach produces virtualizer BRIRs that inherently apply the auditory cues essential to spatial audio perception. Such cues that are well-known in the art include interaural time difference, interaural level difference, interaural coherence, reverberation time (T60 as a function of frequency), direct-to-reverberant ratio, specific spectral peaks and notches and echo density. Under ideal BRIR measurement and headphone listening conditions, binaural renderings of multi-channel audio files based on physical room BRIRs can sound virtually indistinguishable from loudspeaker presentation in the same room.
However, a drawback of conventional methods for BRIR design is that binaural renderings produced using conventionally designed BRIRs (which have been designed to match actual room BRIRs) can sound colored, muddy, and not well-externalized when auditioned in inconsistent listening environments (environments that are inconsistent with the measurement room). The root causes of this phenomenon are still an ongoing area of research and involve both aural and visual sensory input. However, what is evident is that BRIRs designed to match physical room BRIRs can modify the signal to be rendered in both desirable and undesirable ways. Even top-quality listening rooms impart spectral coloration and time-smearing to the rendered output signal. As one example, acoustic reflections from some listening rooms are lowpass in nature. This leads to low-frequency spectral notches in the rendered output signal (spectral combing). Although low-frequency spectral notches are known to aid humans in sound source localization, in headphone listening scenarios they are generally undesirable due to added spectral coloration. In an actual listening scenario using loudspeakers positioned away from the listener, the human auditory/cognition system is able to adapt to its environment so that these impairments can go unnoticed. However, when a listener receives the same acoustic signals presented over headphones in an inconsistent listening environment, such impairments become more apparent and reduce naturalness relative to a conventional stereo program.
Other considerations in BRIR design include any applicable constraints on BRIR size and length. The effective length of a typical BRIR extends to hundreds of milliseconds or longer in most acoustic environments. Direct application of BRIRs may require convolution with a filter of thousands of taps, which is computationally expensive. Without parameterization, a large memory space may be needed to store BRIRs for different source positions in order to achieve sufficient spatial resolution.
A filter having the well-known filter structure known as a feedback delay network (FDN) can be used to implement a spatial reverberator which is configured to apply simulated reverberation (i.e., a late response portion of a BRIR) to each channel of a multi-channel audio input signal, or to apply an entire (early and late portion of a) BRIR to each such channel. The structure of an FDN is simple. It comprises several branches (sometimes referred to as reverb tanks). Each reverb tank (e.g., the reverb tank comprising gain element g1 and delay line z−n1, in the FDN of FIG. 3) has a delay and gain. In a typical implementation of an FDN, the outputs from all the reverb tanks are mixed by a unitary feedback matrix and the outputs of the matrix are fed back to and summed with the inputs to the reverb tanks. Gain adjustments may be made to the reverb tank outputs, and the reverb tank outputs (or gain adjusted versions of them) can be suitably remixed for binaural playback. Natural sounding reverberation can be generated and applied by an FDN with compact computational and memory footprints. FDNs have therefore been used in virtualizers, to apply a BRIR or to supplement the direct response applied by an HRTF.
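A minimal mono FDN along the lines just described, with gain-plus-delay reverb tanks mixed in the feedback path by a unitary matrix, might look like the following sketch. The delay and gain values are illustrative assumptions, and the binaural panning and output mixing described later are omitted:

```python
from collections import deque

def fdn_reverb(x, delays=(149, 211, 263, 293), gains=(0.93, 0.91, 0.89, 0.88)):
    """Minimal mono feedback delay network: four reverb tanks (gain + delay
    line); the delayed tank outputs are mixed by a 4x4 unitary matrix (a
    normalized Hadamard matrix here) and fed back to the tank inputs.
    Delay and gain values are illustrative, not from the source."""
    A = [[0.5, 0.5, 0.5, 0.5],
         [0.5, -0.5, 0.5, -0.5],
         [0.5, 0.5, -0.5, -0.5],
         [0.5, -0.5, -0.5, 0.5]]  # unitary: A times its transpose is I
    lines = [deque([0.0] * d, maxlen=d) for d in delays]  # delay lines
    out = []
    for sample in x:
        tank_out = [line[0] for line in lines]  # oldest sample of each line
        fb = [sum(A[i][j] * tank_out[j] for j in range(4)) for i in range(4)]
        for i, line in enumerate(lines):
            line.append(gains[i] * (sample + fb[i]))  # gain then delay
        out.append(sum(tank_out))
    return out
```

Because the recirculating energy decays by the tank gains on every pass through the loop, an impulse input produces an exponentially decaying, increasingly dense tail at low computational and memory cost, which is why FDNs suit the late-response portion of a BRIR.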
An example of a BRIR system (e.g., an implementation of one of subsystems 2, . . . , 4 of the virtualizer of FIG. 1) which employs feedback delay networks (FDNs) to apply a BRIR to an input signal channel will be described with reference to FIG. 2. The BRIR system of FIG. 2 includes analysis filterbank 202, a bank of FDNs (FDNs 203, 204, . . . , and 205), and synthesis filterbank 207, coupled as shown. Analysis filterbank 202 is configured to apply a transform to the input channel Xi to split its audio content into “K” frequency bands, where K is an integer. The filterbank domain values (output from filterbank 202) in each different frequency band are asserted to a different one of the FDNs 203, 204, . . . , 205 (there are “K” of these FDNs), which are coupled and configured to apply the BRIR to the filterbank domain values asserted thereto.
In a variation on the system shown in FIG. 2, each of FDNs 203, 204, . . . , 205 is coupled and configured to apply a late reverberation portion (or early reflection and late reverberation portions) of a BRIR to the filterbank domain values asserted thereto, and another subsystem (not shown in FIG. 2) applies the direct response and early reflection portions (or the direct response portion) of the BRIR to the input channel Xi.
With reference again to FIG. 2, each of the FDNs 203, 204, . . . , and 205, is implemented in the filterbank domain, and is coupled and configured to process a different frequency band of the values output from analysis filterbank 202, to generate left and right channel filtered signals for each band. For each band, the left filtered signal is a sequence of filterbank domain values, and the right filtered signal is another sequence of filterbank domain values. Synthesis filterbank 207 is coupled and configured to apply a frequency domain-to-time domain transform to the 2K sequences of filterbank domain values (e.g., QMF domain frequency components) output from the FDNs, and to assemble the transformed values into a left channel time domain signal (indicative of left channel audio to which the BRIR has been applied) and a right channel time domain signal (indicative of right channel audio to which the BRIR has been applied).
In a typical implementation each of the FDNs 203, 204, . . . , and 205, is implemented in the QMF domain, and filterbank 202 transforms the input channel 201 into the QMF domain (e.g., the hybrid complex quadrature mirror filter (HCQMF) domain), so that the signal asserted from filterbank 202 to an input of each of FDNs 203, 204, . . . , and 205 is a sequence of QMF domain frequency components. In such an implementation, the signal asserted from filterbank 202 to FDN 203 is a sequence of QMF domain frequency components in a first frequency band, the signal asserted from filterbank 202 to FDN 204 is a sequence of QMF domain frequency components in a second frequency band, and the signal asserted from filterbank 202 to FDN 205 is a sequence of QMF domain frequency components in a “K”th frequency band. When analysis filterbank 202 is so implemented, synthesis filterbank 207 is configured to apply a QMF domain-to-time domain transform to the 2K sequences of output QMF domain frequency components from the FDNs, to generate the left channel and right channel late-reverbed time-domain signals which are output to element 210.
The feedback delay network of FIG. 3 is an exemplary implementation of FDN 203 (or 204 or 205) of FIG. 2. Although the FIG. 3 system has four reverb tanks (each including a gain stage, gi, and a delay line, z−ni, coupled to the output of the gain stage), variations on the system (and other FDNs employed in embodiments of the inventive virtualizer) may implement more or fewer than four reverb tanks.
The FDN of FIG. 3 includes input gain element 300, all-pass filter (APF) 301 coupled to the output of element 300, addition elements 302, 303, 304, and 305 coupled to the output of APF 301, and four reverb tanks (each comprising a gain element, gk (one of elements 306), a delay line, z−Mk (one of elements 307) coupled thereto, and a gain element, 1/gk (one of elements 309) coupled thereto, where 1≤k≤4) each coupled to the output of a different one of elements 302, 303, 304, and 305. Unitary matrix 308 is coupled to the outputs of the delay lines 307, and is configured to assert a feedback output to a second input of each of elements 302, 303, 304, and 305. The outputs of two of gain elements 309 (of the first and second reverb tanks) are asserted to inputs of addition element 310, and the output of element 310 is asserted to one input of output mixing matrix 312. The outputs of the other two of gain elements 309 (of the third and fourth reverb tanks) are asserted to inputs of addition element 311, and the output of element 311 is asserted to the other input of output mixing matrix 312.
Element 302 is configured to add the output of matrix 308 which corresponds to delay line z−n1 (i.e., to apply feedback from the output of delay line z−n1 via matrix 308) to the input of the first reverb tank. Element 303 is configured to add the output of matrix 308 which corresponds to delay line z−n2 (i.e., to apply feedback from the output of delay line z−n2 via matrix 308) to the input of the second reverb tank. Element 304 is configured to add the output of matrix 308 which corresponds to delay line z−n3 (i.e., to apply feedback from the output of delay line z−n3 via matrix 308) to the input of the third reverb tank. Element 305 is configured to add the output of matrix 308 which corresponds to delay line z−n4 (i.e., to apply feedback from the output of delay line z−n4 via matrix 308) to the input of the fourth reverb tank.
Input gain element 300 of the FDN of FIG. 3 is coupled to receive one frequency band of the transformed signal (a filterbank domain signal) which is output from analysis filterbank 202 of FIG. 2. Input gain element 300 applies a gain (scaling) factor, Gin, to the filterbank domain signal asserted thereto. Collectively, the scaling factors Gin (implemented by all the FDNs 203, 204, . . . , 205 of FIG. 2) for all the frequency bands control the spectral shaping and level.
In a typical QMF-domain implementation of the FDN of FIG. 3, the signal asserted from the output of all-pass filter (APF) 301 to the inputs of the reverb tanks is a sequence of QMF domain frequency components. To generate more natural sounding FDN output, APF 301 is applied to the output of gain element 300 to introduce phase diversity and increased echo density. Alternatively, or additionally, one or more all-pass delay filters may be applied in the reverb tank feed-forward or feed-back paths depicted in FIG. 3 (e.g., in addition to, or in place of, the delay lines z−Mk in each reverb tank), or to the outputs of the FDN (i.e., to the outputs of output matrix 312).
In implementing the reverb tank delays, z−ni, the reverb delays ni should be mutually prime numbers to avoid the reverb modes aligning at the same frequency. The sum of the delays should be large enough to provide sufficient modal density in order to avoid artificial sounding output. But the shortest delays should be short enough to avoid excess time gap between the late reverberation and the other components of the BRIR.
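The mutual-primality constraint on the reverb tank delays can be checked directly; the specific delay values below are illustrative, not taken from the source:

```python
from math import gcd
from itertools import combinations

def mutually_coprime(delays):
    """Return True if all reverb tank delays are pairwise coprime, so that
    the resonant modes of different tanks do not align at common
    frequencies."""
    return all(gcd(a, b) == 1 for a, b in combinations(delays, 2))

print(mutually_coprime((149, 211, 263, 293)))  # True: all four are prime
print(mutually_coprime((100, 150, 210, 330)))  # False: shared factors
```

Choosing the delays from distinct primes, as in the first example, satisfies the constraint trivially; the remaining freedom is then used to make the sum of the delays large enough for sufficient modal density while keeping the shortest delay small, per the text.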
Typically, the reverb tank outputs are initially panned to either the left or the right binaural channel. Normally, the sets of reverb tank outputs being panned to the two binaural channels are equal in number and mutually exclusive. It is also desired to balance the timing of the two binaural channels. So if the reverb tank output with the shortest delay goes to one binaural channel, the one with the second shortest delay would go to the other channel.
The reverb tank delays can be different across frequency bands so as to change the modal density as a function of frequency. Generally, lower frequency bands require higher modal density, and thus longer reverb tank delays.
The amplitudes of the reverb tank gains, gi, and the reverb tank delays jointly determine the reverb decay time of the FDN of FIG. 3:

T60 = −3ni/(log10(|gi|)·FFRM)

where FFRM is the frame rate of filterbank 202 (of FIG. 2). The phases of the reverb tank gains introduce fractional delays to overcome the issues related to reverb tank delays being quantized to the downsample-factor grid of the filterbank.
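The decay-time relation above can be inverted to choose a tank gain magnitude for a target T60. The function names and the example parameter values are hypothetical:

```python
import math

def tank_gain(delay, t60, frame_rate):
    """Magnitude of the reverb tank gain g_i yielding decay time t60 (seconds)
    for a tank delay of `delay` filterbank frames, obtained by inverting
    T60 = -3*n_i / (log10(|g_i|) * F_FRM)."""
    return 10.0 ** (-3.0 * delay / (t60 * frame_rate))

def decay_time(delay, gain, frame_rate):
    """The T60 (seconds) produced by a given tank delay and gain magnitude,
    directly from the formula in the text."""
    return -3.0 * delay / (math.log10(gain) * frame_rate)
```

Because T60 depends on both ni and |gi|, tanks with longer delays are assigned gains closer to unity so that all tanks decay at the same rate; a complex gain with the same magnitude can then carry the fractional-delay phase mentioned above.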
The unitary feedback matrix 308 provides even mixing among the reverb tanks in the feedback path.
To equalize the levels of the reverb tank outputs, gain elements 309 apply a normalization gain, 1/|gi|, to the output of each reverb tank, to remove the level impact of the reverb tank gains while preserving the fractional delays introduced by their phases.
Output mixing matrix 312 (also identified as matrix Mout) is a 2×2 matrix configured to mix the unmixed binaural channels (the outputs of elements 310 and 311, respectively) from initial panning to achieve output left and right binaural channels (the L and R signals asserted at the output of matrix 312) having desired interaural coherence. The unmixed binaural channels are close to being uncorrelated after the initial panning because they do not share any common reverb tank output. If the desired interaural coherence is Coh, where |Coh|≤1, output mixing matrix 312 may be defined as:
Mout = | cos β   sin β |
       | sin β   cos β |

where β = arcsin(Coh)/2. Because the reverb tank delays are different, one of the unmixed binaural channels would constantly lead the other. If the combination of reverb tank delays and panning pattern were identical across frequency bands, a sound image bias would result. This bias can be mitigated if the panning pattern is alternated across the frequency bands such that the mixed binaural channels lead and trail each other in alternating frequency bands. This can be achieved by implementing the output mixing matrix 312 so as to have the form set forth above in odd-numbered frequency bands (i.e., in the first frequency band (processed by FDN 203 of FIG. 2), the third frequency band, and so on), and to have the following form in even-numbered frequency bands (i.e., in the second frequency band (processed by FDN 204 of FIG. 2), the fourth frequency band, and so on):
Mout,alt = | sin β   cos β |
           | cos β   sin β |

where the definition of β remains the same. It should be noted that matrix 312 can be implemented to be identical in the FDNs for all frequency bands, but the channel order of its inputs may be switched for alternating ones of the frequency bands (e.g., the output of element 310 may be asserted to the first input of matrix 312 and the output of element 311 to the second input in odd frequency bands, and the output of element 311 may be asserted to the first input of matrix 312 and the output of element 310 to the second input in even frequency bands).
In the case that frequency bands are (partially) overlapping, the width of the frequency range over which matrix 312's form is alternated can be increased (e.g., it could be alternated once for every two or three consecutive bands), or the value of β in the above expressions (for the form of matrix 312) can be adjusted to ensure that the average coherence equals the desired value, to compensate for spectral overlap of consecutive frequency bands.
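The per-band coherence mixing described above can be sketched as follows, with `alternate` selecting the swapped-row form used in even-numbered bands. The function name is hypothetical; it operates on one sample pair for clarity:

```python
import math

def mix_for_coherence(left_unmixed, right_unmixed, coh, alternate=False):
    """Apply the 2x2 output mixing matrix from the text to a pair of (nearly)
    uncorrelated unmixed binaural samples so that the mixed pair has the
    desired interaural coherence Coh (|Coh| <= 1), using beta = arcsin(Coh)/2.
    With alternate=True, the swapped-row form for even-numbered frequency
    bands is applied instead."""
    beta = math.asin(coh) / 2.0
    c, s = math.cos(beta), math.sin(beta)
    if alternate:
        c, s = s, c  # rows of Mout,alt are the rows of Mout swapped
    return (c * left_unmixed + s * right_unmixed,
            s * left_unmixed + c * right_unmixed)
```

Two sanity checks follow from the matrix form: Coh = 0 gives β = 0, so the matrix is the identity (channels pass through unmixed), while Coh = 1 gives β = π/4, so both outputs become identical (fully coherent).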
The inventors have recognized that it would be desirable to design BRIRs that apply (to the input signal channels) the least processing necessary to achieve natural-sounding and well-externalized audio over headphones. In typical embodiments of the present invention, this is accomplished by designing BRIRs that assimilate binaural cues that are not only important to spatial perception but also maintain naturalness of the rendered signal. Binaural cues that improve spatial perception but only at the cost of audio distortion are avoided. Many of the cues that are avoided are a direct result of acoustical effects that our physical surroundings have on the sound received by our ears. Accordingly, typical embodiments of the inventive BRIR design method incorporate room features that result in virtualizer performance gains and avoid those that cause unacceptable quality impairments. In short, rather than design a virtualizer BRIR from a room, typical embodiments design a perceptually-optimized BRIR that in turn defines a minimalistic virtual room. The virtual room selectively incorporates acoustical properties of physical spaces, but is not bound by constraints of actual rooms.