The present invention relates to digital audio signal processing, and more particularly to artificial room impulse responses for virtualization devices and methods.
Multi-channel audio is an important feature of DVD players and home entertainment systems. It provides a more realistic sound experience than is possible with conventional stereophonic systems by roughly approximating the speaker configuration found in movie theaters. FIG. 2b illustrates an example of 5-channel audio processing known as “virtual surround” which consists of creating the illusion of a 5-channel speaker system using a conventional pair of loudspeakers. This technique makes use of the impulse responses (time domain) or transfer functions (frequency domain) from each virtual loudspeaker to a listener's ears; these are the head-related impulse responses (HRIRs) or head-related transfer functions (HRTFs) and depend upon the angles and distance between the speaker and the facing direction of the listener.
By including HRIRs/HRTFs of paths with reflections and attenuations in addition to the direct path from a (virtual) speaker to a listener's ear, the virtual listening environment can be controlled. Such a combination of HRIRs/HRTFs gives a room impulse response or transfer function. A room impulse response is largely unknown, but the direct path HRTFs can be approximated by use of a library of measured HRTFs. For example, Gardner, Transaural 3-D Audio, MIT Media Laboratory Perceptual Computing Section Technical Report No. 342, Jul. 20, 1995, provides HRTFs for every 5 degrees (azimuthal). Then an artificial room impulse response/transfer function can be generated by the superposition of HRIRs/HRTFs corresponding to multiple reflection paths of the sound wave in a virtual room environment together with factors for absorption and phase change upon virtual wall reflections. A widely accepted method for simulating room acoustics called the “image method” can be used to determine a set of angles and distances of virtual speakers corresponding to wall reflections. Each virtual speaker (described by its angle and distance) can be associated with an HRIR (or its corresponding HRTF) attenuated by an amount that depends on the distance and number of reflections along its path. Therefore, the room impulse response corresponding to a speaker and its wall reflections can be obtained by summing the HRIR corresponding to the location of the original speaker with respect to the listener and the HRIRs corresponding to locations imaged by wall reflections. As the distance and number of reflections increase, the corresponding HRIR suffers a stronger attenuation that causes the room impulse response to decay slowly towards the end. An example of a room impulse response generated using this method is shown in FIG. 2h. 
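The superposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any cited reference: the function name `room_impulse_response`, the 1/distance spreading loss, the single per-reflection wall gain `wall_gain`, and the tiny two-entry HRIR "library" are all hypothetical. The (angle, distance, reflection count) triples are assumed to come from an image-method calculation that is not shown.

```python
def room_impulse_response(hrirs, paths, fs=48000, c=343.0, wall_gain=0.8):
    """Sum delayed, attenuated HRIRs for the direct path and its image sources.

    hrirs: dict mapping azimuth in degrees to a list of HRIR samples
           (hypothetical stand-in for a measured HRIR library)
    paths: list of (azimuth_deg, distance_m, num_reflections) tuples
    fs:    sampling rate in Hz; c: speed of sound in m/s
    """
    # Propagation delay of each path, rounded to whole samples.
    delays = [int(round(fs * dist / c)) for _, dist, _ in paths]
    length = max(d + len(hrirs[az]) for d, (az, _, _) in zip(delays, paths))
    rir = [0.0] * length
    for d, (az, dist, n_refl) in zip(delays, paths):
        # Assumed attenuation model: 1/r spreading times one gain per wall bounce.
        gain = (wall_gain ** n_refl) / dist
        for k, h in enumerate(hrirs[az]):
            rir[d + k] += gain * h
    return rir


# Toy example: a direct path at 30 degrees and one first-order reflection.
toy_hrirs = {30: [1.0, 0.5], 150: [0.6, 0.3]}
toy_paths = [(30, 2.0, 0), (150, 4.0, 1)]
rir = room_impulse_response(toy_hrirs, toy_paths, fs=343, c=343.0)
```

With the toy sampling rate chosen so that one metre equals one sample, the direct-path HRIR lands at sample 2 with gain 1/2 and the reflected HRIR at sample 4 with gain 0.8/4, which shows how later, more-reflected paths contribute progressively weaker terms to the tail.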
The signal processing can be more explicitly described as follows. FIG. 2e shows functional blocks of the 2-speaker implementation for the 5-channel arrangement of FIG. 2b; this implementation requires cross-talk cancellation for the real speakers which appears in the lower right block in FIG. 2e. Here cross-talk denotes the signal from the right speaker that is heard at the left ear and vice-versa. The basic solution to eliminate cross-talk was proposed in U.S. Pat. No. 3,236,949 and is explained as follows. Consider a listener facing two loudspeakers as shown in FIG. 2a. Let X1(e^{jω}) and X2(e^{jω}) denote the (short-term) Fourier transforms of the analog signals which drive the left and right loudspeakers, respectively, and let Y1(e^{jω}) and Y2(e^{jω}) denote the Fourier transforms of the analog signals actually heard at the listener's left and right ears, respectively. Presuming a symmetrical speaker arrangement, the system can then be characterized by two HRTFs, H1(e^{jω}) and H2(e^{jω}), which respectively relate to the short and long paths from speaker to ear; that is, H1(e^{jω}) is the transfer function from left speaker to left ear or right speaker to right ear, and H2(e^{jω}) is the transfer function from left speaker to right ear and from right speaker to left ear. This situation can be described as a linear transformation from X1, X2 to Y1, Y2 with a 2×2 matrix having elements H1 and H2:
$$\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \begin{bmatrix} H_1 & H_2 \\ H_2 & H_1 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$$
Note that the dependence of H1 and H2 on the angle that the speakers are offset from the facing direction of the listener has been omitted.
FIG. 3 shows a cross-talk cancellation system in which the input electrical signals (short-term Fourier transformed) E1(e^{jω}), E2(e^{jω}) are modified to give the signals X1, X2 which drive the loudspeakers. (Note that the input signals E1, E2 are the recorded signals, typically recorded using either a pair of moderately-spaced omni-directional microphones or a pair of adjacent uni-directional microphones with an approximately 60 degree angle between the two microphone directions.) This conversion from E1, E2 into X1, X2 is also a linear transformation and can be represented by a 2×2 matrix. If the target is to reproduce signals E1, E2 at the listener's ears (so Y1=E1 and Y2=E2) and thereby cancel the effect of the cross-talk (due to H2 not being 0), then the 2×2 matrix should be the inverse of the 2×2 matrix having elements H1 and H2. That is, taking
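$$\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} H_1 & H_2 \\ H_2 & H_1 \end{bmatrix}^{-1} \begin{bmatrix} E_1 \\ E_2 \end{bmatrix} = \frac{1}{H_1^2 - H_2^2} \begin{bmatrix} H_1 & -H_2 \\ -H_2 & H_1 \end{bmatrix} \begin{bmatrix} E_1 \\ E_2 \end{bmatrix}$$
yields Y1=E1 and Y2=E2.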
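The inverse-matrix relation can be checked numerically at a single frequency bin. The following Python sketch is illustrative only: the function names and the complex values chosen for H1, H2, E1 and E2 are made up, and a real canceller would apply the same arithmetic at every frequency bin of measured HRTFs.

```python
def crosstalk_cancel(h1, h2, e1, e2):
    """Apply the inverse of [[h1, h2], [h2, h1]] to (e1, e2) at one frequency bin."""
    det = h1 * h1 - h2 * h2  # H1^2 - H2^2; the gain grows where this nears zero
    x1 = (h1 * e1 - h2 * e2) / det
    x2 = (-h2 * e1 + h1 * e2) / det
    return x1, x2


def speakers_to_ears(h1, h2, x1, x2):
    """Acoustic paths: speaker drive signals (x1, x2) to ear signals (y1, y2)."""
    return h1 * x1 + h2 * x2, h2 * x1 + h1 * x2


# Made-up complex values standing in for one bin of H1, H2, E1, E2.
h1, h2 = 1.0 + 0.2j, 0.4 - 0.3j
e1, e2 = 0.7 - 0.1j, -0.2 + 0.5j

x1, x2 = crosstalk_cancel(h1, h2, e1, e2)
y1, y2 = speakers_to_ears(h1, h2, x1, x2)
# Up to floating-point rounding, y1 matches e1 and y2 matches e2.
```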
Of course, the implementation of such filters would require considerable dynamic range reduction in order to avoid saturation near frequencies with response peaks. For example, with two real speakers each 30 degrees offset as in FIG. 2a, the log magnitude of 1/(H1^2 − H2^2) has the form illustrated by FIG. 2g. The range is from 0 Hz to 24000 Hz sampled every 93.75 Hz (using an FFT length of 512). The gain has been scaled so that the minimum gain is 1.0 (0 dB on the log scale). Note the large peak near 8000 Hz (near bin 90). This large peak in turn limits the available dynamic range.
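The scaling and the bin spacing quoted above can be reproduced with a short Python sketch. It assumes a 48 kHz sampling rate, which is inferred from the stated 0 Hz to 24000 Hz range and the 512-point FFT rather than given explicitly; the function names and the three-bin test spectra are hypothetical.

```python
import math


def canceller_gain_db(h1, h2):
    """Log magnitude of 1/(H1^2 - H2^2) per bin, scaled so the minimum is 0 dB."""
    gain = [1.0 / abs(a * a - b * b) for a, b in zip(h1, h2)]
    g_min = min(gain)
    return [20.0 * math.log10(g / g_min) for g in gain]


def bin_frequency_hz(k, fft_len=512, fs=48000.0):
    """Centre frequency of FFT bin k (fs = 48 kHz is an inferred assumption)."""
    return k * fs / fft_len


# Made-up three-bin spectra: the middle bin has H2 close to H1 in magnitude,
# so H1^2 - H2^2 is small there and the canceller gain peaks.
toy_h1 = [1.0, 1.0, 1.0]
toy_h2 = [0.0, 0.9, 0.5]
gain_db = canceller_gain_db(toy_h1, toy_h2)

spacing = bin_frequency_hz(1)   # bin spacing in Hz
bin_90 = bin_frequency_hz(90)   # frequency of bin 90 in Hz
```

Under these assumptions the bin spacing is 48000/512 = 93.75 Hz and bin 90 sits at 8437.5 Hz, which is consistent with a peak described as near 8000 Hz.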
For example, the left surround sound virtual speaker could be at an azimuthal angle of about 225 degrees. Thus with cross-talk cancellation, the corresponding two real speaker inputs to create the virtual left surround sound speaker would be:
$$\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \frac{1}{H_1^2 - H_2^2} \begin{bmatrix} H_1 & -H_2 \\ -H_2 & H_1 \end{bmatrix} \begin{bmatrix} \mathrm{TF3}_{\mathrm{left}} \cdot \mathrm{LSS} \\ \mathrm{TF3}_{\mathrm{right}} \cdot \mathrm{LSS} \end{bmatrix}$$
where H1, H2 are for the left and right real speaker angles (e.g., 30 and 330 degrees), LSS is the (short-term Fourier transform of the) left surround sound signal, and TF3left=H1(225), TF3right=H2(225) are the HRTFs for the left surround sound speaker angle (225 degrees).
Again, FIG. 2e shows functional blocks for a virtualizer with the cross-talk canceller to implement 5-channel audio with two real speakers as in FIG. 2b. Each channel signal is filtered by the pair of HRTFs for the corresponding (virtual) speaker's offset angle and distance; the filtered signals are summed and input into the cross-talk canceller, and the two cross-talk-cancelled outputs then drive the two real speakers.
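As a rough sketch of this signal flow at a single frequency bin, the Python below filters each channel with its pair of virtual-speaker HRTFs, sums the results per ear, and passes the two sums through the cross-talk canceller. The function name `virtualize_bin`, the two-channel example, and every numeric value are hypothetical; a full virtualizer would repeat this over all bins and all five channels.

```python
def virtualize_bin(channels, h1, h2):
    """Speaker drive signals (x1, x2) for one frequency bin.

    channels: list of (signal, tf_left, tf_right) complex triples, where
              tf_left/tf_right are the HRTFs of that channel's virtual speaker
    h1, h2:   same-side and cross-side HRTFs of the two real speakers
    """
    # Filter each channel by its HRTF pair and sum the contributions per ear.
    e1 = sum(tf_l * s for s, tf_l, _ in channels)
    e2 = sum(tf_r * s for s, _, tf_r in channels)
    # Cross-talk cancellation: inverse of [[h1, h2], [h2, h1]].
    det = h1 * h1 - h2 * h2
    return (h1 * e1 - h2 * e2) / det, (h1 * e2 - h2 * e1) / det


# Made-up values for two channels and the real-speaker HRTFs.
channels = [
    (0.5 + 0.1j, 0.9 - 0.2j, 0.3 + 0.4j),
    (-0.3 + 0.2j, 0.2 + 0.1j, 0.8 - 0.1j),
]
h1, h2 = 1.0 + 0.2j, 0.4 - 0.3j
x1, x2 = virtualize_bin(channels, h1, h2)

# What reaches the ears after the real acoustic paths.
y_left = h1 * x1 + h2 * x2
y_right = h2 * x1 + h1 * x2
```

Here `y_left` and `y_right` equal the summed HRTF-filtered channel signals up to rounding, which is the intended effect of placing the canceller after the summation.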
In the case of headphones, the cross-talk problem disappears, and the filtered channel signals can directly drive the headphones as shown in FIGS. 2c and 2f. Also, FIG. 2d illustrates an approximate symmetry between forward and rear speaker locations.
Generally in multi-channel audio processing, the filtering with HRIRs or HRTFs and/or room impulse responses takes the form of many convolutions of input audio signals with long filters. Typically, a room impulse response from each (virtual) sound source to each ear is used. Since an artificial room impulse response can be several seconds long, this poses a challenging computational problem even for fast digital signal processors. Further, artificial room impulse responses need to be corrected in terms of spectral characteristics due to coloration effects introduced by HRIR filters. External equalizers, however, would involve additional computational overhead and could disrupt phase relations that are important in 3D virtualization systems.
One approach to lowering computational complexity of the filtering convolutions first transforms the input signal and the filter impulse response into the frequency domain (as by FFT), where the convolution becomes a pointwise multiplication, and then inverse transforms the product back into the time domain (as by IFFT) to recover the convolution result. The overlap-add method uses this approach with zero padding prior to the FFT so that the wrap-around of circular convolution does not corrupt the result. Further, for filtering with a long impulse response, the impulse response can be sectioned into shorter filters, the filtering (convolution) by each filter section computed separately, and the results added to give the overall filtering output.
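Both ideas, overlap-add and sectioning of a long impulse response, can be illustrated with a small pure-Python sketch. Everything here is a teaching aid under stated assumptions: the function names, block sizes and section sizes are hypothetical, the recursive radix-2 FFT is a minimal stand-in for an optimized library routine, and a practical implementation would precompute the filter transforms instead of recomputing them per block.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. Inverse is unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_convolve(a, b):
    """Linear convolution via zero-padded FFTs (padding prevents wrap-around)."""
    n_out = len(a) + len(b) - 1
    n = 1
    while n < n_out:
        n *= 2
    fa = fft(list(a) + [0.0] * (n - len(a)))
    fb = fft(list(b) + [0.0] * (n - len(b)))
    y = fft([p * q for p, q in zip(fa, fb)], inverse=True)
    return [v.real / n for v in y[:n_out]]


def overlap_add(x, h, block=4):
    """Filter x with h one input block at a time, adding the overlapping tails."""
    y = [0.0] * (len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = fft_convolve(x[start:start + block], h)
        for k, v in enumerate(seg):
            y[start + k] += v
    return y


def sectioned_filter(x, h, section=2, block=4):
    """Split h into short sections, filter with each, and add the delayed results."""
    y = [0.0] * (len(x) + len(h) - 1)
    for offset in range(0, len(h), section):
        part = overlap_add(x, h[offset:offset + section], block)
        for k, v in enumerate(part):
            y[offset + k] += v
    return y


def direct_convolve(x, h):
    """Reference time-domain convolution used only to check the result."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, a in enumerate(x):
        for j, b in enumerate(h):
            y[i + j] += a * b
    return y


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
h = [1.0, -1.0, 0.5, 0.25, 2.0]
y_fast = sectioned_filter(x, h)
y_ref = direct_convolve(x, h)
```

The sectioned, block-wise result `y_fast` agrees with the direct convolution `y_ref` to within floating-point rounding, because convolution is linear in both the input blocks and the filter sections.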