The present invention relates to digital audio signal processing, and more particularly to loudspeaker and headphone virtualization and cross-talk cancellation devices and methods.
Multi-channel audio inputs designed for multiple loudspeakers can be processed to drive a single pair of loudspeakers and/or headphones to provide a perceived sound field simulating that of the multiple loudspeakers. In addition to creation of such virtual speakers for surround sound effects, signal processing can also provide changes in perceived listening room size and shape by control of effects such as reverberation.
Multi-channel audio is an important feature of DVD players and home entertainment systems. It provides a more realistic sound experience than is possible with conventional stereophonic systems by roughly approximating the speaker configuration found in movie theaters. FIG. 2b illustrates an example of multi-channel audio processing known as “virtual surround” which consists of creating the illusion of a multi-channel speaker system using a conventional pair of loudspeakers. This technique makes use of transfer functions from virtual loudspeakers to a listener's ears; that is, transfer functions made from the head-related transfer function (HRTF) of the direct path and of all the reflections of the virtual listening environment. A room transfer function is largely unknown, but the actual HRTFs (which are functions of the angles between source direction and head direction) can be approximated by use of a library of measured HRTFs. For example, Gardner, Transaural 3-D Audio, MIT Media Laboratory Perceptual Computing Section Technical Report No. 342, Jul. 20, 1995, provides HRTFs for every 5 degrees (azimuthal).
FIG. 2e shows functional blocks of an implementation for the (real plus virtual) speaker arrangement of FIG. 2b; this requires cross-talk cancellation for the real speakers as shown in the lower right of FIG. 2e. Here cross-talk denotes the signal from the right speaker that is heard at the left ear and vice-versa. The basic solution to eliminate cross-talk was proposed in U.S. Pat. No. 3,236,949 and is explained as follows. Consider a listener facing two loudspeakers as shown in FIG. 2a. Let X1(e^{jω}) and X2(e^{jω}) denote the (short-term) Fourier transforms of the analog signals which drive the left and right loudspeakers, respectively, and let Y1(e^{jω}) and Y2(e^{jω}) denote the Fourier transforms of the analog signals actually heard at the listener's left and right ears, respectively. Presuming a symmetrical speaker arrangement, the system can then be characterized by two HRTFs, H1(e^{jω}) and H2(e^{jω}), which respectively relate to the short and long paths from speaker to ear; that is, H1(e^{jω}) is the transfer function from left speaker to left ear or right speaker to right ear, and H2(e^{jω}) is the transfer function from left speaker to right ear and from right speaker to left ear. This situation can be described as a linear transformation from X1, X2 to Y1, Y2 with a 2×2 matrix having elements H1 and H2:
$$
\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \begin{bmatrix} H_1 & H_2 \\ H_2 & H_1 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}
$$
Note that the dependence of H1 and H2 on the angle that the speakers are offset from the facing direction of the listener has been omitted.
FIG. 3 shows a cross-talk cancellation system in which the input electrical signals (short-term Fourier transformed) E1(e^{jω}), E2(e^{jω}) are modified to give the signals X1, X2 which drive the loudspeakers. (Note that the input signals E1, E2 are the recorded signals, typically recorded using either a pair of moderately-spaced omni-directional microphones or a pair of adjacent uni-directional microphones with an angle between the two microphone directions.) This conversion from E1, E2 into X1, X2 is also a linear transformation and can be represented by a 2×2 matrix. If the target is to reproduce signals E1, E2 at the listener's ears (so Y1=E1 and Y2=E2) and thereby cancel the effect of the cross-talk (due to H2 not being 0), then the 2×2 matrix should be the inverse of the 2×2 matrix having elements H1 and H2. That is, taking
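$$
\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} H_1 & H_2 \\ H_2 & H_1 \end{bmatrix}^{-1} \begin{bmatrix} E_1 \\ E_2 \end{bmatrix} = \frac{1}{H_1^2 - H_2^2} \begin{bmatrix} H_1 & -H_2 \\ -H_2 & H_1 \end{bmatrix} \begin{bmatrix} E_1 \\ E_2 \end{bmatrix}
$$
yields Y1=E1 and Y2=E2.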
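The inversion above can be checked numerically at a single frequency bin. The following is a minimal sketch, not the implementation of the cited patent: the function names and the complex values chosen for H1, H2, E1, and E2 are assumptions made purely for illustration.

```python
import cmath

def crosstalk_cancel(h1, h2, e1, e2):
    """Apply the inverse of [[h1, h2], [h2, h1]] to (e1, e2) for one frequency bin."""
    det = h1 * h1 - h2 * h2  # determinant H1^2 - H2^2 of the symmetric matrix
    if abs(det) < 1e-12:
        raise ZeroDivisionError("H1^2 - H2^2 is (nearly) zero at this bin")
    x1 = (h1 * e1 - h2 * e2) / det
    x2 = (-h2 * e1 + h1 * e2) / det
    return x1, x2

def ears(h1, h2, x1, x2):
    """Acoustic paths: apply [[h1, h2], [h2, h1]] to the speaker signals (x1, x2)."""
    return h1 * x1 + h2 * x2, h2 * x1 + h1 * x2

# Hypothetical single-bin values, chosen only for illustration.
h1 = 1.0 + 0.0j                  # short (same-side) path
h2 = 0.5 * cmath.exp(-0.8j)      # weaker, delayed cross path
e1, e2 = 0.3 + 0.4j, -0.2 + 0.1j  # desired ear signals

x1, x2 = crosstalk_cancel(h1, h2, e1, e2)
y1, y2 = ears(h1, h2, x1, x2)    # should reproduce (e1, e2)
```

In a full system the same operation would presumably be applied at every bin of the short-term Fourier transform; the guard on the determinant anticipates the near-null problem discussed later in this section.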
An efficient implementation of the cross-talk canceller diagonalizes the 2×2 matrix having elements H1 and H2:
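$$
\begin{bmatrix} H_1 & H_2 \\ H_2 & H_1 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} M_0 & 0 \\ 0 & S_0 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
$$
where M0(e^{jω})=H1(e^{jω})+H2(e^{jω}) and S0(e^{jω})=H1(e^{jω})−H2(e^{jω}). Thus the inverse becomes simple to compute: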
$$
\begin{bmatrix} H_1 & H_2 \\ H_2 & H_1 \end{bmatrix}^{-1} = \frac{1}{2} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} 1/M_0 & 0 \\ 0 & 1/S_0 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
$$
And the cross-talk cancellation is efficiently implemented as sum/difference detectors with the inverse filters 1/M0(e^{jω}) and 1/S0(e^{jω}), as shown in FIG. 4a. This structure is referred to as the “shuffler” cross-talk canceller. U.S. Pat. No. 5,333,200 discloses this plus various other cross-talk signal processing.
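The shuffler structure can be illustrated per frequency bin as follows. This is a sketch under stated assumptions rather than the circuit of FIG. 4a: the function names and the numeric values are hypothetical, and the direct 2×2 inverse is included only to confirm that both forms agree.

```python
import cmath

def shuffler_cancel(h1, h2, e1, e2):
    """Shuffler form: sum/difference, inverse-filter by 1/M0 and 1/S0, sum/difference."""
    m0 = h1 + h2                 # M0 = H1 + H2
    s0 = h1 - h2                 # S0 = H1 - H2
    total = (e1 + e2) / m0       # sum branch through 1/M0
    diff = (e1 - e2) / s0        # difference branch through 1/S0
    return 0.5 * (total + diff), 0.5 * (total - diff)

def direct_cancel(h1, h2, e1, e2):
    """Direct 2x2 inverse, used here only as a reference."""
    det = h1 * h1 - h2 * h2
    return (h1 * e1 - h2 * e2) / det, (-h2 * e1 + h1 * e2) / det

# Hypothetical single-bin values, chosen only for illustration.
h1 = 1.0 + 0.0j
h2 = 0.5 * cmath.exp(-0.8j)
e1, e2 = 0.3 + 0.4j, -0.2 + 0.1j

xs = shuffler_cancel(h1, h2, e1, e2)
xd = direct_cancel(h1, h2, e1, e2)
```

The shuffler needs two complex divisions (or two inverse filters) per bin, whereas the direct form applies four filter terms, which is one way to see why the diagonalized form is described as efficient.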
However, a practical problem arises in the actual implementation due to approximate nulls in the transfer functions M0(e^{jω})=H1(e^{jω})+H2(e^{jω}) and S0(e^{jω})=H1(e^{jω})−H2(e^{jω}). Near such nulls the inverse filters 1/M0 and 1/S0 have large response peaks, so their implementation would require considerable dynamic range reduction in order to avoid saturation about those frequencies. For example, with two real speakers each 30 degrees offset as in FIG. 2a, the log magnitude of
$$
\frac{1}{H_1^2 - H_2^2}
$$
has the form illustrated by FIG. 2g. The range is from 0 Hz to 24000 Hz sampled every 93.75 Hz (using an FFT length of 512). The gain has been scaled so that the minimum gain is 1.0 or 0 on the log scale. Note the large peak near 8000 Hz (near frequency bin 90). This large peak in turn limits the available dynamic range. The cross-referenced copending application presents a method that is a simple and effective solution to this problem based on frequency band separation of the input signal using power complementary IIR filters. This method works well for time domain implementations, and in particular when a “shuffler” cross-talk canceller as in FIG. 4a is employed.
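The scaling used for a plot such as FIG. 2g (minimum gain normalized to 1.0, i.e. 0 on the log scale) can be sketched as below. Measured HRTFs are not available here, so the sketch substitutes a toy model in which H1 is unity and H2 is an attenuated, delayed copy; the model, its parameters, and all names are assumptions, and its peaks are not expected to match those of FIG. 2g. A 48 kHz sampling rate is assumed, which is consistent with the 93.75 Hz bin spacing at an FFT length of 512.

```python
import cmath
import math

FS = 48000.0          # assumed sampling rate (gives 93.75 Hz bins at NFFT = 512)
NFFT = 512
BIN_HZ = FS / NFFT    # 93.75 Hz

def toy_hrtfs(k, gain=0.7, delay_s=0.00025):
    """Hypothetical HRTF pair at bin k: unity direct path, attenuated delayed cross path."""
    w = 2.0 * math.pi * k * BIN_HZ
    return 1.0 + 0.0j, gain * cmath.exp(-1j * w * delay_s)

def normalized_log_gain(n_bins=NFFT // 2 + 1):
    """Gain of 1/(H1^2 - H2^2) in dB, scaled so the minimum gain is 0 dB."""
    mags = []
    for k in range(n_bins):
        h1, h2 = toy_hrtfs(k)
        mags.append(1.0 / abs(h1 * h1 - h2 * h2))
    floor = min(mags)
    return [20.0 * math.log10(m / floor) for m in mags]

log_gain = normalized_log_gain()          # bins 0..256 cover 0 Hz to 24000 Hz
peak_bin = max(range(len(log_gain)), key=log_gain.__getitem__)
headroom_db = log_gain[peak_bin]          # dynamic range consumed by the largest peak
```

The value of `headroom_db` is the amount by which the input would have to be attenuated to keep the largest peak from saturating, which is the dynamic range penalty the text describes.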
Now with cross-talk cancellation, the FIG. 2b virtual plus real loudspeaker arrangement can be simply created by use of the HRTFs for the offset angles of the speakers. In particular, let H1(θ) and H2(θ) denote the two HRTFs for a speaker offset by angle θ (or 360−θ by symmetry) from the facing direction of the listener. If the (short-term Fourier transform of the) speaker signal is denoted SS, then the corresponding left and right ear signals E1 and E2 would be H1(θ)·SS and H2(θ)·SS, respectively. These ear signals would be used as previously described for inputs to the cross-talk canceller; the cross-talk canceller outputs then drive the two real speakers to simulate a speaker at an angle θ and driven by source SS.
For example, the left surround sound virtual speaker could be at an azimuthal angle of about 225 degrees. Thus with cross-talk cancellation, the corresponding two real speaker inputs to create the virtual left surround sound speaker would be:
$$
\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \frac{1}{H_1^2 - H_2^2} \begin{bmatrix} H_1 & -H_2 \\ -H_2 & H_1 \end{bmatrix} \begin{bmatrix} TF3_{left} \cdot LSS \\ TF3_{right} \cdot LSS \end{bmatrix}
$$
where H1, H2 are for the left and right real speaker angles (e.g., 30 and 330 degrees), LSS is the (short-term Fourier transform of the) left surround sound signal, and TF3left=H1(225), TF3right=H2(225) are the HRTFs for the left surround sound speaker angle (225 degrees).
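The virtual-speaker computation can be sketched for one frequency bin as follows. The function names and every numeric value (including the stand-ins for the HRTFs at 30 and 225 degrees) are hypothetical; in practice these would come from a measured HRTF library such as the one cited above.

```python
import cmath

def virtual_speaker_drive(h1, h2, tf3_left, tf3_right, lss):
    """Real-speaker signals (x1, x2) that simulate a virtual speaker fed by lss."""
    e1 = tf3_left * lss          # desired left-ear signal
    e2 = tf3_right * lss         # desired right-ear signal
    det = h1 * h1 - h2 * h2      # H1^2 - H2^2 for the real speaker angles
    x1 = (h1 * e1 - h2 * e2) / det
    x2 = (-h2 * e1 + h1 * e2) / det
    return x1, x2

# Hypothetical single-bin values; not measured HRTFs.
h1 = 1.0 + 0.0j                      # stand-in for H1 at the real speaker angle
h2 = 0.5 * cmath.exp(-0.8j)          # stand-in for H2 at the real speaker angle
tf3_left = 0.9 * cmath.exp(-0.3j)    # stand-in for H1(225)
tf3_right = 0.4 * cmath.exp(-1.1j)   # stand-in for H2(225)
lss = 0.6 - 0.2j                     # one bin of the left surround signal

x1, x2 = virtual_speaker_drive(h1, h2, tf3_left, tf3_right, lss)

# What the listener hears through the real acoustic paths.
y1 = h1 * x1 + h2 * x2
y2 = h2 * x1 + h1 * x2
```

The ear signals `y1` and `y2` should equal TF3left·LSS and TF3right·LSS, that is, what a physical speaker at the virtual angle would have produced.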
Again, FIG. 2e shows functional blocks for a virtualizer with the cross-talk canceller to implement 5-channel audio with two real speakers as in FIG. 2b; each speaker signal is filtered by the corresponding pair of HRTFs for the speaker's offset angle and distance, and the filtered signals summed and input into the cross-talk canceller and then into the two real speakers.
The conventional scheme for reducing the computational cost of multi-channel audio processing is to minimize the number of calculations involved in each FIR filtering process and does not consider the significant overhead introduced by multi-channel processing. The scheme can be described as a set of S×2 filters, where S is the number of sources. FIG. 2h illustrates a typical filtering scheme for the left output channel when S=5. The sound sources representing input channels are denoted C0, C1, C2, C3, and C4. The filter representing the path from C0 to the left ear is denoted Ffull [C0, left], and so on. The patterns in the block representing each Ffull indicate that the filter is made up of an early arrival section and a late reverberation section.
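The S×2 filtering scheme for one output channel can be sketched in the time domain as below. This is an illustration only: the helper names, the two-tap filters, and the impulse inputs are hypothetical, and real Ffull filters would be far longer, since each contains an early arrival section and a late reverberation section.

```python
def fir(x, h):
    """Direct-form FIR filtering (full linear convolution) of signal x with taps h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y

def render_ear(sources, filters):
    """Sum of every source filtered by its own source-to-ear filter (one output channel)."""
    length = max(len(x) + len(h) - 1 for x, h in zip(sources, filters))
    out = [0.0] * length
    for x, h in zip(sources, filters):
        for n, v in enumerate(fir(x, h)):
            out[n] += v
    return out

# S = 5 sources C0..C4, each a unit impulse (hypothetical test input).
sources = [[1.0, 0.0] for _ in range(5)]
# Hypothetical stand-ins for Ffull[C0, left] .. Ffull[C4, left].
left_filters = [[0.5, 0.25], [0.25, 0.125], [0.125, 0.0625], [0.25, 0.5], [0.5, 0.125]]

left = render_ear(sources, left_filters)
```

The right output channel would repeat the same sum with a second set of S filters, so the cost per output sample grows with both the number of sources and the filter length.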