The present invention is generally related to an audio signal decorrelator, a multi-channel signal processor, a five-channel audio signal processor, a method for deriving an output audio signal from an input audio signal, and a computer program. In particular, the present invention is directed at a convergence enhancement for acoustic echo cancellation (AEC).
In the context of telecommunications and other applications, the problem of acoustic crosstalk between a loudspeaker, which is emitting sound, and a microphone, which is active simultaneously to pick up sound from the same acoustic environment, is well-known. As a solution, technology for acoustic echo cancellation (AEC) has been proposed in the past, both for reproduction of a single sound channel (“single-channel AEC”) and for reproduction of two audio channels (“stereo AEC”).
With respect to single-channel AEC, reference is made to the following publications, a detailed list of which is included in the appendix of the present application: [Hae92], [Bre99], [Kel84]. With respect to stereo AEC, reference is made to the following publications: [Shi95], [Gae00], [Buc01], [Sug01].
FIG. 9 shows a generic diagram of an AEC application and describes a typical scenario for stereo AEC. The system of FIG. 9 is designated in its entirety with 900. In a transmitting room 910, sound from a source, e.g. a speaker 912, is picked up via two microphones 920, 922. The relation between the sound transmitted by the speaker 912 and the sound received by the two microphones 920, 922 is described by transfer functions g1(k), g2(k). In other words, the transfer functions g1(k), g2(k) are for example influenced by the acoustic characteristics of the transmitting room 910 (e.g. reflections) and by the distance between the speaker 912 and the two microphones 920, 922. The microphone signals xl(k), xp(k) are transmitted to a receiving room 930, and are reproduced via two loudspeakers 932, 934.
At the same time, a microphone 940 in the receiving room 930 is set up to pick up speech from another user 942, who is present in the receiving room. Sound signals emitted by the first speaker 932 couple to the microphone 940, wherein a transmission characteristic between the first speaker 932 and the microphone 940 is designated with hl(k). Also, an acoustic signal produced by the second speaker 934 is coupled to the microphone 940, wherein a transfer characteristic between the second speaker 934 and the microphone 940 is designated with hp(k).
In order to prevent sound emitted from the two speakers 932, 934 from coupling into the outgoing microphone signal (which is, for example, sent back to a far-end listener, e.g. a human and/or a machine), the AEC 950 attempts to cancel out any contributions of the incoming signals xl(k), xp(k) from the outgoing signal e(k) by subtracting filtered versions of the incoming signals xl(k), xp(k) from the outgoing one (e.g. from the microphone signal y(k) of the microphone 940).
In other words, the received signal xl(k) is filtered using a filter function ĥl(k), and the result of the filtering is subtracted from the microphone signal y(k). Also, the signal xp(k) is filtered with a filter function ĥp(k). The result of this filtering is further subtracted from the microphone signal y(k), so that a corrected microphone signal e(k) is obtained by subtracting the filtered versions of the signals xl(k), xp(k) from the microphone signal y(k). Canceling out (or at least reducing) contributions of the incoming signals xl(k), xp(k) from the outgoing signal e(k) generally necessitates that the cancellation filters 952, 954 are dynamically adjusted by an adaptation algorithm to achieve a minimum error signal e(k), and thus optimum cancellation.
It is known that this is the case when the adapted cancellation filters 952, 954 are an accurate model of the transfer characteristics (transfer function hp(k), hl(k), or impulse response) between the emitting speakers 932, 934 and the microphone 940.
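The cancellation principle described above can be sketched in a few lines of code. The following is a deliberately minimal illustration, not the system of FIG. 9: it uses a normalized LMS (NLMS) update, which is only one possible adaptation algorithm, and the function name and parameters (`num_taps`, `mu`) are placeholders introduced here for illustration.

```python
import numpy as np

def nlms_stereo_aec(x1, x2, y, num_taps=64, mu=0.5, eps=1e-8):
    """Cancel echo from microphone signal y given loudspeaker signals x1, x2.

    Two adaptive FIR filters h1_hat, h2_hat model the loudspeaker-to-
    microphone paths; their outputs are subtracted from y, and both filters
    are adjusted with a normalized LMS step to minimize the error e(k).
    """
    h1_hat = np.zeros(num_taps)
    h2_hat = np.zeros(num_taps)
    e = np.zeros(len(y))
    for k in range(num_taps - 1, len(y)):
        # most recent num_taps samples, newest first
        u1 = x1[k - num_taps + 1:k + 1][::-1]
        u2 = x2[k - num_taps + 1:k + 1][::-1]
        # corrected microphone signal: y minus both filtered echo estimates
        e[k] = y[k] - h1_hat @ u1 - h2_hat @ u2
        # NLMS update, normalized by the combined input power
        norm = u1 @ u1 + u2 @ u2 + eps
        h1_hat += mu * e[k] * u1 / norm
        h2_hat += mu * e[k] * u2 / norm
    return e, h1_hat, h2_hat
```

With uncorrelated loudspeaker signals and a sufficient filter length, the error e(k) decays as the filters converge towards the true echo paths.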
Two important areas of application for AEC are hands-free telephony (where the far-end listener is another human being at the remote end of the telephone connection) and microphone signal enhancement for automatic speech recognition (ASR). In the latter case, the objective is to remove the influence of other sound reproduced in the room from the microphone signal in order to enable operation of an automatic speech recognizer with low recognition error rates. As an example, music from a HiFi setup may be removed from the input of a voice command module to allow reliable control of certain functions by spoken user commands.
It has further been shown that for the case of stereo AEC, a so-called “non-uniqueness problem” exists [Son95]: If both loudspeaker signals are strongly correlated, then the adaptive filters generally converge to a solution (ĥp(k), ĥl(k)) that does not correctly model the transfer functions hp(k), hl(k) between the speakers 932, 934 and the microphone 940, but merely optimizes echo cancellation for given particular loudspeaker signals. As a consequence, a change in the characteristics of a loudspeaker signal xl(k), xp(k) (e.g. due to a change of a geometric position of the sound source 912 in the transmitting room 910) results in a breakdown of the echo cancellation performance and necessitates a new adaptation of the cancellation filters.
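The non-uniqueness problem can be illustrated with a deliberately simplified single-tap example (scalar "transfer functions" and a perfectly correlated channel pair; all numerical values are hypothetical and chosen only for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1 = rng.standard_normal(n)
alpha = 0.8
x2 = alpha * x1                      # strongly correlated loudspeaker signals

h1_true, h2_true = 0.6, -0.3         # toy single-tap "room paths"
y = h1_true * x1 + h2_true * x2      # microphone signal (echo only)

# Any filter pair with  h1_hat + alpha * h2_hat == h1_true + alpha * h2_true
# cancels the echo perfectly, even though it misidentifies the paths.
c = 0.5                              # arbitrary offset along the ambiguity line
h1_hat = h1_true + alpha * c
h2_hat = h2_true - c
e = y - h1_hat * x1 - h2_hat * x2
print(np.max(np.abs(e)))             # effectively zero: cancellation succeeds
                                     # although h1_hat != h1_true
```

As soon as the correlation between x1 and x2 changes (e.g. the source 912 moves), such a wrongly converged solution no longer cancels the echo.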
As a solution of this non-uniqueness problem, various techniques have been proposed to preprocess the signals from the transmitting room 910 before their reproduction in the receiving room 930 in order to “decorrelate” them, and in this way avoid the previously discussed ambiguity.
The requirements for such preprocessing schemes can be summarized as follows:
- Convergence enhancement: the preprocessing may be able to decorrelate the input signals effectively to ensure rapid and correct AEC filter convergence even for highly correlated or monophonic (input) signals.
- Subjective sound quality: since the preprocessed signals are subsequently reproduced via loudspeakers and listened to by users 942 in the receiving room 930, the preprocessing may not introduce any objectionable artifacts for the type of audio signals used. This may be speech only for hands-free telecommunication applications, or any type of audio material, including music, for ASR input enhancement.
- Implementation complexity: in order to enable economic use of preprocessing in inexpensive consumer equipment, a very low computational and memory complexity is desirable.
A further differentiating characteristic of preprocessing techniques is the capability of generalizing to multi-channel operation, i.e., handling more than two reproduced channels of audio.
In the following, known preprocessing concepts for acoustic echo cancellation (AEC) will be described.
A first simple preprocessing method for stereo AEC was proposed by Benesty et al. (cf. [Ben98], [Mor01]), and achieves a decorrelation of the signals by adding non-linear distortions to the signals. The non-linear distortions are for example created by half-wave rectification, full-wave rectification or by taking a square root.
FIG. 10 shows a block schematic diagram and transfer functions of a preprocessing by means of a non-linearity. The graphic representation of FIG. 10 is designated in its entirety with 1000. A first graphic representation 1010 shows a block schematic diagram of a preprocessing unit using half-wave rectification units 1020, 1022. In other words, FIG. 10 illustrates a decorrelation of the signals x1(k), x2(k) for the common case of half-wave rectification.
A second graphical representation 1050 describes a transfer characteristic between input signals x1, x2 and output signals x1′, x2′. An abscissa 1060 describes the input values x1, x2. An ordinate 1062 describes the output values x1′, x2′. A first curve 1070, which comprises a sharp bend at the origin of the x1, x1′ coordinate system, reflects the relationship between the input value x1 and the corresponding output value x1′. A second curve 1072, which comprises a sharp bend at the origin of the x2, x2′ coordinate system, describes the transfer characteristic between the input signal x2 and the corresponding output signal x2′.
In other words, FIG. 10 illustrates the addition of non-linear distortion to the input signals x1, x2 to form the output signals x1′, x2′ for the common case of a half-wave rectification.
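As a rough sketch of this class of preprocessing, the following adds the rectified positive half-wave of one channel and the rectified negative half-wave of the other, each scaled by a distortion parameter. The opposite-polarity choice and the parameter name `alpha` are assumptions made here for illustration, following common descriptions of the scheme of [Ben98].

```python
import numpy as np

def halfwave_preprocess(x1, x2, alpha=0.5):
    """Decorrelate a stereo pair by adding half-wave rectified distortion.

    The positive half-wave of x1 and the negative half-wave of x2 are
    added back to the respective channel (scaled by alpha), so that the
    distortion components in the two channels are mutually uncorrelated.
    """
    x1_out = x1 + alpha * (x1 + np.abs(x1)) / 2.0   # add positive half-wave
    x2_out = x2 + alpha * (x2 - np.abs(x2)) / 2.0   # add negative half-wave
    return x1_out, x2_out
```

The parameter `alpha` trades off decorrelation strength against audibility of the distortion.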
While the described scheme (of adding non-linear distortions) possesses extremely low complexity, the introduced distortion products can become quite audible and objectionable, depending on the type of audio signal processed. Typically, the degradation in sound quality is considered acceptable for speech or communication applications, but not for high-quality applications for music signals.
A second known approach consists of the addition of uncorrelated noise to the signals (e.g. to the two input signals x1, x2). In [Gae98], this is achieved by perceptual audio coding/decoding of the signal, which introduces uncorrelated quantization distortion into each signal such that it is masked due to the noise shaping that is carried out inside the perceptual audio coder according to a psycho-acoustic model. In order to introduce uncorrelated noise into both channels, no joint stereo coding can be used.
A similar effect can be achieved by using a perceptually controlled watermarking scheme, e.g. based on spread spectrum modulation (cf. [Neu98]). In this case, uncorrelated spread-spectrum data signals are embedded into the original signal instead of quantization noise.
For both approaches described above, the use of an explicit psycho-acoustic model in conjunction with analysis/synthesis filterbanks is able to prevent audible distortions for arbitrary types of audio signals. However, the associated implementation complexity and the introduced delay render this approach economically unattractive for most applications.
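The basic idea of the noise-addition approach can be sketched without any psycho-acoustic model. The fixed signal-to-noise ratio used below is merely a placeholder for the time- and frequency-dependent masking threshold that a real system such as [Gae98] computes; the function name and parameters are illustrative.

```python
import numpy as np

def add_uncorrelated_noise(x1, x2, snr_db=30.0, seed=0):
    """Toy version of decorrelation by uncorrelated additive noise.

    Each channel receives independent white noise at a fixed level below
    the channel's own power. A real implementation instead shapes the
    noise according to a psycho-acoustic model so it remains inaudible.
    """
    rng = np.random.default_rng(seed)
    out = []
    for x in (x1, x2):
        noise = rng.standard_normal(len(x))          # independent per channel
        gain = np.sqrt(np.mean(x ** 2)) * 10 ** (-snr_db / 20.0)
        out.append(x + gain * noise)
    return out
```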
A third published approach to AEC preprocessing is to use complementary comb filtering on the two output signals, which suppresses complementary spectral parts within the signals and in this way breaks the correlation between them (cf. [Beb98]). However, this type of processing generally leads to unacceptable degradations of the stereo image perceived by human listeners, which makes the described processing unsuited for high-quality applications.
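A crude frequency-domain sketch of complementary comb filtering could look as follows; the FFT bin masks and the assumed band width are illustrative simplifications, not the filter structure of [Beb98].

```python
import numpy as np

def complementary_comb(x1, x2, band_width=32):
    """Crude complementary comb filtering via FFT bin masks.

    Alternating frequency bands are zeroed: x1 keeps the even-numbered
    bands and x2 the odd-numbered ones, so the two output spectra no
    longer overlap and the channels are decorrelated.
    """
    n = len(x1)
    bins = np.arange(n // 2 + 1)
    even_mask = (bins // band_width) % 2 == 0

    def apply(x, mask):
        X = np.fft.rfft(x)
        return np.fft.irfft(X * mask, n)

    return apply(x1, even_mask), apply(x2, ~even_mask)
```

Because each channel loses half of its spectrum, this directly illustrates why the approach degrades the perceived stereo image.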
Still other approaches employ time-varying time-delays or filtering which is switched on and off (cf. [Sug98], [Sug99]), or time-varying all-pass filtering (cf. [Ali05]) to produce a time-varying phase shift/signal delay between the two signals of a stereo AEC and thus “decorrelate” both signals.
U.S. Pat. Nos. 6,700,977 B2 and 6,577,731 B1 (also designated as [Sug98] and [Sug99]) describe preprocessing systems in which the output signal switches between the original signal and a time-delayed/filtered version of it. As a disadvantage, this switching process may introduce unintended artifacts into the audio signal.
U.S. Pat. No. 6,895,093 B1 (also designated as [Ali05]) describes a preprocessing system in which an all-pass preprocessor is randomly modulated in its all-pass filter variable.
While these types of preprocessing affect the audio signal rather unobtrusively compared to the other methods, it is difficult to achieve maximum decorrelation while guaranteeing that the introduction of a (varying) time/phase difference between the left and the right channel does not result in a perceived shift or alteration of the stereo image.
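A minimal sketch of such a randomly modulated all-pass preprocessor is given below, using a first-order section with a random-walk coefficient; the parameter names and values are illustrative assumptions, not those of [Ali05].

```python
import numpy as np

def time_varying_allpass(x, a_max=0.2, rate=0.001, seed=0):
    """First-order all-pass with a slowly, randomly modulated coefficient.

    For a fixed coefficient a, H(z) = (a + z^-1) / (1 + a z^-1) has unit
    magnitude at all frequencies, so varying a over time changes (mostly)
    the phase response, producing a time-varying delay with negligible
    amplitude distortion as long as a varies slowly.
    """
    rng = np.random.default_rng(seed)
    y = np.zeros_like(x)
    a = 0.0
    x_prev = y_prev = 0.0
    for k in range(len(x)):
        # random walk on the all-pass coefficient, clamped to +/- a_max
        a = np.clip(a + rate * rng.standard_normal(), -a_max, a_max)
        y[k] = a * x[k] + x_prev - a * y_prev
        x_prev, y_prev = x[k], y[k]
    return y
```

Applying this independently (e.g. with different seeds) to the left and right channel introduces a time-varying inter-channel phase difference; choosing `a_max` and `rate` small keeps the resulting shift of the stereo image below audibility, which is exactly the tuning difficulty noted above.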