In vehicles such as passenger vehicles or commercial vehicles, speech dialog systems are used to assist the driver or the passengers. Speech dialog systems serve, for example, to control electronic devices without the necessity of haptic operation. The electronic devices can, for example, comprise a vehicle computer or a multimedia system of the vehicle. Speech from the driver or the passengers is picked up by a hands-free microphone and supplied to a speech recognizer.
Usage of microphones in the vehicle interior for, e.g., voice operation, telephoning, or vehicle interior communication can potentially be impaired by an acoustic coupling of speaker output from the vehicle sound system. This can lead to recognition errors in the case of speech recognition, echoes at the remote end in the case of hands-free telephoning, and feedback in the case of vehicle interior communication. Depending on the usage, the consequences can be impaired communication, increased distraction, or even disruptive noise and echoes.
If, for example, audio signals are played back simultaneously and continuously by the vehicle's sound system during a spoken dialog in the vehicle, a part of the audio signals enters the hands-free microphone as acoustic feedback from the speakers and thereby disrupts speech recognition. The audio signals played back by the vehicle's sound system can, for example, comprise music, traffic messages, radio broadcasts, navigation system output, or the (artificial) speech of a speech dialog system. The interference with speech recognition can cause recognition errors that render the dialog inefficient and cause increased distraction from the task of driving. This can trigger dissatisfaction or irritation in the driver or passengers.
A simple solution for the aforementioned problem consists of muting the audio playback of, for example, a radio during the speech dialog or telephone call in the vehicle. However, the muting of audio playback is frequently felt by vehicle users to be disruptive and unnecessary. Moreover, important information from, for example, a navigation system can be missed. Furthermore, a vehicle user can feel compelled to react very quickly to the responses of the speech dialog system when the audio playback is muted during those responses.
Alternatively, the audio playback volume can be temporarily reduced during the speech dialog. For the speech recognizer, the extent of the interference from the audio playback is then indeed smaller, but it is generally still large enough that further cleanup of the microphone signal is required.
To a limited extent, the aforementioned couplings can also be reduced by design and acoustic measures. For example, microphones can be used with an appropriate directional characteristic, microphones and speakers in the vehicle interior can be appropriately arranged relative to each other, or acoustic conditions within the vehicle can be appropriately exploited.
However, since this is generally insufficient, signal processing components are employed to clean up the microphone signals. In this regard, the signal parts coupled by the speakers of the vehicle sound system into the microphones are estimated and removed from the microphone signals. Such methods are referred to as echo compensation or echo suppression. A widespread type of echo compensation is linear echo compensation.
With linear echo compensation, it is assumed that the microphones, speakers, and their respective amplifiers are linear systems and that the speaker noise parts coupled into a specific microphone therefore overlap linearly in the microphone signal. It is furthermore assumed that these speaker noise parts result from a linear convolution of the respective speaker source signal with a respective impulse response. Each of these impulse responses refers to a specific microphone/speaker pair and characterizes the entire electroacoustic transmission path from the speaker source signal to the microphone signal. The following variables, inter alia, are therefore reflected in such an impulse response:
- the frequency and phase response of the amplifier upstream from the speaker,
- the frequency and phase response of the speaker,
- the spatial radiation pattern of the speaker,
- the acoustic transmission path from the speaker to the microphone through the vehicle interior, including reflections, diffraction, scattering, absorption, etc.,
- the spatial reception pattern of the microphone, and
- the frequency and phase response of the microphone.
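The linear model described above can be illustrated with a short numerical sketch (NumPy; the impulse responses and signals are hypothetical and far shorter than real LEM impulse responses, which span thousands of taps at typical sampling rates):

```python
import numpy as np

# Hypothetical, heavily shortened LEM impulse response for one
# microphone/speaker pair (real ones span hundreds of milliseconds).
lem = np.array([0.0, 0.6, 0.3, -0.1, 0.05])

# Speaker source signal (stand-in for, e.g., music samples).
source = np.random.default_rng(0).standard_normal(100)

# Linear model: the speaker noise part picked up by the microphone is
# the linear convolution of the source signal with the LEM response.
echo = np.convolve(source, lem)[: len(source)]

# With several speakers, the coupled parts overlap linearly
# (superposition), each with its own LEM impulse response:
lem2 = np.array([0.0, 0.0, 0.4, 0.2, -0.05])
source2 = np.random.default_rng(1).standard_normal(100)
mic = echo + np.convolve(source2, lem2)[: len(source)]
```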
This impulse response is therefore also described as an LEM impulse response (loudspeaker enclosure microphone). It generally changes over time due to changes in the vehicle interior geometry (passengers and their movements, moving parts, load, etc.) as well as in the electroacoustic properties of the microphone and speakers (depending on the temperature, air pressure, humidity, age, etc.).
An algorithm for linear echo compensation adaptively estimates the LEM impulse response for every possible microphone/speaker pair. On the basis of the LEM impulse response, the coupled speaker noise parts in each microphone signal are then calculated and subtracted therefrom. The adaptation speed and effective echo suppression are limited and generally compete with each other.
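As a sketch under the assumptions above, such an adaptive estimate can be realized, for example, with the normalized least mean squares (NLMS) algorithm; the function name and parameters below are illustrative and not taken from any particular product:

```python
import numpy as np

def nlms_echo_canceller(ref, mic, taps=64, mu=0.5, eps=1e-8):
    """Sketch of linear echo compensation for one microphone/speaker
    pair via NLMS: adaptively estimate the LEM impulse response, then
    subtract the calculated speaker noise part from the mic signal.

    ref: speaker source (reference) signal
    mic: microphone signal containing the coupled echo
    Returns the cleaned signal and the estimated impulse response.
    """
    w = np.zeros(taps)            # estimate of the LEM impulse response
    x = np.zeros(taps)            # delay line of recent reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x = np.roll(x, 1)
        x[0] = ref[n]
        echo_hat = w @ x          # calculated coupled speaker noise part
        e = mic[n] - echo_hat     # subtract it from the microphone signal
        # Normalized update; mu sets the adaptation speed
        w = w + mu * e * x / (x @ x + eps)
        out[n] = e
    return out, w
```

The step size `mu` directly expresses the competition noted above: a larger step adapts faster to a changing LEM impulse response but leaves a larger residual echo.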
Various improved techniques for echo compensation or echo suppression are known in the prior art for, e.g., simplifying echo compensation and thereby reducing the required computation. In this regard, EP 1936939 A1 discloses echo compensation in which the microphone signal is divided into sub-band signals and subjected to undersampling. A reference audio signal is output by a speaker. The reference audio signal is also subjected to undersampling, and undersampled sub-band signals of the reference audio signal are saved. Moreover, echoes in the microphone sub-band signals are estimated, and the estimated echoes are removed from the microphone sub-band signals to obtain improved microphone sub-band signals.
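A heavily simplified sketch of the sub-band idea follows (one complex coefficient per frequency bin, estimated by batch least squares over frames). This is a toy illustration only: an implementation along the lines of EP 1936939 A1 would use analysis/synthesis filter banks with undersampling and adaptive per-band filters.

```python
import numpy as np

def subband_echo_estimate(mic, ref, frame=256):
    """Toy sub-band echo removal: split mic and reference signals into
    frequency bins, estimate one echo coefficient per bin across all
    frames, and subtract the estimated echoes per sub-band. Working per
    band at a reduced rate is what cuts the computation in practice."""
    n_frames = len(mic) // frame
    M = mic[: n_frames * frame].reshape(n_frames, frame)
    R = ref[: n_frames * frame].reshape(n_frames, frame)
    Mf = np.fft.rfft(M, axis=1)   # microphone sub-band signals
    Rf = np.fft.rfft(R, axis=1)   # reference sub-band signals
    # Per-bin least-squares estimate of the echo path across frames
    w = np.sum(np.conj(Rf) * Mf, axis=0) / (
        np.sum(np.abs(Rf) ** 2, axis=0) + 1e-12)
    Ef = Mf - Rf * w              # improved microphone sub-band signals
    return np.fft.irfft(Ef, n=frame, axis=1).ravel()
```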
In echo compensation, however, the fact that the audio signal to be output frequently has multiple channels is problematic. The multichannel audio signal can, for example, be a stereo signal or a surround signal in the vehicle.
In the event of a plurality of audio source signals from a plurality of speakers, the following problem occurs in addition to the increased computational complexity: owing to the correlations between the different audio source signals, the estimation problem is mathematically under-determined. As a consequence, when audio source signals set in suddenly, the effectiveness of echo compensation can be strongly reduced. It can even occur that the LEM estimation diverges, for example when changes in the surround sound pattern occur. This can happen, for example, when so-called phantom sound sources appear, disappear, or move within the surround panorama.
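The under-determination can be made concrete with a toy computation: for a phantom center source, both stereo channels carry the same signal, and the reference correlation matrix that a joint least-squares LEM estimate would have to invert becomes singular (illustrative NumPy sketch):

```python
import numpy as np

# Sketch: why correlated multichannel reference signals make the joint
# LEM estimation under-determined.
rng = np.random.default_rng(0)
s = rng.standard_normal(1000)

# Phantom center source: both stereo channels carry the same signal.
left, right = s, s
X = np.stack([left, right], axis=1)
R = X.T @ X / len(s)              # 2x2 reference correlation matrix

# The matrix is singular: infinitely many (h_left, h_right) pairs
# explain the same microphone signal, so the estimate is not unique.
print(np.linalg.matrix_rank(R))   # 1, not 2
```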
Various approaches exist for circumventing this which, however, either lead to audible distortions or are very computation-intensive (watermarking, Kalman filter solutions).
In addition, an echo suppressor, for example, is known in this context from DE 102008027848 A1 that works together with a sound output device having a multichannel audio unit. The sound output device sends out output sound signals as analog signals from multiple channels through a plurality of speakers. A microphone detects an outside sound and generates an input sound signal as an analog signal. The outside sound comprises the output sound signals as an echo. The echo suppressor possesses an echo deletion function to remove the echo from the input sound signal. For this, the echo suppressor receives the output sound signals from the sound output device. Such a solution for compensating multichannel acoustic echo sources is, however, very technically complex and requires much computing power. Furthermore, there are no explicit solutions for numbers of channels that exceed two.
Another option is an improved separation of speech signals from general interfering signals. The general interfering signals can also comprise multichannel audio playback. This is considered, for example, in DE 102009051508 A1. To reduce interfering signals in speech recognition, a microphone array is installed instead of a single microphone. A multichannel speech signal is recorded by the microphone array and is supplied to an echo compensation unit instead of a single speech signal. Before being entered into the echo compensation unit, the multichannel speech signal recorded by the microphone array is processed further, by a delayed summing of the signals, in a microphone signal processing unit downstream from the microphone array. This separates the signals of the authorized speakers, and all other speaker signals and interfering signals are reduced. In addition, the echo compensation unit evaluates the propagation times of the different channels of the multichannel speech signal and removes all signal parts that, according to their propagation time, do not originate from the location of the authorized speaker. The use of a microphone array or a plurality of microphones, however, increases cost, requires more installation space, and demands powerful computing resources.
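The delayed summing (delay-and-sum beamforming) described above can be sketched as follows, assuming the per-microphone delays of the authorized speaker's signal are known (toy example; `np.roll` wraps around at the array edges, which is acceptable for this sketch):

```python
import numpy as np

def delay_and_sum(mics, delays_samples):
    """Sketch of delay-and-sum beamforming: each microphone channel is
    advanced by its known delay so the desired speaker's signal aligns
    across channels, then the channels are averaged. Aligned components
    add coherently; signals arriving from other locations, with other
    delay patterns, add incoherently and are attenuated."""
    out = np.zeros(mics.shape[1])
    for ch, d in zip(mics, delays_samples):
        out += np.roll(ch, -d)    # undo this channel's propagation delay
    return out / len(mics)
```

For a signal arriving at three microphones with delays of 0, 2, and 4 samples, steering with those same delays reconstructs the signal at full amplitude, while differently delayed interferers are averaged down.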