As anyone who has ridden in a minivan, sedan or sport utility vehicle will know, communication among the passengers in the cabin of such a vehicle is difficult. For example, it is frequently difficult for words spoken by a passenger in a back seat to be heard and understood by the driver, or vice versa. One reason is the large amount of ambient noise caused by the motor, the wind, other vehicles, stationary structures passed by, etc., some of which is caused by the movement of the cabin and some of which occurs even when the cabin is stationary. Another reason is the cabin acoustics, which may undesirably amplify or damp out different sounds. Even in relatively quiet vehicles, communication between passengers is a problem due to the distance between passengers and the intentional use of sound-absorbing materials to quiet the cabin interior. The communication problem may be compounded by the simultaneous use of high-fidelity stereo systems for entertainment.
To amplify the spoken voice, it may be picked up by a microphone and played back by a loudspeaker. However, if the spoken voice is simply picked up and played back, there will be a positive feedback loop that results from the output of the loudspeaker being picked up again by the microphone and added to the spoken voice to be once again output at the loudspeaker. When the output of the loudspeaker is substantially picked up by a microphone, the loudspeaker and the microphone are said to be acoustically coupled. To avoid an echo due to the reproduced voice itself, an echo cancellation apparatus, such as an acoustic echo cancellation apparatus, can be coupled between the microphone and the loudspeaker to remove the portion of the picked-up signal corresponding to the voice component output by the loudspeaker. This is possible because the audio signal at the microphone corresponding to the original spoken voice is theoretically highly correlated to the audio signal at the microphone corresponding to the reproduced voice component in the output of the loudspeaker. One advantageous example of such an acoustic echo cancellation apparatus is described in commonly-assigned U.S. patent application Ser. No. 08/868,212. Another advantageous acoustic echo cancellation apparatus is described hereinbelow.
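The positive feedback loop described above can be illustrated with a minimal Python sketch. It assumes a deliberately simplified model that is not taken from the source: the acoustic coupling between loudspeaker and microphone is reduced to a single hypothetical loop gain and a pure delay, and the function name simulate_feedback is chosen for illustration only.

```python
def simulate_feedback(voice, loop_gain, delay):
    """Simulate a microphone/loudspeaker loop with no echo cancellation.

    Assumed toy model: out[k] = voice[k] + loop_gain * out[k - delay],
    i.e. the loudspeaker output re-enters the microphone after `delay`
    samples, scaled by `loop_gain`.
    """
    if delay < 1:
        raise ValueError("delay must be at least one sample")
    out = []
    for k, sample in enumerate(voice):
        # Loudspeaker output picked up again by the microphone.
        fed_back = out[k - delay] if k >= delay else 0.0
        out.append(sample + loop_gain * fed_back)
    return out


if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0, 0.0]
    print(simulate_feedback(impulse, 0.5, 1))  # decaying echo
    print(simulate_feedback(impulse, 2.0, 1))  # growing, unstable loop
```

In this toy model, a loop gain below one in magnitude yields a decaying echo, while a loop gain of one or more makes the output grow without bound, which is the behavior an echo cancellation apparatus is meant to prevent.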
On the other hand, any reproduced noise components may not be so highly correlated and need to be removed by other means. However, while systems for noise reduction generally are well known, enhancing speech intelligibility in a noisy cabin environment poses a challenging problem due to constraints peculiar to this environment. It has been determined in developing the present invention that the challenges arise principally, though not exclusively, from the following five causes. First, the speech and noise occupy the same bandwidth, and therefore cannot be separated by band-limited filters. Second, different people speak differently, and therefore it is harder to properly identify the speech components in the mixed signal. Third, the noise characteristics vary rapidly and unpredictably, due to the changing sources of noise as the vehicle moves. Fourth, the speech signal is not stationary, and therefore constant adaptation to its characteristics is required. Fifth, there are psycho-acoustic limits on speech quality, as will be discussed further below.
One prior art approach to speech intelligibility enhancement is filtering. As noted above, since speech and noise occupy the same bandwidth, simple band-limited filtering will not suffice. That is, the overlap of speech and noise in the same frequency band means that filtering based on frequency separation will not work. Instead, filtering may be based on the relative orthogonality between speech and noise waveforms. However, the highly non-stationary nature of speech necessitates adaptation to continuously estimate a filter to subtract the noise. The filter will also depend on the noise characteristics, which in this environment are time-varying on a slower scale than speech and depend on such factors as vehicle speed, road surface and weather.
FIG. 1 is a simplified block diagram of a conventional cabin communication system (CCS) 100 using only a microphone 102 and a loudspeaker 104. As shown in the figure, an echo canceller 106 and a conventional speech enhancement filter (SEF) 108 are connected between the microphone 102 and loudspeaker 104. A summer 110 subtracts the output of the echo canceller 106 from the input of the microphone 102, and the result is input to the SEF 108 and used as a control signal therefor. The output of the SEF 108, which is the output of the loudspeaker 104, is the input to the echo canceller 106. In the echo canceller 106, on-line identification of the transfer function of the acoustic path (including the loudspeaker 104 and the microphone 102) is performed, and the signal contribution from the acoustic path is subtracted.
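The on-line identification and subtraction performed by an echo canceller of this kind can be sketched in Python as follows. The source does not name an adaptation algorithm; a normalized least-mean-squares (NLMS) filter is assumed here purely for illustration, and the function name, tap count and step size are hypothetical.

```python
def nlms_echo_cancel(loudspeaker, microphone, taps=4, mu=0.5, eps=1e-8):
    """Adaptively estimate the acoustic path and subtract the echo.

    loudspeaker: samples driving the loudspeaker (reference signal)
    microphone:  samples picked up by the microphone
    Returns (residual, weights): the echo-reduced signal and the
    estimated impulse response of the acoustic path.
    NLMS is an assumed stand-in, not the method of the source.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # recent loudspeaker samples, newest first
    residual = []
    for x, d in zip(loudspeaker, microphone):
        history = [x] + history[:-1]
        # Echo predicted by the current model of the acoustic path.
        echo_est = sum(w * h for w, h in zip(weights, history))
        # Role of the summer: subtract the predicted echo.
        err = d - echo_est
        # Normalized LMS update of the path estimate.
        norm = eps + sum(h * h for h in history)
        weights = [w + mu * err * h / norm for w, h in zip(weights, history)]
        residual.append(err)
    return residual, weights
```

When the microphone signal contains only the echo, the residual tends toward zero as the weights approach the true path; when near-end speech is also present, it remains in the residual, which is the signal that would be passed on to the SEF.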
In a conventional acoustic echo and noise cancellation system, the two problems of removing echoes and removing noise are addressed separately and the loss in performance resulting from coupling of the adaptive SEF and the adaptive echo canceller is usually insignificant. This is because speech and noise are correlated only over a relatively short period of time. Therefore, the signal coming out of the loudspeaker can be made to be uncorrelated with the signal received directly at the microphone by adding adequate delay into the SEF. This ensures robust identification of the echo canceller and in this way the problems can be completely decoupled. The delay does not pose a problem in large enclosures, public address systems and telecommunication systems such as automobile hands-free telephones. However, it has been recognized in developing the present invention that the acoustics of relatively smaller movable cabins dictate that processing be completed in a relatively short time to prevent the perception of an echo from direct and reproduced paths. In other words, the reproduced voice output from the loudspeaker should be heard by the listener at substantially the same time as the original voice from the speaker is heard. In particular, in the cabin of a moving vehicle, the acoustic paths are such that an addition of delay beyond approximately 20 ms will sound like an echo, with one version coming from the direct path and another from the loudspeaker. This puts a limit on the total processing time, which means a limit both on the amount of delay and on the length of the signal that can be processed.
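The practical meaning of the approximately 20 ms limit can be made concrete by converting it into a number of samples. The following Python sketch does only that arithmetic; the sampling rates used in the example are assumptions chosen for illustration and are not stated in the source.

```python
def max_processing_samples(budget_ms, sample_rate_hz):
    """Number of whole samples that fit in a delay budget.

    budget_ms:      total allowed delay in milliseconds (integer)
    sample_rate_hz: sampling rate in hertz (integer, assumed value)
    """
    if budget_ms < 0 or sample_rate_hz <= 0:
        raise ValueError("budget must be >= 0 and sample rate must be > 0")
    return (budget_ms * sample_rate_hz) // 1000


if __name__ == "__main__":
    # Assumed sampling rates of 8 kHz and 16 kHz.
    print(max_processing_samples(20, 8000))   # 160
    print(max_processing_samples(20, 16000))  # 320
```

Under these assumed rates, all buffering, filtering and any added decorrelation delay together would have to fit within a few hundred samples, which is why both the delay and the length of the processed signal are constrained.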
Thus, conventional adaptive filtering applied to a cabin communication system may reduce voice quality by introducing distortion or by creating artifacts such as tones or echoes. If the echo cancellation process is coupled with the speech extraction filter, it becomes difficult to accurately estimate the acoustic transfer functions, and this in turn leads to poor estimates of the noise spectrum and consequently poor speech intelligibility at the loudspeaker. An advantageous approach to overcoming this problem is disclosed below, as are the structure and operation of an advantageous adaptive SEF.
Several adaptive filters are known for use in the task of speech intelligibility enhancement. These filters can be broadly classified into two main categories: (1) filters based on a Wiener filtering approach and (2) filters based on the method of spectral subtraction. Two other approaches, i.e. Kalman filtering and H-infinity filtering, have also been tried, but will not be discussed further herein.
Spectral subtraction has been subjected to rigorous analysis, and it is well known, at least as it currently stands, not to be suitable for low SNR (signal-to-noise) environments because it results in “musical tone” artifacts and in unacceptable degradation in speech quality. The movable cabin in which the present invention is intended to be used is just such a low SNR environment.
Accordingly, the present invention is an improvement on Wiener filtering, which has been widely applied for speech enhancement in noisy environments. The Wiener filtering technique is statistical in nature, i.e. it constructs the optimal linear estimator (in the sense of minimizing the expected squared error) of an unknown desired stationary signal, n, from a noisy observation, y, which is also stationary. The optimal linear estimator is in the form of a convolution operator in the time domain, which is readily converted to a multiplication in the frequency domain. In the context of a noisy speech signal, the Wiener filter can be applied to estimate noise, and then the resulting estimate can be subtracted from the noisy speech to give an estimate for the speech signal.
To be concrete, let y be the noisy speech signal and let the noise be n. Then Wiener filtering requires the solution, h, to the following Wiener-Hopf equation:
$$R_{ny}(t) = \sum_{s=-\infty}^{\infty} h(s)\, R_{yy}(t-s) \qquad (1)$$
Here, Rny is the cross-correlation matrix of the noise-only signal with the noisy speech, Ryy is the auto-correlation matrix of the noisy speech, and h is the Wiener filter.
Although this approach is mathematically correct, it is not immediately amenable to implementation. First, since speech and noise are uncorrelated, the cross-correlation between n and y, i.e. Rny, is the same as the auto-correlation of the noise, Rnn. Second, both noise and speech are non-stationary, and therefore the infinite-length cross-correlation of the solution of Equation 1 is not useful. Obviously, infinite data is not available, and furthermore the time constraint of echo avoidance applies. Therefore, the following truncated equation is solved instead:
$$R_{nn}(t) = \sum_{s=1-m}^{m} h(s)\, R_{yy}(t-s) \qquad (2)$$
Here, m is the length of the data window.
This equation can be readily solved in the frequency domain by taking Fourier Transforms, as follows:
$$S_{nn}(f) = H(f)\, S_{yy}(f) \qquad (3)$$
Here, Snn and Syy are the Fourier Transforms, or equivalently the power spectral densities (PSDs), of the noise and the noisy speech signal, respectively. The auto-correlation of the noise can only be estimated, since there is no noise-only signal.
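The frequency-domain relation of Equation 3 can be illustrated with a short Python sketch that solves for H(f) bin by bin and then subtracts the estimated noise from the noisy spectrum. This is a minimal sketch under stated assumptions, not the filter of the invention: the PSD estimates are taken as given inputs, the function names are hypothetical, and the division floor and the clipping of the gain to the range 0 to 1 are safeguards added here that do not appear in the source.

```python
def wiener_noise_gain(s_nn, s_yy, floor=1e-12):
    """Solve S_nn(f) = H(f) * S_yy(f) for H(f) in each frequency bin.

    s_nn: estimated noise PSD per bin
    s_yy: PSD of the noisy speech per bin
    The floor and the clipping to [0, 1] are assumed safeguards.
    """
    gains = []
    for n_pow, y_pow in zip(s_nn, s_yy):
        h = n_pow / max(y_pow, floor)
        gains.append(min(max(h, 0.0), 1.0))
    return gains


def estimate_speech_spectrum(y_spec, s_nn, s_yy):
    """Estimate the noise as H(f) * Y(f) and subtract it from Y(f)."""
    gains = wiener_noise_gain(s_nn, s_yy)
    return [y - h * y for y, h in zip(y_spec, gains)]


if __name__ == "__main__":
    noisy = [2 + 0j, 1 + 1j, 3 + 0j]   # spectrum of one frame
    noise_psd = [1.0, 2.0, 0.0]        # assumed noise estimate
    noisy_psd = [4.0, 2.0, 0.0]
    print(estimate_speech_spectrum(noisy, noise_psd, noisy_psd))
```

A bin in which the noise estimate accounts for all of the observed power is removed entirely, while a bin with no estimated noise passes through unchanged.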
However, there are problems in this approach, which holds only in an approximate sense. First, the statistics of noise have to be continuously updated. Second, this approach fails to take into account the psycho-acoustics of the human ear, which is extremely sensitive to processing artifacts at even extremely low decibel levels. Neither does this approach take into account the anti-causal nature of speech or the relative stationarity of the noise. While several existing Wiener filtering techniques make use of ad hoc, non-linear processing of the Wiener filter coefficients in the hope of maintaining and improving speech intelligibility, these techniques do not work well and do not effectively address the practical problem of interfacing a Wiener filtering technique with the psycho-acoustics of speech.
As noted above, another aspect of the present invention is directed to the structure and operation of an advantageous adaptive acoustic echo canceller (AEC) for use with an SEF as disclosed herein. Of course, other adaptive SEFs may be used in the present invention provided they cooperate with the advantageous echo canceller in the manner disclosed below.
To realistically design a cabin communication system (CCS) that is appropriate for a relatively small, movable cabin, it has been recognized that the echo cancellation has to be adaptive because the acoustics of a cabin change due to temperature, humidity and passenger movement. It has also been recognized that noise characteristics are time-varying, depending on several factors such as road and wind conditions, and therefore the SEF also has to continuously adapt to the changing conditions. A CCS couples the echo cancellation process with the SEF. The present invention differs from the prior art in addressing the coupled on-line identification and control problem in a closed loop.
There are other aspects of the present invention that contribute to the improved functioning of the CCS. One such aspect relates to an improved automatic gain control (AGC) which, in accordance with the present invention, controls amplification volume and related functions in the CCS, including the generation of appropriate gain control signals for overall gain and a dither gain and the prevention of amplification of undesirable transient signals.
It is well known that it is necessary for customer comfort, convenience and safety to control the volume of amplification of certain audio signals in audio communication systems such as the CCS. Such volume control should have an automatic component, although a user's manual control component is also desirable. The prior art recognizes that any microphone in a cabin will detect not only the ambient noise, but also sounds purposefully introduced into the cabin. Such sounds include, for example, sounds from the entertainment system (radio, CD player or even movie soundtracks) and passengers' speech. These sounds interfere with the microphone's receiving just a noise signal for accurate noise estimation.
Prior art AGC systems failed to deal with these additional sounds adequately. In particular, prior art AGC systems would either ignore these sounds or attempt to compensate for the sounds. In contrast, the present invention provides an advantageous way to supply a noise signal to be used by the AGC system that has had these additional noises eliminated therefrom.
A further aspect of the present invention is directed to an improved user interface installed in the cabin for improving the ease and flexibility of the CCS. In particular, while the CCS is intended to incorporate sufficient automatic control to operate satisfactorily once the initial settings are made, it is of course desirable to incorporate various manual controls to be operated by the driver and passengers to customize its operation. In this aspect of the present invention, the user interface enables customized use of the plural microphones and loudspeakers.