It is known to filter noise from a signal in order to improve the quality of a vocal signal before its interpretation. FIG. 1 shows the general principle of command signal processing, in which the noise is filtered before the vocal signal is presented to the automatic speech recognition system. The vocal signal s(n) is disturbed by a noise signal d(n), and the resulting signal is y(n). This signal y(n) enters a pre-processing unit 2, which improves the signal quality by filtering the noise. The filtered signal s(n) is provided as output and is presented to an automatic speech recognition module 63. However, in most situations, because the noise consists of multiple heterogeneous sources that are difficult to model, it is often very difficult, and even impossible, to define a filter that effectively reduces the noise components. Furthermore, an inappropriate determination of the filter, based on wrong noise models or an inaccurate estimation, can even partially destroy the vocal signal, sometimes making the pre-processed result worse than if nothing had been done.
Several solutions have been proposed for improving the vocal signal quality. For example, it is known that the use of a microphone array combined with beamforming control increases the gain of the received signal in particular directions and makes a system less sensitive to directional noise and interference. However, to be efficient, those systems can become costly because of the microphone array, and they are not easy to integrate given the constraints on the interior aesthetics of vehicles. Furthermore, the performance of such systems remains very limited because directional interferences inside vehicles are not the major disturbances, so that those systems can only partially solve the problem or can only solve it in a very limited number of configurations.
Among the other proposed solutions, noise or interference reduction is based on the addition of a noise reference sensor to obtain a reference signal of the noise. For example, it is possible to place a first microphone close to the driver and a second microphone far from him. The first microphone picks up the signal of interest, meaning the vocal command, while the second microphone, in principle, senses only the noise signal. However, in practice, this solution is not satisfactory because it is very difficult to obtain a signal representative of the local noise around the speaker from a microphone that is far from the speaker/driver. If the microphone is far from the speaker, only an approximate reference of the noise is generated, and this approximate noise reference is unusable and can even be inappropriate for the system, as explained above. If, on the other hand, the second microphone is placed too close to the speaker, the noise component in the received signal can be more representative of the local noise around the speaker, but it would be impossible to avoid a contribution and a mixing (or leakage) of the signal of interest into the signal of the second microphone. This could lead to a partial or even total destruction of the signal of interest because, in this case, the signal of interest will itself be considered as a noise component and will be suppressed by the noise subtraction process.
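The leakage effect described above can be illustrated with a minimal sketch. The source does not specify how the noise subtraction is performed; the sketch assumes a least-mean-squares (LMS) adaptive noise canceller, which is one common way to realize reference-based subtraction. The function name lms_cancel, the filter length, the step size, the synthetic signals and the leakage factor of 1.0 (a deliberately extreme case in which the second microphone picks up the speech as strongly as the first) are all hypothetical choices made for illustration.

```python
import math
import random

def lms_cancel(primary, reference, taps=4, mu=0.01):
    """LMS adaptive noise canceller (assumed technique, not from the source).

    Subtracts from `primary` whatever can be linearly predicted from
    `reference`; the prediction error is the cleaned output.
    """
    w = [0.0] * taps            # adaptive filter weights
    buf = [0.0] * taps          # most recent reference samples, newest first
    out = []
    for y, r in zip(primary, reference):
        buf = [r] + buf[:-1]
        noise_est = sum(wi * bi for wi, bi in zip(w, buf))
        e = y - noise_est       # output sample = primary minus estimated noise
        w = [wi + 2.0 * mu * e * bi for wi, bi in zip(w, buf)]
        out.append(e)
    return out

def power(x):
    return sum(v * v for v in x) / len(x)

rng = random.Random(0)
n = 4000
speech = [math.sin(2 * math.pi * 0.01 * k) for k in range(n)]  # stand-in for s(n)
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]                # d(n)
primary = [s + d for s, d in zip(speech, noise)]               # y(n) = s(n) + d(n)

def tail_error(reference):
    """Power of (output - speech) over the second half, after convergence."""
    out = lms_cancel(primary, reference)
    half = n // 2
    return power([o - s for o, s in zip(out[half:], speech[half:])])

# Ideal reference: the second microphone senses only the noise.
err_clean = tail_error(noise)

# Leaky reference: the second microphone also picks up the speech (leak = 1.0).
leak = 1.0
err_leaky = tail_error([d + leak * s for d, s in zip(noise, speech)])

print("error with clean reference:", round(err_clean, 3))
print("error with leaky reference:", round(err_leaky, 3))
```

With the ideal reference, the output stays close to the speech stand-in. With the leaky reference, the canceller treats the speech as part of the noise and removes it together with the noise, so the error approaches the full power of the speech signal, which corresponds to the total destruction case mentioned above.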
In other proposed solutions for solving this problem, architectures exist which integrate non-acoustic sensors that can be considered as a means to define the noise reference. For example, Japanese patent JP2244099, assigned to AISIN SEIKI Company, illustrates this with the use of the electric signal delivered to the loudspeaker of the audio system as a noise reference source. The advantage of such sensors is that they avoid the leakage of the signal of interest into the noise reference because, in this case, the reference signal is no longer an acoustic signal containing a contribution of the acoustic signal of interest. For example, a vibration phenomenon can be detected. In general, two types of sensors can be distinguished: sensors in contact with the speaker's body and sensors without contact with the speaker's body. The first type is obviously very constraining for application to a vehicle driver and is not of interest here. The second type seems more appropriate for the envisaged applications and will be considered in the description of the invention.
Another possibility for filtering the noise signal consists of estimating the noise component before the beginning of the reception of the speech signal and subtracting it from the received signal during the entire period of reception of the mixed signal composed of the signal of interest and the noise. Under these conditions, in order to perform this operation reliably, it is necessary to use a voice activity detector in order to determine the speech period and subtract the estimated noise signal from the received signal. The estimation of the noise is obtained just before the beginning of the speech signal. To do so, the speech signal is assumed to have much greater energy than the surrounding noise signal. Hence, by applying a threshold to the received signal energy, the speech signal reception period can be detected and the previously estimated noise can be suppressed according to the principle previously described. However, this detection principle based on an energy threshold is not robust, for example, in the case of fricative consonants. Furthermore, the principal and implicit assumption of such a process is that the noise does not evolve during the reception of the speech signal. However, for the applications concerned, the environment of the vehicle imposes other constraints which lead in general to an environment where the noise and interferences are not constant and can vary with the vehicle speed (acceleration or deceleration), the output of the audio system, the activation of the wipers, the blinkers, etc. One can easily understand that these implicit and restrictive assumptions do not hold for the cases considered. Therefore, it is necessary to take this noise variation into account during the reception of the speech signal and to realize a continuous noise reduction that remains operational even during the speech signal reception, without any stationarity assumption concerning the noise component.
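The two weaknesses discussed above can be sketched as follows. This is an illustration under stated assumptions, not the method of any cited system: the frame length, the threshold factor, the synthetic signals and the helper names frame_energies and energy_vad are hypothetical, the fricative is modeled simply as a weak noise-like sound, and the noise subtraction is simplified to a subtraction of frame energies.

```python
import math
import random

FRAME = 200  # samples per frame (hypothetical value)

def frame_energies(x, frame_len=FRAME):
    """Mean energy of each non-overlapping frame."""
    return [sum(v * v for v in x[i:i + frame_len]) / frame_len
            for i in range(0, len(x) - frame_len + 1, frame_len)]

def energy_vad(energies, lead_in_frames, factor=4.0):
    """Flag frames whose energy exceeds `factor` times the noise floor,
    estimated from the first `lead_in_frames` (assumed to be speech-free)."""
    noise_floor = sum(energies[:lead_in_frames]) / lead_in_frames
    return [e > factor * noise_floor for e in energies], noise_floor

rng = random.Random(1)

def gaussian_noise(n, std):
    return [rng.gauss(0.0, std) for _ in range(n)]

# Synthetic y(n): 5 noise-only frames, 5 vowel-like frames, 5 fricative-like frames.
n_seg = 5 * FRAME
background = gaussian_noise(3 * n_seg, 0.1)
vowel = [math.sin(2 * math.pi * 0.05 * k) for k in range(n_seg)]
fricative = gaussian_noise(n_seg, 0.1)   # weak, noise-like speech sound
speech = [0.0] * n_seg + vowel + fricative
y = [s + d for s, d in zip(speech, background)]

flags, noise_floor = energy_vad(frame_energies(y), lead_in_frames=5)
vowel_detected = sum(flags[5:10])        # frames flagged as speech in the vowel
fricative_detected = sum(flags[10:15])   # frames flagged as speech in the fricative

# Frozen noise estimate: the background gets louder after the estimate was taken
# (e.g. acceleration), yet the old noise floor is still the one being subtracted.
louder = frame_energies(gaussian_noise(n_seg, 0.2))
residual = [max(e - noise_floor, 0.0) for e in louder]
mean_residual = sum(residual) / len(residual)

print("vowel frames detected:    ", vowel_detected, "of 5")
print("fricative frames detected:", fricative_detected, "of 5")
print("noise energy left after subtraction:", round(mean_residual, 3))
```

In this setup the vowel-like frames clear the threshold easily, while the fricative-like frames, whose energy is only about twice the noise floor, stay below it and are treated as noise. The last part shows that once the background noise has grown, subtracting the noise level estimated before the speech leaves most of the new noise in place.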