First, the current status of in-vehicle speech recognition systems, which constitutes the background of the present invention, will be described. In-vehicle speech recognition has reached a level of practical use, where it is applied mainly to the inputting of commands, addresses and the like in a car navigation system. In reality, however, CD music needs to be stopped, or passengers need to refrain from talking, while speech recognition is being performed. In addition, speech recognition cannot be performed while a crossing bell is sounding at a nearby railroad crossing. Consequently, at the present level of development, many restrictions are still imposed on the use of the in-vehicle speech recognition system, and the system may be regarded as still being technically in a transition period.
Noise robustness in the in-vehicle speech recognition system may be expected to be achieved step by step through technological development ladders 1 to 5, as shown in FIG. 11. In development ladder 1, the in-vehicle speech recognition system is robust against stationary driving noise only. In development ladder 2, the system will be robust against noise in which the stationary driving noise is mixed with speech and sounds coming from a CD player or a radio (hereinafter referred to as a “CD/radio”). In development ladder 3, the system will be robust against noise in which the stationary driving noise and non-stationary environment noise are mixed with each other. The non-stationary environment noise includes noise made while the car runs on a bumpy road, noise made by other cars passing by, noise made by the windshield wipers in operation, and the like. In development ladder 4, the system will be robust against noise in which the stationary driving noise, the non-stationary environment noise and the sounds coming from the CD/radio are mixed with one another. In development ladder 5, the system will be robust against noise in which the stationary driving noise, the non-stationary environment noise, the sounds coming from the CD/radio, and speech uttered by passengers are mixed with one another. The current technological level is at development ladder 1. Intensive studies are being carried out in order to reach development ladders 2 and 3.
In development ladder 1, a multi-style training technique and a spectral subtraction technique have contributed greatly to enhancing noise robustness. The multi-style training technique uses sound in which various noises are superimposed on speech uttered by humans for the adaptive training of an acoustic model. In addition, stationary noise components are subtracted from an observed signal by use of the spectral subtraction technique, both when speech recognition is performed and when an acoustic model is adaptively trained. These techniques have remarkably enhanced noise robustness. As a consequence, the speech recognition system has reached the level of practical use as far as stationary driving noise is concerned.
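As a rough illustration of the spectral subtraction idea mentioned above, the following Python sketch averages the power spectra of noise-only frames and subtracts the result, bin by bin, from an observed power spectrum. The function names, the subtraction weight `alpha` and the floor factor `beta` are assumptions introduced here for illustration; they are not taken from this document, and an actual system would obtain the power spectra from a short-time Fourier analysis that is omitted here.

```python
def estimate_noise_power(noise_frames):
    """Average the per-bin power of frames assumed to contain only stationary noise."""
    n_frames = len(noise_frames)
    n_bins = len(noise_frames[0])
    return [sum(frame[k] for frame in noise_frames) / n_frames for k in range(n_bins)]


def spectral_subtract(observed_power, noise_power, alpha=1.0, beta=0.01):
    """Subtract an estimated noise power spectrum from one observed frame.

    alpha (assumed) scales the amount of noise subtracted.
    beta (assumed) sets a floor, as a fraction of the observed power,
    so that the result never becomes negative.
    """
    if len(observed_power) != len(noise_power):
        raise ValueError("spectra must have the same number of bins")
    cleaned = []
    for obs, noise in zip(observed_power, noise_power):
        diff = obs - alpha * noise
        cleaned.append(max(diff, beta * obs))
    return cleaned
```

The floor is one common way of avoiding negative power values after subtraction; other choices are possible.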
The sounds coming from the CD/radio to be treated in development ladder 2 are non-stationary noise, as is the non-stationary environment noise to be treated in development ladder 3. However, the sounds coming from the CD/radio are different from the non-stationary environment noise in that they come from specific in-vehicle appliances. For this reason, electric signals which have not yet been converted to sounds can be used as reference signals in order to suppress noise. A system for suppressing noise by use of such electric signals is termed an echo canceller. It is known that the echo canceller exhibits high performance in a silent environment where no noise exists except for sounds from the CD/radio. For this reason, it is expected that both the echo canceller and the spectral subtraction technique will be used in development ladder 2 of the in-vehicle speech recognition system. It is known, however, that the performance of a conventional echo canceller is degraded in the compartment of a moving car. This is because noise, including driving noise irrelevant to the reference signals, is observed at the same time as the reference signals are observed.
FIG. 12 is a block diagram showing a configuration of a conventional noise reduction device using only a conventional echo canceller. In general, the term echo canceller means an echo canceller 40 implemented in the time domain. At this point, suppose for convenience of explanation that neither speech s uttered by a speaker nor background noise n exists. Let r and x respectively denote a sound signal of the CD/radio 2 to be inputted to a loudspeaker 3 and an echo signal to be received by a microphone 1. By use of an impulse response g in the vehicle compartment, the sound signal and the echo signal are related to each other as follows:
x = r*g
where * denotes a convolution calculation.
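The relation between the sound signal r, the impulse response g and the echo signal x can be illustrated with a minimal Python sketch of a discrete convolution. The function name `convolve` and the sample values used below are hypothetical and serve only as an example.

```python
def convolve(r, g):
    """Full discrete convolution of reference signal r with impulse response g."""
    out = [0.0] * (len(r) + len(g) - 1)
    for i, r_i in enumerate(r):
        for j, g_j in enumerate(g):
            # Each reference sample contributes a scaled, delayed copy of g.
            out[i + j] += r_i * g_j
    return out


# Hypothetical values: a three-sample reference and a two-tap impulse response.
x = convolve([1.0, 2.0, 3.0], [1.0, 0.5])
print(x)  # [1.0, 2.5, 4.0, 1.5]
```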
In this respect, the echo canceller 40 can cancel the echo signal x through the following process. An estimated value h of the impulse response g is figured out in an adaptive filter 42. Thus, an estimated echo signal r*h is generated. In a subtraction unit 43, the estimated echo signal r*h is subtracted from a signal In of sound received by the microphone 1. Thereby, the echo signal x can be cancelled. In general, the filter coefficient h is learned in a non-speech segment by use of a least-mean-square (LMS) algorithm or a normalized least-mean-square (N-LMS) algorithm. The echo canceller takes both phase and amplitude into consideration. For this reason, the echo canceller can be expected to deliver high performance in a silent environment. It is known, however, that the performance decreases when the environment noise around the echo canceller is high.
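The process described above can be sketched in Python as a normalized LMS adaptive filter. This is only a minimal sketch under stated assumptions: the function name, the number of taps `num_taps`, the step size `mu` and the regularization constant `eps` are hypothetical, and the filter adapts on every sample on the assumption that the input contains neither speech nor background noise, whereas the text above restricts learning to non-speech segments.

```python
def nlms_echo_cancel(reference, mic, num_taps=4, mu=0.5, eps=1e-8):
    """Time-domain echo cancellation with a normalized LMS adaptive filter.

    reference: samples r sent to the loudspeaker.
    mic: samples In received by the microphone (same length as reference).
    Returns (errors, h): the signal after subtraction and the estimated
    impulse response h.
    """
    h = [0.0] * num_taps
    buf = [0.0] * num_taps  # most recent reference samples, newest first
    errors = []
    for r_n, in_n in zip(reference, mic):
        buf = [r_n] + buf[:-1]
        echo_est = sum(h_k * r_k for h_k, r_k in zip(h, buf))  # estimated echo r*h
        e = in_n - echo_est                                    # subtraction step
        norm = sum(r_k * r_k for r_k in buf) + eps             # reference energy
        # N-LMS update: move h along the reference vector, scaled by the error.
        h = [h_k + mu * e * r_k / norm for h_k, r_k in zip(h, buf)]
        errors.append(e)
    return errors, h
```

When the microphone signal also contains driving noise that is unrelated to the reference, the error term no longer reflects only the mismatch between h and g, which is consistent with the performance degradation noted above.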
FIG. 13 is a block diagram showing a configuration of another conventional noise reduction device, which includes an echo canceller 40 in its front stage and a noise reduction unit 50 in its rear stage. The noise reduction unit 50 reduces stationary noise. Here, a noise reduction unit using a spectral subtraction technique is used. This device exhibits a higher performance than the device using only the echo canceller and the device using only the spectral subtraction technique. However, an input In into the echo canceller 40 in the front stage includes the stationary noise to be reduced in the rear stage. This brings about a problem in that the performance of the echo cancellation decreases (for example, see Basbug, F., Swaminathan, K., and Nandkumar, S. [2000]. “Integrated Noise Reduction and Echo Cancellation For IS-136 Systems,” Proceedings of ICASSP, vol. 3, pp. 1863-1866, which will be hereinafter referred to as “Non-patent Literature 1”).
As a measure to increase the performance of the echo canceller in a noisy environment, one may conceive of performing noise reduction before echo cancellation. In theory, however, noise reduction using the spectral subtraction technique cannot be performed before an echo canceller implemented in the time domain. In addition, if noise reduction is designed to be performed by use of a filter, the echo canceller cannot follow changes in the filter. Furthermore, if the noise reduction is performed before the echo cancellation, a problem arises in that echo components obstruct the estimation of stationary noise components for the purpose of the noise reduction. For this reason, there have been only a small number of cases where the noise reduction is performed before the echo cancellation.
FIG. 14 is a block diagram showing one such case. A noise reduction device of this type includes: a noise reduction unit 60 for performing noise reduction by means of spectral subtraction in its front stage; and an echo canceller 70 in its rear stage. In the noise reduction device having this configuration, disclosed in Ayad, B., Faucon, G., and B-Jeannes, R. L. [1996]. “Optimization of a Noise Reduction Preprocessing in an Acoustic Echo and Noise Controller,” Proceedings of ICASSP, vol. 2, noise reduction is attempted both in the stage prior to, and in the stage posterior to, the echo canceller. However, the noise reduction performed in the stage prior to the echo canceller serves merely as pre-processing.
If an echo canceller using the spectral subtraction technique or a Wiener filter in the frequency domain is adopted as the echo canceller 70 in the rear stage, the noise reduction can be performed before the echo cancellation is performed, or at the same time as the echo cancellation is performed. In this case, however, echo components are included in the noise components to be reduced in the noise reduction unit 60. This makes it difficult to estimate stationary noise components exactly. Taking this difficulty into consideration, the application of the noise reduction device disclosed in Non-patent Literature 1 is limited to telephone conversations. The noise reduction device disclosed in Non-patent Literature 1 is designed to measure stationary noise components during a time when the two calling parties utter no speech, that is, during a time when only background noise exists.
FIG. 15 shows an example of yet another conventional noise reduction device. This example is a noise reduction device which is realized by further providing the noise reduction device of FIG. 14 with the echo canceller 40 in the time domain in the stage prior to the noise reduction unit 60, for the purpose of estimating the stationary noise components more exactly. Accordingly, this noise reduction device is designed to reduce echo components beforehand (for example, see Dreiseitel, P., and Puder, H. [1997]. “A Combination of Noise Reduction and Improved Echo Cancellation,” Conference Proceedings of IWAENC, London, 1997, pp. 180-183 (which will be hereinafter referred to as “Non-patent Literature 3”), and Sakauchi, S., Nakagawa, A., Haneda, Y., and Kataoka, A. [2003]. “Implementing and Evaluating an Audio Teleconferencing Terminal with Noise and Echo Reduction,” Conference Proceedings of IWAENC, Japan, 2003, pp. 191-194 (which will be hereinafter referred to as “Non-patent Literature 4”)). In this case, even if the pre-processing is performed by use of the echo canceller 40, some echo components still remain. However, the noise reduction device is applied to hands-free telephone conversations. This makes it possible to expect that there will be a time during which the two calling parties utter no speech, that is, during which only background noise exists. For this reason, stationary noise components may be measured more exactly during that time.
In these conventional noise reduction devices, the respective echo cancellers are configured in two stages. These configurations make it possible to reduce echo more reliably. In each of the noise reduction devices disclosed in Non-patent Literatures 3 and 4, however, echo components are reduced only by the amount designated by an estimated value of the echo. For this reason, the echo components cannot be eliminated completely. In addition, in the noise reduction device disclosed in Non-patent Literature 3, flooring is performed on the basis of the value of the output from the pre-processing. In the noise reduction device disclosed in Non-patent Literature 4, an original sound adding method for improving audibility is adopted. In each of the two cases, echo components cannot be reduced to zero. On the other hand, in a case where the residual noise is in the form of music or spoken news, no matter how much the power of the residual noise is weakened, the noise is likely to be treated as human speech, and this leads to false recognition when speech recognition is performed.
Non-patent Literature 4 also refers to a scheme for dealing with reverberation of echo. According to this scheme, while an echo cancellation process is being performed, an estimated value of echo which has been found in a previous frame is multiplied by a coefficient, and the value thus obtained is added to the estimated value of echo in the current frame. Thereby, the echo cancellation process is performed on both echo components and reverberation components. However, this brings about a problem in that the coefficient needs to be set in advance in accordance with the environment of the room, and is not determined automatically.
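The scheme described above can be sketched in Python as follows. This is a minimal sketch under assumptions: the function name is hypothetical, the echo estimates are taken to be per-frame lists of power values, and the previous frame's estimate is taken to be the raw estimate before any reverberation term was added. The text does not state which of these readings applies, and a recursive variant using the previous combined estimate would be another possible reading. The coefficient `coeff` is supplied by the caller, reflecting the point that it is not determined automatically.

```python
def add_reverberation(echo_frames, coeff):
    """Add a scaled copy of the previous frame's echo estimate to each frame.

    echo_frames: list of frames, each a list of per-bin echo estimates.
    coeff: reverberation coefficient, assumed to be chosen in advance.
    """
    combined = []
    previous = None
    for frame in echo_frames:
        if previous is None:
            # The first frame has no predecessor, so it is used unchanged.
            combined.append(list(frame))
        else:
            combined.append([cur + coeff * prev for cur, prev in zip(frame, previous)])
        previous = frame
    return combined
```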
An echo canceller using a power spectrum in the frequency domain can deal with not only a case where the echo and the reference signals used to reduce the echo are monophonic signals, but also a case where they are stereo signals. Specifically, a power spectrum of a reference signal may be defined as a weighted average of the right and left reference signals, and the weight may be determined in accordance with the degree of correlation between the observed signal and each of the right and left reference signals, as described in Deligne, S., and Gopinath, R. [2001]. “Robust Speech Recognition with Multi-channel Codebook Dependent Cepstral Normalization (MCDCN),” Conference Proceedings of ASRU, 2001, pp. 151-154. In a case where pre-processing is to be performed by an echo canceller in the time domain, a stereo echo canceller technique, on which many research results have been disclosed, may be applied to the pre-processing.
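The weighted average of the right and left reference signals mentioned above can be sketched in Python as follows. The function name and the parameter `weight_left` are hypothetical. How the weight is derived from the degree of correlation is not specified here, so the sketch assumes that the weight has already been determined and is passed in by the caller.

```python
def stereo_reference_power(left_power, right_power, weight_left):
    """Combine left and right reference power spectra into one reference spectrum.

    weight_left: weight of the left channel, assumed to lie in [0, 1];
    the right channel receives the remaining weight 1 - weight_left.
    """
    if not 0.0 <= weight_left <= 1.0:
        raise ValueError("weight_left must lie between 0 and 1")
    if len(left_power) != len(right_power):
        raise ValueError("spectra must have the same number of bins")
    return [
        weight_left * left + (1.0 - weight_left) * right
        for left, right in zip(left_power, right_power)
    ]
```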