1. Field of the Invention
The present invention relates to a microphone array input type speech recognition scheme in which speech uttered by a user is input through a microphone array and recognized.
2. Description of the Background Art
In speech recognition, the surrounding environment in which the speech input is made can greatly affect the recognition performance. In particular, background noises and reflected sounds of the user's speech can degrade the recognition performance, so that they are a source of a serious problem encountered in the use of a speech recognition system. For this reason, a short range microphone designed for use near the mouth of the user, such as a headset microphone or a hand microphone, has generally been employed. However, it is uncomfortable to wear a headset microphone on the head for any extended period of time, while a hand microphone limits the freedom of the user as it occupies the user's hands. Consequently, there has been a demand for a speech input scheme that allows more freedom to the user.
A microphone array has been studied as a potential candidate for a speech input scheme that can resolve the conventionally encountered inconvenience described above, and there are some recent reports of its application to speech recognition systems. The microphone array is a set of a plurality of microphones arranged at spatially different positions, where noises can be reduced by synthesizing the outputs of these microphones.
FIG. 1 shows a configuration of a conventional speech recognition system using a microphone array. This speech recognition system of FIG. 1 comprises a speech input unit 11 having a microphone array formed by a plurality (N sets) of microphones, a sound source direction estimation unit 12, a sound source waveform estimation unit 13, a speech detection unit 14, a speech analysis unit 15, a pattern matching unit 16, and a recognition dictionary 17.
In this configuration of FIG. 1, the speech entered at the microphone array is converted into digital signals for respective microphones by the speech input unit 11, and the speech waveforms of all channels are entered into the sound source direction estimation unit 12.
At the sound source direction estimation unit 12, a sound source position or direction is estimated from time differences among signals from different microphones, using the known delay sum array method or a method based on the cross-correlation function as disclosed in U. Bub, et al.: "Knowing Who to Listen to in Speech Recognition: Visually Guided Beamforming", ICASSP '95, pp. 848-851, 1995.
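As a non-limiting illustration of estimating a time difference between two microphones from the cross-correlation function, a minimal sketch is given below. It is not taken from the cited reference. The function names, the two-microphone arrangement, the plane wave (far field) geometry, and the speed of sound of 343 m/s are all assumptions made for this example.

```python
import numpy as np

def estimate_delay_samples(x_ref, x_other):
    """Lag in samples by which x_other trails x_ref (negative if it leads)."""
    # Full cross-correlation; the first output corresponds to lag -(len(x_ref) - 1).
    corr = np.correlate(x_other, x_ref, mode="full")
    lags = np.arange(-(len(x_ref) - 1), len(x_other))
    # The lag with the largest correlation is taken as the time difference.
    return int(lags[np.argmax(corr)])

def delay_to_angle_deg(delay_samples, fs, mic_spacing, c=343.0):
    """Far-field arrival angle from broadside, in degrees.

    A positive angle means the source lies toward the reference microphone.
    c is an assumed speed of sound in m/s.
    """
    sin_theta = np.clip(c * delay_samples / fs / mic_spacing, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))

# Hypothetical usage: the second microphone hears the same signal 5 samples later.
rng = np.random.default_rng(0)
x_ref = rng.standard_normal(1024)
x_other = np.concatenate([np.zeros(5), x_ref[:-5]])
lag = estimate_delay_samples(x_ref, x_other)
angle = delay_to_angle_deg(lag, fs=16000, mic_spacing=0.2)
```

With more than two microphones, the pairwise time differences obtained in this manner would have to be combined into a single direction or position estimate, which this sketch does not attempt.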
Here, estimating a direction of the sound source corresponds to a case in which the sound source is far from the microphone array, so that the incident sound waves can be considered as plane waves. Estimating a position of the sound source corresponds to a case in which the sound source is relatively close to the microphone array, so that the sound waves must be considered as propagating in the form of spherical waves.
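The distinction between the plane wave case and the spherical wave case can be illustrated by the arrival delays at each microphone. The following is a minimal sketch under stated assumptions: a two-dimensional geometry, microphone coordinates in meters, and a speed of sound of 343 m/s. The function names and the array layout are hypothetical.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def plane_wave_delays(mic_xy, theta_rad):
    """Relative arrival delays (s) for a distant source in direction theta."""
    # Unit vector pointing from the array toward the source.
    u = np.array([np.cos(theta_rad), np.sin(theta_rad)])
    # A microphone further along u is nearer the source and hears it earlier.
    d = -(mic_xy @ u) / C
    return d - d.min()

def spherical_wave_delays(mic_xy, src_xy):
    """Relative arrival delays (s) for a source at a known position."""
    dist = np.linalg.norm(mic_xy - src_xy, axis=1)
    d = dist / C
    return d - d.min()

# Hypothetical four-microphone linear array along the x axis.
mics = np.array([[-0.15, 0.0], [-0.05, 0.0], [0.05, 0.0], [0.15, 0.0]])
theta = np.radians(60.0)
direction = np.array([np.cos(theta), np.sin(theta)])

far = spherical_wave_delays(mics, 1.0e4 * direction)   # source 10 km away
near = spherical_wave_delays(mics, 0.3 * direction)    # source 0.3 m away
plane = plane_wave_delays(mics, theta)
```

In this sketch the delays for the distant source agree closely with the plane wave delays, so only the direction matters, whereas the delays for the nearby source depend on the distance as well, which is why a position rather than a direction has to be estimated.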
Next, the sound source waveform estimation unit 13 focuses the microphone array on the sound source position or direction obtained by the sound source direction estimation unit 12, using the delay sum array method, and estimates the speech waveform of the target sound source.
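The focusing by the delay sum array method can be sketched as follows. This is only an illustrative example, assuming that the arrival delay of the target at each microphone is already known as a non-negative integer number of samples. The function name and the test signal are hypothetical.

```python
import numpy as np

def delay_and_sum(channels, delays_samples):
    """Estimate the target waveform by aligning and averaging the channels.

    channels: array of shape (n_mics, n_samples).
    delays_samples: non-negative integer arrival delay of the target per microphone.
    """
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for ch, d in zip(channels, delays_samples):
        d = int(d)
        # Advance each channel by its delay so that the target adds in phase.
        out[: n_samples - d] += ch[d:]
    return out / n_mics

# Hypothetical usage: three microphones receive the same signal 0, 2 and 4 samples late.
rng = np.random.default_rng(0)
s = rng.standard_normal(512)
delays = [0, 2, 4]
channels = np.stack([np.concatenate([np.zeros(d), s[: len(s) - d]]) for d in delays])
estimate = delay_and_sum(channels, delays)
```

Signals arriving from other directions are not aligned by these delays and therefore tend to be attenuated by the averaging. The only operations are shifts and additions, which is consistent with the small amount of calculation associated with this method.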
Thereafter, as in a usual speech recognition system, the speech analysis is carried out for the obtained speech waveform by the speech analysis unit 15, and the pattern matching using the recognition dictionary 17 is carried out for the obtained analysis parameters by the pattern matching unit 16, so as to obtain the recognition result. For the pattern matching, there are several known methods including the HMM (Hidden Markov Model), the multiple similarity method, and the DP matching, as detailed in Rabiner et al.: "Fundamentals of Speech Recognition", Prentice Hall, for example.
Now, in a speech recognition system, it is customary to input the speech waveform. For this reason, even in the conventional speech recognition system using the microphone array as described above, the sound source position (or the sound source direction) and the speech waveform are obtained by processing the microphone array outputs according to the delay sum array method, due to a need to estimate the speech waveform by a small amount of calculations. The delay sum array method is often utilized because the speech waveform can be obtained by a relatively small amount of calculations, but it is also associated with a problem that its ability to separate sound sources is lowered when a plurality of sound sources are located close to each other.
On the other hand, as a method for estimating the sound source position (or direction), there is a parametric method based on a model as disclosed in S. V. Pillai: "Array Signal Processing", Springer-Verlag, New York, 1989, for example. This method is presumably capable of estimating the sound source position at higher precision than the delay sum array method, and is also capable of obtaining the power spectrum necessary for the speech recognition from the sound source position estimation processing at the same time.
FIG. 2 shows a processing configuration for this conventionally proposed parametric method. In the configuration of FIG. 2, signals from a plurality of microphones are entered at a speech input unit 21, and the frequency analysis based on the FFT (Fast Fourier Transform) is carried out at a frequency analysis unit 22. Then, the sound source position estimation processing is carried out for each frequency component at a power estimation unit 23, and the final sound source position estimation result is obtained by synthesizing the estimation results for all the frequencies at a sound source direction judgement unit 24.
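The frequency analysis stage can be illustrated by the following minimal sketch, which applies the FFT to short frames of every channel and then forms, for each frequency component, a correlation matrix across the microphones. The use of such a matrix as the input to the per-frequency estimation is an assumption made for this example and is not stated in the description above. The function name, the frame length, the hop size, and the Hanning window are likewise hypothetical choices.

```python
import numpy as np

def spatial_correlations(channels, frame_len=256, hop=128):
    """Per-frequency correlation matrices across microphones.

    channels: array of shape (n_mics, n_samples).
    Returns an array of shape (n_bins, n_mics, n_mics), n_bins = frame_len // 2 + 1.
    """
    n_mics, n_samples = channels.shape
    window = np.hanning(frame_len)
    starts = range(0, n_samples - frame_len + 1, hop)
    # FFT of each windowed frame of each channel: (n_frames, n_mics, n_bins).
    spectra = np.array(
        [np.fft.rfft(channels[:, s:s + frame_len] * window, axis=1) for s in starts]
    )
    # Average the outer product x x^H over the frames, separately for each bin.
    return np.einsum("tmk,tnk->kmn", spectra, spectra.conj()) / len(spectra)

# Hypothetical usage with three channels of random data.
rng = np.random.default_rng(0)
R = spatial_correlations(rng.standard_normal((3, 2048)))
```

Each of the resulting matrices would then be processed separately, so the amount of subsequent calculation grows in proportion to the number of frequency components.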
Here, the sound source position estimation processing is a processing for estimating a power at each direction or position while minutely changing the direction or position over a range in which the sound source can possibly be located, so that a very large amount of calculations is required. In particular, in a case of assuming the propagation of sound waves in the form of spherical waves, what is to be estimated is a position of the sound source rather than an arriving direction of the sound waves, so that two- or three-dimensional scanning is necessary and consequently an enormous amount of calculations is required.
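The scanning can be illustrated for a single frequency component as follows. The description above does not specify which estimator is used, so this sketch assumes the well-known minimum variance (Capon) power estimate as one example of a model-based method. It also assumes a linear array along one axis, plane wave propagation, and a speed of sound of 343 m/s. All names and the synthetic correlation matrix are hypothetical.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def steering_vector(mic_x, freq_hz, theta_rad):
    """Plane-wave phase factors at the microphones of a linear array."""
    tau = mic_x * np.sin(theta_rad) / C
    return np.exp(-2j * np.pi * freq_hz * tau)

def scan_power(R, mic_x, freq_hz, thetas_rad):
    """Minimum variance power estimate at every candidate direction."""
    R_inv = np.linalg.inv(R)
    powers = []
    for th in thetas_rad:
        a = steering_vector(mic_x, freq_hz, th)
        # One quadratic form has to be evaluated for every scanned direction.
        powers.append(1.0 / np.real(a.conj() @ R_inv @ a))
    return np.array(powers)

# Hypothetical usage: eight microphones, one source at 20 degrees, 1 kHz component.
mic_x = 0.05 * np.arange(8)
freq = 1000.0
a0 = steering_vector(mic_x, freq, np.radians(20.0))
R = np.outer(a0, a0.conj()) + 0.01 * np.eye(8)   # source plus weak uncorrelated noise
thetas = np.radians(np.arange(-90, 91))          # scan in 1 degree steps
powers = scan_power(R, mic_x, freq, thetas)
best_deg = float(np.degrees(thetas[np.argmax(powers)]))
```

In this sketch the number of power evaluations equals the number of scanned directions multiplied by the number of frequency components. Replacing the one-dimensional direction grid by a two- or three-dimensional position grid multiplies that count further, which corresponds to the enormous amount of calculation noted above.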
Moreover, in the conventionally proposed parametric method described above, it is necessary to carry out this scanning processing for each frequency component obtained by the fast Fourier transform of the speech, so that it is difficult to reduce the required amount of calculations.