1. Field of the Invention
The present invention relates to a speech recognition system and a speech recognizing method.
2. Background Art
When a robot operates while communicating with people, for example, it has to recognize their speech while executing motions. When the robot executes motions, so-called ego noise (ego-motion noise) caused by the robot's motors or the like is generated. Accordingly, the robot has to perform speech recognition in an environment in which ego noise is being generated.
Several methods have been proposed to reduce ego noise by subtracting templates stored in advance from the spectra of the obtained sounds (S. Boll, “Suppression of Acoustic Noise in Speech Using Spectral Subtraction”, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-27, No. 2, 1979, and A. Ito, T. Kanayama, M. Suzuki, S. Makino, “Internal Noise Suppression for Speech Recognition by Small Robots”, Interspeech 2005, pp. 2685-2688, 2005). These are single-channel noise reduction methods. Single-channel noise reduction methods generally degrade the intelligibility and quality of the audio signal, for example, through the distorting effects of musical noise, a phenomenon that occurs when noise estimation fails (I. Cohen, “Noise Estimation by Minima Controlled Recursive Averaging for Robust Speech Enhancement”, IEEE Signal Processing Letters, vol. 9, No. 1, 2002).
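As an illustration only, template subtraction for a single spectral frame can be sketched as follows. The function name, the over_subtraction and floor parameters, and the sample values are assumptions made for this sketch and are not taken from the cited references; the flooring step is one common way of avoiding negative magnitudes.

```python
def subtract_noise_template(noisy_mag, template_mag, over_subtraction=1.0, floor=0.01):
    """Subtract a stored noise magnitude template from one spectral frame.

    noisy_mag and template_mag are per-frequency-bin magnitudes of equal length.
    over_subtraction and floor are illustrative tuning parameters (assumptions).
    """
    if len(noisy_mag) != len(template_mag):
        raise ValueError("frame and template must have the same number of bins")
    cleaned = []
    for observed, noise in zip(noisy_mag, template_mag):
        residual = observed - over_subtraction * noise
        # Clamp to a small fraction of the observed magnitude so that no bin
        # becomes negative when the template overestimates the noise.
        cleaned.append(max(residual, floor * observed))
    return cleaned


# Hypothetical magnitudes for a four-bin frame and a stored ego-noise template.
frame = [1.0, 0.75, 0.25, 0.5]
template = [0.5, 0.25, 0.5, 0.25]
cleaned = subtract_noise_template(frame, template)
```

In the third bin the template exceeds the observed magnitude, so the result is clamped to the floor. Bins in which the template does not match the actual noise in this way are where residual artifacts of the kind described above tend to arise.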
On the other hand, linear sound source separation (SSS) techniques are also very popular in the field of robot audition, where noise suppression is mostly carried out using SSS techniques with microphone arrays (K. Nakadai, H. Nakajima, Y. Hasegawa and H. Tsujino, “Sound source separation of moving speakers for robot audition”, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3685-3688, 2009, and S. Yamamoto, J. M. Valin, K. Nakadai, J. Rouat, F. Michaud, T. Ogata, and H. G. Okuno, “Enhanced Robot Speech Recognition Based on Microphone Array Source Separation and Missing Feature Theory”, IEEE/RSJ International Conference on Robotics and Automation (ICRA), 2005). However, neither a directional noise model, such as that assumed in the case of interfering speakers (S. Yamamoto, K. Nakadai, M. Nakano, H. Tsujino, J. M. Valin, K. Komatani, T. Ogata, and H. G. Okuno, “Real-time robot audition system that recognizes simultaneous speech in the real world”, Proc. of the IEEE/RSJ International Conference on Robots and Intelligent Systems (IROS), 2006), nor a diffuse background noise model (J. M. Valin, J. Rouat and F. Michaud, “Enhanced Robot Audition Based on Microphone Array Source Separation with Post-Filter”, Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2123-2128, 2004) holds entirely for ego-motion noise. In particular, because the motors are located in the near field of the microphones, they produce sounds that have both diffuse and directional characteristics.
Thus, a speech recognition system and a speech recognizing method capable of high-accuracy speech recognition in an environment with ego noise have not conventionally been developed.
Accordingly, there is a need for a speech recognition system and a speech recognizing method capable of high-accuracy speech recognition in an environment with ego noise.