In recent years, the operating speed of CPUs (Central Processing Units) and the storage capacity of memory have increased. This has made it possible to realize a large-vocabulary speech recognition system capable of recognizing as many as a few hundred thousand words by means of a statistical model using a large amount of speech data or text data.
In speech recognition systems, including such a large-vocabulary speech recognition system, high speech recognition accuracy can be achieved for a speech uttered at a location close to a microphone to which the speech to be recognized is input.
However, if a speech is uttered at a distant location, the speech recognition accuracy decreases with the distance between the microphone and the location at which the speech is uttered, due to the influence of noise or the like.
A first known technique to avoid the above problem is disclosed, for example, in a paper entitled “Speech Recognition in Noisy/Reverberant Environment by means of HMM Decomposition/Composition Using a Microphone Array” (Miki, Nishiura, Nakamura, and Shikano, the Transactions of the Institute of Electronics, Information and Communication Engineers D-II, Vol. J83-DII No. 11, pp. 2206–2214, November 2000) (hereinafter referred to as reference 1). In this technique, a microphone array is used to improve the signal-to-noise (SN) ratio of a speech uttered at a location distant from a microphone, and speech recognition is performed on the speech with the improved signal-to-noise ratio.
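The way a microphone array can raise the signal-to-noise ratio may be illustrated with a minimal delay-and-sum sketch. This is only one common array-processing approach and is not necessarily the processing used in reference 1. The sample delays, noise level, and function names (`delay_and_sum`, `snr_db`) are assumptions made for illustration.

```python
import math
import random


def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay and average the channels."""
    # Only keep samples that are available in every channel after alignment.
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    count = len(channels)
    return [
        sum(ch[d + i] for ch, d in zip(channels, delays)) / count
        for i in range(n)
    ]


def snr_db(clean, noisy):
    """Signal-to-noise ratio in dB of `noisy` relative to the known `clean` signal."""
    signal_power = sum(s * s for s in clean)
    noise_power = sum((x - s) ** 2 for s, x in zip(clean, noisy))
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical setup: one sine "utterance" reaching four microphones
# with different arrival delays (in samples) and independent noise.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
delays = [0, 3, 7, 12]

channels = []
for d in delays:
    delayed = [0.0] * d + clean
    channels.append([x + rng.gauss(0, 0.5) for x in delayed])

enhanced = delay_and_sum(channels, delays)
single = channels[0][: len(enhanced)]

print("single microphone SNR (dB):", snr_db(clean, single))
print("delay-and-sum SNR (dB):   ", snr_db(clean, enhanced))
```

Because the speech adds coherently after alignment while the independent noise partially cancels, the averaged output has a higher signal-to-noise ratio than any single channel. The sketch also shows why the microphone positions matter: the delays depend on the array geometry.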
A second known technique is disclosed, for example, in a paper entitled “Space Diversity Robust Speech Recognition Taking Into Account Space Acoustic Characteristic” (Shimizu, Kazita, Takeda, and Itakura, the Transactions of the Institute of Electronics, Information and Communication Engineers D-II, Vol. J83-DII No. 11, pp. 2448–2456, November 2000) (hereinafter referred to as reference 2). In speech recognition using this second technique, a plurality of microphones are disposed at various locations in a room, impulse responses at locations at various distances from the respective microphones are convolved with the speech data to be learned, and the resultant speech data is learned to produce HMMs (Hidden Markov Models) that take into account the impulse responses at the respective distances. The likelihood is then calculated for each of the speeches input to the respective microphones, taking into account the impulse responses at the respective distances.
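The data-preparation step of the second technique, convolving learning speech with an impulse response for each distance, may be sketched as follows. The impulse responses, distances, and sample values are hypothetical placeholders, and the HMM learning itself is omitted.

```python
def convolve(signal, impulse_response):
    """Direct-form discrete convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


# Hypothetical impulse responses keyed by speaker-to-microphone distance in metres.
impulse_responses = {
    1.0: [1.0, 0.0, 0.3],
    3.0: [0.6, 0.0, 0.4, 0.0, 0.2],
}

# Hypothetical clean learning utterance (a few samples for illustration).
clean_utterance = [1.0, 0.5, -0.5]

# One reverberated copy of the learning data per distance; in the technique
# described above, each copy would be used to learn an HMM for that distance.
learning_data = {
    distance: convolve(clean_utterance, h)
    for distance, h in impulse_responses.items()
}

for distance, samples in learning_data.items():
    print(distance, samples)
```

At recognition time, a likelihood would be computed for every combination of microphone input and distance-specific model, so the number of likelihood calculations grows with the product of the two.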
However, in the first and second techniques described above, microphones have to be placed at limited locations. In some cases, the limitation on the locations of microphones makes it difficult to use those techniques.
In recent years, toy robots that behave autonomously (for example, in the form of a stuffed toy animal) have been brought to market. Such a robot is capable of recognizing a speech uttered by a user and of behaving or outputting a synthesized speech depending on the result of the speech recognition. If a speech recognition apparatus using the first technique is installed on such a robot, the limitation on the locations of the plurality of microphones forming the microphone array makes it difficult to realize a robot with a small size, and the limitation also reduces the freedom in designing the robot.
On the other hand, in the case in which a speech recognition apparatus using the second technique is installed on a robot, it is required to dispose a plurality of microphones in each room in which the robot is used. This is not practical. Besides, in the second technique, it is required to calculate the likelihood of the HMMs taking into account the impulse responses at the respective distances for the speeches input to the plurality of microphones, and thus a great amount of calculation is needed in speech recognition.