As the controllability of peripheral devices by computer devices has improved, systems that automatically recognize speech inputted from a microphone or the like have become desirable. Such a speech recognition device can be expected to be used for various applications, such as dictation of a document, transcription of meeting minutes, interaction with a robot, and control of an external machine. In essence, the speech recognition device analyzes inputted speech to acquire a feature quantity, selects a word corresponding to the speech based on the acquired feature quantity, and thereby causes a computer device to recognize the speech. Various methods have been proposed to exclude environmental influences, such as background noise, when performing speech recognition. A typical example is a method in which a user is required to use a hand microphone or a head-set type microphone in order to exclude echoes or noises that may be superimposed on the speech to be recorded and to acquire only the inputted speech. Such a method requires the user to use extra hardware that is not ordinarily used.
One reason that a user is required to use the above-mentioned hand microphone or head-set type microphone is that, if the speaker speaks at a distance from the microphone, an echo may be generated depending on the environment, in addition to the influence of environmental noises. If an echo, in addition to noise, is superimposed onto a speech signal, a mismatch arises with the statistical models (e.g., hidden Markov models) used in speech recognition, which results in degradation of recognition efficiency.
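As a rough illustration of the superimposition described above, the signal observed at a distant microphone is often modeled as follows. The symbols o, s, h, and n are assumptions introduced here for illustration and do not appear in the source.

```latex
o(t) = (s * h)(t) + n(t) = \sum_{\tau \ge 0} h(\tau)\, s(t - \tau) + n(t)
```

Here s is the clean speech, h is the impulse response of the room between the speaker and the microphone, n is additive environmental noise, and * denotes convolution. Under this model the echo enters through h, whereas the noise enters as a separate additive term, which is why the two influences generally call for different countermeasures.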
FIG. 9 shows a typical method in which noise is taken into consideration when performing speech recognition. As shown in FIG. 9, if there is a noise, an inputted signal contains a speech signal and has an output probability distribution in which the speech signal is superimposed with a noise signal. Since, in many cases, noise occurs suddenly, a method is employed in which one microphone for acquiring the input signal and another microphone for acquiring the noise are used and, by means of a so-called two-channel signal, the speech signal and the noise signal are acquired separately from the input signal. The traditional speech signal shown in FIG. 9 is acquired from a first channel, and the noise signal is acquired from a second channel, so that, by using the two-channel signal, the original speech signal can be recognized from the inputted speech signal even in a noisy environment.
However, using data for two channels consumes the hardware resources of a speech recognition device, and in addition, a two-channel input may not be available in some cases. Therefore, the above method does not always enable efficient recognition. Furthermore, the requirement that information from both channels always be available simultaneously may inconveniently restrict practical speech recognition.
Conventionally, the cepstrum mean subtraction (CMS) method has been employed as a method for coping with the influence of a speech transfer route. A known disadvantage is that the CMS method is effective when the impulse response of the transfer characteristic is relatively short (several milliseconds to several dozen milliseconds), as in the case of the influence of a telephone line, but is not sufficiently effective when the impulse response of the transfer characteristic is longer (several hundred milliseconds), as in the case of an echo in a room. The reason for this disadvantage is that the transfer characteristic of an echo in a room is generally longer than the window width (10 msec to 40 msec) of the short-interval analysis used for speech recognition, and therefore the impulse response is not stable within the analysis interval.
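The idea behind CMS can be illustrated with a minimal sketch. It assumes the idealized case in which the impulse response is shorter than the analysis window, so that the transfer characteristic appears as an approximately constant additive offset in the cepstral domain for every frame. The function name, array shapes, and synthetic data below are hypothetical and are not taken from the source.

```python
import numpy as np


def cepstral_mean_subtraction(cepstra):
    """Subtract the per-coefficient mean taken over all frames.

    cepstra: array of shape (num_frames, num_coeffs).
    """
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)


# Hypothetical data: 200 frames of 13 cepstral coefficients.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 13))

# Idealization (an assumption): a short, time-invariant channel adds the
# same cepstral offset to every frame.
channel_offset = rng.normal(size=13)
observed = clean + channel_offset

# After CMS, the observed features coincide with the mean-normalized
# clean features, i.e. the constant channel offset has been removed.
residual = np.max(
    np.abs(cepstral_mean_subtraction(observed) - cepstral_mean_subtraction(clean))
)
print(f"max residual after CMS: {residual:.2e}")
```

When the impulse response extends over several hundred milliseconds, the echo in one frame depends on speech from preceding frames, so the constant-offset assumption used in this sketch no longer holds and subtracting a single mean does not remove it.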
As an echo suppression method in which short-interval analysis is not employed, there has been proposed a method in which multiple microphones are used and an inverse filter is designed to exclude echo components from a speech signal (M. Miyoshi and Y. Kaneda, “Inverse Filtering of room acoustics,” IEEE Trans. on ASSP, Vol. 36, No. 2, pp. 145-152, 1988). This method has the disadvantage that the impulse response of an acoustic transfer characteristic may not be minimum phase, and it is therefore difficult to design a realistic inverse filter. Furthermore, depending on the intended use environment, multiple microphones often cannot be installed because of cost and physical arrangement constraints.
Various methods for coping with an echo have been proposed, such as the echo canceller disclosed in Published Unexamined Patent Application No. 2002-152093, for example. However, these methods require speech to be inputted with two channels and are not capable of coping with an echo encountered with one-channel speech input. As an echo canceller technique, the method and the device described in Published Unexamined Patent Application No. 9-261133 are also known. However, the echo processing method disclosed in Published Unexamined Patent Application No. 9-261133 is not a generalized method because it requires speech measurement at multiple places under the same echo environment.
As for speech recognition in which environmental noises are taken into consideration, noise can be dealt with by, for example, the method of recognizing speech under sudden noise by selecting an acoustic model for each frame, which is disclosed in Patent Application Specification No. 2002-72456 of the common applicant. However, no effective speech recognition method has been known that exploits the characteristics not of suddenly generated noise but of an echo generated depending on the environment.
A method of predicting an intra-frame transfer characteristic H and feeding it back for speech recognition has been reported by T. Takiguchi, et al. (“HMM-Separation-Based Speech Recognition for a Distant Moving Speaker”, IEEE Trans. on SAP, Vol. 9, No. 2, pp. 127-140, 2001), for example. In this method, a transfer characteristic H in a frame is used to reflect the influence of an echo; a speech input is inputted via a head-set type microphone as a reference signal; an echo signal is separately measured; and then, based on the result of the two-channel measurement, an echo prediction coefficient α for predicting an echo is acquired. It is also shown that, with the above method by Takiguchi et al., speech recognition can be performed with sufficiently high accuracy in comparison with the case where echo influence is not taken into consideration at all and with processing by a CMS method; however, this method does not enable speech recognition from only a speech signal measured in a hands-free environment.
If speech recognition can be performed by a user who does not use his or her hands, or by a user in an environment where a head-set type microphone cannot be carried or worn, the applicability of speech recognition can be considerably extended. Furthermore, although the existing techniques described above are known, the applicability of speech recognition can be extended further if the speech recognition accuracy can be improved beyond that of the existing techniques. For example, the above-mentioned environments include a case where processing is performed based on speech recognition while driving a vehicle, piloting a plane, or moving within a large space, and a case where speech is inputted into a notebook-type personal computer or into a distant microphone of a kiosk device.
As described above, traditional speech recognition methods assume at least the use of a head-set type microphone or a hand microphone. However, with the miniaturization of computer devices and the expansion of applications, there is an increasing demand for a speech recognition method that can be used in an environment where echoes must be taken into consideration, and for a hands-free speech recognition function that works even in an environment where echoes may be generated. In the present invention, the term “hands-free” is used to mean a condition in which a speaker can speak at any position without restriction by the position of a microphone.