1. Field of the Invention
The present invention relates to a method for achieving high-accuracy speech recognition in a system where recognition is triggered by a command to start speech, such as the depression of a button, even when the user begins speaking before depressing the button.
2. Description of the Related Art
When speech recognition is performed, the distance between the user's mouth and the microphone and the input level must be set appropriately, and the command to start speech must be inputted properly (usually by depressing a button), in order to prevent errors due to ambient noise. If these conditions are not met, recognition performance degrades substantially. However, users do not always make such settings or inputs properly, so measures are needed to prevent performance degradation in these cases. In particular, the command to start speech is sometimes not inputted correctly; for example, the user begins speaking before the button is depressed. In such a case, the beginning of the speech is lost, because speech is captured through the microphone only after the command to start speech is inputted. When conventional speech recognition is performed on speech whose beginning is missing, the recognition rate drops greatly in comparison with the case where the command to start speech is inputted correctly.
In consideration of this problem, Japanese patent No. 2829014 discusses a method that provides, in addition to a data buffer for storing speech data captured after the command to start the recognition process is inputted, a ring buffer that continuously captures speech of a constant length. After the command is inputted, the head of the speech is detected using the speech captured in the data buffer. If the head of the speech is not detected, detection is performed again using, in addition, the speech from before the command was inputted, which is stored in the ring buffer. Because the ring buffer must capture speech constantly, this method imposes an additional CPU load compared with the case where only the data buffer is employed. Consequently, it is not necessarily suitable for battery-operated devices such as mobile devices.
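The two-buffer arrangement described above can be illustrated with a minimal sketch. It is not the implementation of the cited patent. The frame representation, the energy-based detector `is_speech`, the constants `ENERGY_THRESHOLD` and `RING_FRAMES`, and the class `Capture` are all hypothetical and introduced only for illustration. The head of the speech is assumed to be a speech frame immediately preceded by a silent frame.

```python
from collections import deque

ENERGY_THRESHOLD = 0.1  # hypothetical speech/silence threshold
RING_FRAMES = 5         # hypothetical ring-buffer length, in frames


def is_speech(frame):
    """Hypothetical detector: mean absolute amplitude above a threshold."""
    return sum(abs(s) for s in frame) / len(frame) > ENERGY_THRESHOLD


def find_head(frames):
    """Return the index of the first speech frame preceded by silence, else None."""
    for i in range(1, len(frames)):
        if is_speech(frames[i]) and not is_speech(frames[i - 1]):
            return i
    return None


class Capture:
    def __init__(self):
        # The ring buffer runs before the command; this is the constant cost.
        self.ring = deque(maxlen=RING_FRAMES)
        # The data buffer is filled only after the command to start speech.
        self.data = []
        self.started = False

    def feed(self, frame):
        if self.started:
            self.data.append(frame)
        else:
            self.ring.append(frame)

    def start_command(self):
        self.started = True

    def utterance(self):
        # First try to find the head in the data buffer alone.
        head = find_head(self.data)
        if head is not None:
            return self.data[head:]
        # Head not found: also search the speech captured before the command.
        combined = list(self.ring) + self.data
        head = find_head(combined)
        return combined[head:] if head is not None else combined
```

In this sketch, `feed` must be called for every frame even while no command has been issued, which corresponds to the constant capture load noted above.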
Furthermore, Japanese patent No. 3588929 discusses a method in which a word whose initial semi-syllable or mono-syllable is omitted is also treated as a target to be recognized. In this manner, degradation of the speech recognition rate in a noisy environment is prevented. Japanese patent No. 3588929 also discusses control that determines, depending on the noise level, whether a word with an omitted head portion should be a target word to be recognized. In this method, whether to omit the semi-syllable or mono-syllable at the beginning of a word is determined based on the type of that semi-syllable or mono-syllable, or on the noise level. If omission is chosen, the word without the omission is not designated as a target word to be recognized. Additionally, the determination of whether to omit the beginning of the word does not consider whether the command to start speech, inputted by the user's operation or movement, was performed correctly. Thus, in Japanese patent No. 3588929, at most one syllable is omitted from the beginning of the word, and in a quiet environment the beginning of the word is not omitted. As a result, in the case where speech is made before the button is depressed and, for example, two syllables of the speech are lost in a quiet environment, degradation of recognition performance cannot be avoided.
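The limitation described above can be made concrete with a small sketch. It is a simplification and not the method of the cited patent: the decision to omit is based on the noise level only (the syllable-type criterion is left out), an exact-match lookup stands in for an acoustic decoder, and the names `NOISE_THRESHOLD`, `recognition_targets`, `recognize`, and the example word are hypothetical.

```python
NOISE_THRESHOLD = 0.5  # hypothetical noise level above which the head is dropped


def recognition_targets(vocabulary, noise_level):
    """Map each target syllable sequence to the word it stands for."""
    targets = {}
    for word, syllables in vocabulary.items():
        if noise_level > NOISE_THRESHOLD and len(syllables) > 1:
            # Omission chosen: only the form missing one head syllable is a target.
            targets[tuple(syllables[1:])] = word
        else:
            # No omission: only the complete word is a target.
            targets[tuple(syllables)] = word
    return targets


def recognize(heard, targets):
    """Exact-match lookup standing in for a real acoustic decoder."""
    return targets.get(tuple(heard))


# Hypothetical one-word vocabulary, already split into syllables.
vocabulary = {"yokohama": ["yo", "ko", "ha", "ma"]}
quiet = recognition_targets(vocabulary, noise_level=0.1)
noisy = recognition_targets(vocabulary, noise_level=0.9)
```

Under these assumptions, an utterance that has lost its first two syllables, such as `["ha", "ma"]`, matches no target in either the quiet or the noisy case, which corresponds to the situation the paragraph above identifies as unavoidable degradation.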
In view of the above problems, an object of the present invention is to provide a method that prevents degradation of recognition performance, by a simple process, in the case where the beginning of speech is missing or omitted. Such omission occurs when the command to start speech is improperly inputted by a user.