When voice recognition is performed in an embedded system, the problem arises of how to perform it with limited resources (CPU, memory, and the like). Voice recognition is generally performed in a sequential flow in which an input signal from a voice input element is A/D converted, the resulting data is stored in a buffer, data delivered from the buffer as appropriate is processed by a recognition engine, and the recognition result is output. In the recognition engine, a voice detection process is performed first, and a heavy-load recognition process (voice recognition process) is then performed only for a segment of the input signal that is determined to include a voice.
When the resources are insufficient to handle the load of the recognition process, the recognition engine cannot keep up with the voice data delivered from the buffer, and a processing delay occurs for the delivered data. As a result, the time required for recognition is markedly prolonged. Furthermore, because delayed recognition also delays the delivery of data from the buffer to the engine, there arises a problem that the buffer overflows. When another heavy-load process runs simultaneously with the voice recognition process, there also arises a problem that this process delays the voice recognition process.
When the load of voice recognition causes a processing delay, the degree of its influence varies with the frequency of voice detection. If the frequency of voice detection is low, the processing delay can be recovered during the periods in which the recognition process is idle; if the frequency is high, however, the processing delay accumulates.
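By way of illustration only, the relationship between the frequency of voice detection, the accumulated delay, and buffer overflow may be sketched as a small simulation. The function names, the per-frame costs, and the buffer size below are hypothetical values chosen for the example and are not taken from any actual engine; one frame is assumed to arrive per frame period, and the engine is assumed to have one unit of processing time per frame period.

```python
from collections import deque


def make_pattern(period, voiced_len, n_periods):
    """Hypothetical input: in every `period` frames, the first `voiced_len` are detected as voice."""
    return [(i % period) < voiced_len for i in range(period * n_periods)]


def simulate(voiced_flags, cost_voiced=2.0, cost_idle=0.25, buffer_frames=20):
    """Simulate a bounded buffer feeding a recognition engine.

    cost_voiced   -- assumed processing time (in frame periods) for a frame
                     that goes through the heavy recognition process
    cost_idle     -- assumed processing time for a frame that only goes
                     through voice detection
    buffer_frames -- assumed buffer capacity, in frames

    Returns (max_depth, dropped, remaining): the deepest backlog observed,
    the number of frames lost to overflow, and the backlog at the end.
    """
    queue = deque()   # processing cost of each frame still waiting
    budget = 0.0      # processing time already spent on the head frame
    dropped = 0
    max_depth = 0
    for is_voice in voiced_flags:
        if len(queue) >= buffer_frames:
            dropped += 1              # buffer is full: the new frame is lost
        else:
            queue.append(cost_voiced if is_voice else cost_idle)
        budget += 1.0                 # one frame period of processing time
        while queue and queue[0] <= budget:
            budget -= queue.popleft()
        if not queue:
            budget = 0.0              # idle time cannot be saved for later
        max_depth = max(max_depth, len(queue))
    return max_depth, dropped, len(queue)


if __name__ == "__main__":
    low = simulate(make_pattern(period=50, voiced_len=5, n_periods=10))
    high = simulate(make_pattern(period=50, voiced_len=30, n_periods=10))
    print("low detection frequency :", low)
    print("high detection frequency:", high)
```

Under these assumed costs, the low-frequency input demands less processing time per period than is available, so the backlog drains between detections, whereas the high-frequency input demands more than is available, so the backlog grows until frames are lost.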
A general voice recognition engine detects the beginning edge of an utterance, and starts voice detection, based on whether a feature value such as the power or the S/N ratio of the input voice exceeds a threshold. It then detects the trailing edge when the feature value has remained below a threshold value for a given period of time. With these voice detection methods, voice is detected only infrequently when they are used in a quiet (low-noise) environment. However, in an environment where surrounding noise or the operational sound of the system itself is intermittently introduced into the input signal, there occurs a problem that the frequency of voice detection becomes high.
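The threshold-based detection described above may be sketched, for illustration only, as follows. The function name `detect_segments`, the use of a single threshold for both edges, the per-frame power values, and the convention that a segment ends at the first frame of the quiet run are all assumptions made for this example rather than features of any particular engine.

```python
def detect_segments(powers, threshold, hangover):
    """Return a list of (begin, end) frame index pairs, with `end` exclusive.

    powers    -- per-frame feature values (for example, frame power)
    threshold -- a beginning edge is detected when the value exceeds this
    hangover  -- number of consecutive frames below the threshold that must
                 elapse before the trailing edge is detected
    """
    segments = []
    begin = None   # index of the current beginning edge, if inside a segment
    below = 0      # consecutive below-threshold frames inside a segment
    for i, p in enumerate(powers):
        if begin is None:
            if p > threshold:
                begin = i
                below = 0
        elif p < threshold:
            below += 1
            if below >= hangover:
                # Trailing edge detected at frame i; report the segment as
                # ending where the quiet run started (an assumed convention).
                segments.append((begin, i - hangover + 1))
                begin = None
        else:
            below = 0
    if begin is not None:
        # Input ended before a trailing edge was detected.
        segments.append((begin, len(powers)))
    return segments


if __name__ == "__main__":
    quiet = [0.1] * 20
    utterance = [0.1] * 5 + [1.0] * 10 + [0.1] * 10
    noisy = ([0.1] * 4 + [1.0]) * 5   # short noise bursts, no speech
    for name, signal in (("quiet", quiet), ("utterance", utterance), ("noisy", noisy)):
        print(name, detect_segments(signal, threshold=0.5, hangover=3))
```

In this sketch, a quiet input produces no detections, whereas an input containing only intermittent noise bursts produces one detection per burst, each of which would be passed to the heavy-load recognition process.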
As a means for accurately detecting an utterance segment, Patent Document 1 describes a method of adjusting the threshold for voice detection according to the noise level. Patent Document 2 describes a method of extracting and detecting a segment that appears to be an utterance segment by matching against a standard model of vowels when a voice is detected.