The present invention relates to an apparatus and a method for recognizing speech, and more particularly to a speech recognition apparatus and method which start recognizing input speech on a real-time basis, immediately after the point in time at which the input speech is first detected but before the point in time at which the end of the input speech is confirmed.
In general, a speech sound has a different length each time it is uttered. The speech length does not vary linearly as a whole, because the length of vowels is especially variable. FIG. 1 of the accompanying drawings shows the different sound lengths of various words. The pronounced words in FIG. 1 include the English words "ON", "OFF", "START", and "STOP", and the Japanese words "HAI", "UE", and "SHITA".
As is apparent from FIG. 1, the pronounced words have widely different lengths, which vary from individual to individual and depending on the psychological condition of the speaker. Even when the speaker feels that he is pronouncing words in a standard manner, the pronounced word length varies within a range of 20% to 40%. Therefore, some measure should be taken to achieve good speech recognition of words having such different pronounced word lengths.
To cope with the above varying pronounced word lengths, there is one known speech recognition method in which a reference template is stored for each reference speech sound as a time series for a frequency component, an input pattern is extracted from input speech as a time series for the same frequency component, the accumulated difference (hereinafter referred to as "dissimilarity") between the input pattern and each of the reference templates is calculated, and the input speech is recognized based on the calculated dissimilarity. In the above method, each of the reference templates or the input speech pattern is normally produced by effecting a frequency analysis in regularly established frames, normalizing the length of a vocal tract using logarithmic conversion and a least-squares fit approximation line, and expressing the template or pattern as a time series for a frequency component.
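The dissimilarity calculation described above may be illustrated by the following minimal sketch. It is given only by way of example and is not taken from the known method itself: the function names `dissimilarity` and `recognize`, the use of an absolute difference per frame, the assumption that the input pattern and the templates have already been brought to the same number of frames, and the numerical values are all hypothetical.

```python
def dissimilarity(input_pattern, template):
    """Accumulate the frame-by-frame absolute difference between two time series.

    Assumption: both series hold one value per frame for the same frequency
    component and have the same number of frames.
    """
    if len(input_pattern) != len(template):
        raise ValueError("patterns must have the same number of frames")
    return sum(abs(a - b) for a, b in zip(input_pattern, template))


def recognize(input_pattern, templates):
    """Return the label of the reference template with the smallest dissimilarity."""
    return min(
        templates,
        key=lambda label: dissimilarity(input_pattern, templates[label]),
    )


# Hypothetical reference templates and input pattern (illustrative values only).
templates = {
    "ON": [1.0, 3.0, 2.0, 0.5],
    "OFF": [0.5, 1.0, 3.0, 3.0],
}
observed = [1.1, 2.8, 2.2, 0.4]

best_label = recognize(observed, templates)
```

In this sketch the input is recognized as the word whose template yields the smallest accumulated difference. A practical apparatus would handle several frequency components and differing pattern lengths, which the sketch deliberately omits.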
Methods of establishing matching paths for calculating the dissimilarity between an input speech pattern and each reference template include a DP matching method, which uses dynamic programming, and a linear matching method. The DP matching method has a higher matching accuracy, but requires many calculations which, if carried out in hardware, result in the use of many gates. The linear matching method is relatively effective for recognizing words having short syllables. Although the linear matching method requires fewer calculations than the DP matching method, it requires at least a memory for storing information on a speech sound from its start point to its end point. It has therefore been difficult to implement the linear matching method with an apparatus having a limited circuit arrangement.
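The difference between the two matching methods may be illustrated by the following minimal sketch, which is given only by way of example. The function names `linear_match` and `dp_match`, the absolute-difference frame cost, the particular step pattern used in the DP recursion, and the sample values are assumptions made for illustration and do not represent any specific known implementation. Both functions assume non-empty patterns.

```python
def linear_match(inp, ref):
    """Linear matching: map the reference time axis onto the input by a straight line.

    The input length must be known before any frame can be mapped, so the whole
    input from start point to end point has to be stored first.
    """
    n, m = len(inp), len(ref)
    total = 0.0
    for j in range(m):
        # Nearest input frame on the straight-line path for reference frame j.
        i = int(j * (n - 1) / (m - 1) + 0.5) if m > 1 else 0
        total += abs(inp[i] - ref[j])
    return total


def dp_match(inp, ref):
    """DP matching: find the minimum accumulated cost over all monotonic paths."""
    n, m = len(inp), len(ref)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    cost[0][0] = abs(inp[0] - ref[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # Best predecessor: advance the input, the reference, or both.
            best = min(
                cost[i - 1][j] if i > 0 else inf,
                cost[i][j - 1] if j > 0 else inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            cost[i][j] = abs(inp[i] - ref[j]) + best
    return cost[n - 1][m - 1]


# Hypothetical patterns: the input repeats one frame value, as with a lengthened vowel.
reference = [0.0, 1.0, 2.0, 1.0, 0.0]
lengthened = [0.0, 1.0, 1.0, 1.0, 2.0, 1.0, 0.0]

linear_cost = linear_match(lengthened, reference)
dp_cost = dp_match(lengthened, reference)
```

In this example the DP path absorbs the lengthened portion and accumulates no difference, whereas the straight-line path misaligns the peak and accumulates a nonzero difference. The DP recursion, however, evaluates every combination of input frame and reference frame, which corresponds to the larger amount of calculation noted above.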