1. Field of the Invention
The present invention relates to a method and an apparatus for feature extraction from an unknown spoken word and comparison with reference words.
2. Description of the Prior Art
In a widely used method, speech recognition takes place in several successive phases. Each spoken word to be recognized is divided into segments of equal length (e.g., 20 ms), called "frames", and the same characteristic features are extracted from each frame and represented by a vector normally consisting of 8 coefficients. After the end of the word has been detected, the word is compared successively with all reference words. Each of these comparisons is performed by a method commonly referred to as Dynamic Time Warping (DTW). In this method, a "distance" between the unknown word and each reference word is calculated, and the unknown word is identified as the reference word at the shortest distance from it. For further details, see S. B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences", IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-28, No. 4, August 1980, pp. 357-366, and H. Sakoe and S. Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition", IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-26, February 1978, pp. 43-49. This method is very time-consuming, so that comparisons with only a few reference words are possible (small vocabulary). By the use of computers of higher capacity, the vocabulary can be enlarged, but if no additional steps are taken, the vocabulary grows only in proportion to the added cost and complexity.
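The word-by-word comparison described above can be illustrated by the following minimal sketch, which is given by way of example only and is not taken from the cited references. It assumes that each word is already available as a sequence of feature vectors (one per frame), that the frame-to-frame distance is Euclidean, and that the warping path may advance by one frame in either word or in both at each step. The names frame_distance, dtw_distance and recognize are hypothetical. Because the table filled for each comparison has one entry per pair of frames, and one such table is needed per reference word, the sketch also indicates why the computing time grows with the size of the vocabulary.

```python
import math
from typing import Dict, Sequence

Frame = Sequence[float]  # one feature vector per frame (e.g., 8 coefficients)


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two feature vectors (an assumed local measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(unknown: Sequence[Frame], reference: Sequence[Frame]) -> float:
    """Accumulated distance along the cheapest warping path between two words."""
    n, m = len(unknown), len(reference)
    # cost[i][j] = cheapest accumulated distance aligning the first i frames
    # of the unknown word with the first j frames of the reference word.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(unknown[i - 1], reference[j - 1])
            # Assumed step pattern: advance in the unknown word, in the
            # reference word, or in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(unknown: Sequence[Frame], references: Dict[str, Sequence[Frame]]) -> str:
    """Return the reference word at the shortest DTW distance from the unknown word."""
    return min(references, key=lambda word: dtw_distance(unknown, references[word]))


if __name__ == "__main__":
    # Toy one-coefficient frames, for illustration only.
    refs = {"yes": [[0.0], [1.0], [2.0]], "no": [[5.0], [5.0], [5.0]]}
    spoken = [[0.0], [0.0], [1.0], [2.0]]  # "yes" with a stretched first frame
    print(recognize(spoken, refs))
```

In this toy case the unknown word differs from the reference "yes" only by a repeated first frame, which the warping absorbs at no cost, so "yes" is selected. Practical systems commonly constrain the warping path and normalize the accumulated distance by path length; those refinements are omitted here.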