In the art of automatic speech recognition, two approaches are commonly used for recognizing isolated words. Dynamic time warping (DTW) matches an unknown input utterance against a library of stored spectral patterns, or templates, using a procedure that dynamically alters the time dimension to minimize the accumulated distance score for each template. As a result, the match is desensitized to variation in talking rate. See F. Itakura, "Minimum Prediction Residual Principle Applied to Speech Recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-23, No. 1, February 1975, pp. 66-72. On the other hand, the hidden Markov model (HMM) approach characterizes speech as a plurality of statistical chains. HMM creates a statistical, finite-state Markov chain for each vocabulary word as it is trained on the data. It then computes the probability of generating the state sequence for each vocabulary word. The word with the highest accumulated probability is selected as the correct identification. Under HMM, time alignment is obtained indirectly through the sequence of states. See S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition," The Bell System Technical Journal, Vol. 62, No. 4, April 1983, pp. 1035-1074.
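The DTW matching described above can be illustrated with a minimal sketch. It assumes each utterance and template is a list of feature vectors, uses a plain Euclidean frame distance, and uses the simplest three-way step pattern; the cited Itakura reference uses an LPC-based distance and slope constraints that are not reproduced here. The names `frame_distance`, `dtw_distance`, and `recognize` are hypothetical.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(utterance, template):
    """Smallest accumulated distance over all monotonic time alignments."""
    n, m = len(utterance), len(template)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning the first i utterance
    # frames with the first j template frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(utterance[i - 1], template[j - 1])
            # Stretch the utterance, stretch the template, or advance both.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


def recognize(utterance, library):
    """Score the utterance against every template; lowest score wins."""
    scores = {word: dtw_distance(utterance, tpl)
              for word, tpl in library.items()}
    return min(scores, key=scores.get), scores
```

In this sketch an utterance spoken at half the rate of its template still accumulates a distance of zero against that template, which is the sense in which the match is insensitive to talking rate. Note that every utterance frame is compared with every template frame, for every template in the library.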
The front ends of the DTW and HMM approaches are similar: an unknown spoken utterance is converted into a digital representation by an analog-to-digital converter, and the result is analyzed using either linear predictive coding (LPC) or filter banks to extract its spectral features. See J. D. Markel and A. H. Gray Jr., Linear Prediction of Speech, (Springer-Verlag: New York, 1976). Also see "Speech Processing", AT&T Technical Journal, Vol. 65, No. 5, Sep./Oct. 1986. The features can be classified into a finite set of templates using vector quantization. The templates are then compared to a library, or stored set, of vocabulary templates to determine the closest match. This set of stored vocabulary templates is predetermined from measurements on speech data. The unknown input is then identified as the closest matching vocabulary entry. If the computer or machine does not find a close enough match, it can announce this result either by sounding an alarm or by using its synthetic voice.
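The vector quantization and closest-match steps above can be sketched as follows. This is an illustration only: the codebook is assumed to be a small list of reference vectors obtained beforehand, squared Euclidean distance is an assumed metric, and the names `quantize`, `closest_entry`, and `reject_threshold` are hypothetical. Feature extraction by LPC or filter banks is not shown.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frame, codebook):
    """Return the index of the codebook vector nearest to the frame."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(frame, codebook[k]))


def closest_entry(score_by_word, reject_threshold):
    """Pick the vocabulary word with the lowest distance score.

    Returns None when even the best score exceeds the threshold, which
    corresponds to the "no close enough match" case in the text.
    """
    word = min(score_by_word, key=score_by_word.get)
    if score_by_word[word] > reject_threshold:
        return None
    return word
```

A caller would treat a `None` result as the cue to sound the alarm or speak a rejection message; the threshold value itself would have to be tuned on speech data.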
The performance of speech recognizers depends on the design parameters selected, the nature and size of the vocabulary, and the acoustic environment. In general, a conventional recognizer of DTW design performs slightly better than one of HMM design. However, speech recognizers of DTW design are computationally intensive. Although a technique called pruning is used to reduce the computational requirements of DTW speech recognizers, those requirements remain far too high for implementation in a personal computer (PC) based system.
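The text does not say which pruning rule is meant. As one possible illustration, the sketch below assumes beam pruning: after each utterance frame, partial alignments whose accumulated distance exceeds the best one by more than a margin are abandoned, and cells reachable only from abandoned paths are never evaluated. The names `dtw_beam` and `beam`, and the Euclidean frame distance, are assumptions.

```python
import math

INF = float("inf")


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_beam(utterance, template, beam):
    """Return (accumulated distance, number of frame distances computed)."""
    m = len(template)
    prev = [0.0] + [INF] * m  # row 0: only the empty alignment is valid
    evaluated = 0
    for frame in utterance:
        cur = [INF] * (m + 1)
        for j in range(1, m + 1):
            best_prev = min(prev[j], prev[j - 1], cur[j - 1])
            if best_prev == INF:
                continue  # every predecessor was pruned; skip this cell
            cur[j] = frame_distance(frame, template[j - 1]) + best_prev
            evaluated += 1
        row_min = min(cur)
        # Abandon partial paths much worse than the best path in this row.
        prev = [c if c <= row_min + beam else INF for c in cur]
    return prev[m], evaluated
```

With an infinite `beam` the function reduces to unpruned DTW. A narrow beam evaluates fewer cells but can discard the true best path, and the saving applies to each template separately, so the total work still grows with vocabulary size.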