In current speech recognition systems, there are two commonly used techniques for acoustic modeling of words. The first technique uses word templates, and the matching process for word recognition is based on Dynamic Programming (DP) procedures. Examples of this technique are given in an article by F. Itakura, "Minimum Prediction Residual Principle Applied to Speech Recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-23, 1975, pp. 67-72, and in U.S. Pat. No. 4,181,821 to F. C. Pirz and L. R. Rabiner entitled "Multiple Template Speech Recognition System."
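For illustration only, DP matching against word templates can be sketched as a dynamic time warping alignment. This is a minimal sketch under stated assumptions, not the procedure of the cited references: the function names `dtw_distance` and `recognize` are hypothetical, each frame is assumed to be a fixed-length tuple of numbers, and plain Euclidean distance stands in for the Itakura distance.

```python
import math


def dtw_distance(template, utterance):
    """Cost of the best DP alignment between two sequences of feature frames.

    Assumption: Euclidean frame distance is a stand-in for the Itakura
    distance used in the cited article.
    """
    n, m = len(template), len(utterance)
    # cost[i][j] = best cost of aligning the first i template frames
    # with the first j utterance frames.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(template[i - 1], utterance[j - 1])
            # Allowed moves: advance the template, the utterance, or both.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


def recognize(templates, utterance):
    """Return the word whose stored template aligns at the lowest cost."""
    return min(templates, key=lambda word: dtw_distance(templates[word], utterance))
```

Here `templates` is assumed to be a dictionary mapping each word to one stored sample of that word. Every frame pair requires a distance computation, so the work for one template grows with the product of the two sequence lengths.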
The other technique uses phone-based Markov models, which are suited to probabilistic training and decoding algorithms. A description of this technique and related procedures is given in an article by F. Jelinek, "Continuous Speech Recognition by Statistical Methods," Proceedings of the IEEE, Vol. 64, 1976, pp. 532-556.
Three aspects of these models are of particular interest:
(1) Word Specificity--word templates are better for recognition because they are constructed from an actual sample of the word. Phone-based models are derived from man-made phonetic baseforms and represent an idealized version of the word which may not actually occur;
(2) Trainability--Markov models are superior to templates because they can be trained, e.g., by the Forward-Backward algorithm (described in the Jelinek article). Word templates use distance measures such as the Itakura distance (described in the Itakura article), spectral distance, etc., which are not trained. One exception is a method used by Bakis which allows training of word templates (R. Bakis, "Continuous Speech Recognition Via Centisecond Acoustic States," IBM Research Report RC 5971, Apr. 1976);
(3) Computational Speed--Markov models which use discrete acoustic processor output alphabets are substantially faster than Dynamic Programming matching (as used by Itakura) or continuous parameter word templates (as used by Bakis).
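To make the Markov model side of this comparison concrete, the following is a minimal sketch of the forward pass, which is one half of the Forward-Backward algorithm, for a Markov model with a discrete output alphabet. The function name `forward_probability` and the list-based parameter layout are hypothetical choices for illustration, the label sequence is assumed to be non-empty, and the sketch is not taken from the cited articles.

```python
def forward_probability(initial, transition, emission, labels):
    """Probability that a discrete-output Markov model produces `labels`.

    initial[s]        -- probability of starting in state s
    transition[r][s]  -- probability of moving from state r to state s
    emission[s][k]    -- probability that state s outputs label k
    labels            -- non-empty sequence of integer labels from the
                         acoustic processor (assumed layout)
    """
    n_states = len(initial)
    # alpha[s] = probability of the labels seen so far, ending in state s.
    alpha = [initial[s] * emission[s][labels[0]] for s in range(n_states)]
    for label in labels[1:]:
        alpha = [
            # The output probability is a table lookup on the label.
            emission[s][label]
            * sum(alpha[r] * transition[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)
```

Because each output is one label from a finite alphabet, the per-frame output probability is read from a table instead of being computed as a distance between parameter vectors, which is one plausible reason for the speed advantage noted in item (3). The same tables are the quantities that training re-estimates.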