The accuracy of speaker-independent speech recognition is inadequate with current algorithms, especially when the recognition is done over dial-up telephone lines. The accuracy of a speech recognizer refers to its ability to recognize an utterance correctly by comparing it to the system's precomputed word templates.
Traditionally, speech recognizers use Hidden Markov Models (HMMs), which are based on probability theory. During the recognition phase, the probability that each model could have produced the utterance is computed. The word whose model has the highest probability is selected as the recognized word.
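The selection rule described above can be sketched as follows. This is a minimal illustration, assuming discrete observation symbols and the standard forward algorithm for computing the probability; the function names `forward_likelihood` and `recognize`, the two-word vocabulary, and all model parameters are hypothetical and are not taken from any cited recognizer.

```python
def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) for a discrete-observation HMM (forward algorithm)."""
    n_states = len(start)
    # alpha[i] = probability of the observations so far, ending in state i
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


def recognize(obs, models):
    """Return the word whose model assigns the highest probability to obs."""
    return max(models, key=lambda word: forward_likelihood(obs, *models[word]))


# Toy two-state, left-to-right models (illustrative numbers only).
# Each entry is (start probabilities, transition matrix, emission matrix).
left_to_right = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "yes": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.1, 0.9]]),
    "no": ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.9, 0.1]]),
}

print(recognize([0, 0, 1, 1], models))
```

A practical recognizer would typically work with log probabilities to avoid numerical underflow on long utterances; the plain products are kept here only for readability.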
A speech recognition method that uses vector quantization (VQ) with HMMs instead of statistical pattern matching is known from S. Nakagawa and H. Suzuki, "A New Speech Recognition Method Based on VQ-Distortion Measure and HMM", Proceedings of International Conference on Acoustics, Speech and Signal Processing, pp. II-676 to II-679, Minneapolis, Minn., U.S.A., Apr. 27-30, 1993, incorporated herein by reference. During the recognition phase, the squared error between a word template and a given utterance is computed. Word templates are HMMs in which each state has its own VQ codebook. Every VQ codebook is computed from training data with the LBG vector quantization algorithm, as described in Y. Linde, A. Buzo, R. M. Gray, "An Algorithm for Vector Quantizer Design", IEEE Transactions on Communications, Vol. COM-28, No. 1, January 1980, incorporated herein by reference, and it contains the typical speech parameters that occur in that state. The template that gives the smallest squared error is chosen as the recognized word. The modified Viterbi algorithm that is used in computing the distance is also presented in Nakagawa et al. supra. A speech recognizer that uses HMMs with continuous mixture densities is presented in L. R. Rabiner, J. G. Wilpon and F. K. Soong, "Higher Performance Connected Digit Recognition Using Hidden Markov Models", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 37, pp. 1214-1225, August 1989, incorporated herein by reference. It uses the cepstrum derived from LPC analysis and its derivative (the spectral derivative) as the speech parameters. The cepstrum vector computed from the speech carries short-term information about spectral changes in the signal, while the spectral derivative (the delta cepstrum) carries information from a longer time span.
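For the VQ-distortion method described above, the selection of the template with the smallest squared error can be sketched as follows. This is a simplified illustration and not the modified Viterbi algorithm of Nakagawa et al.: it assumes a strict left-to-right state sequence with no transition penalties, and it assumes the per-state codebooks have already been trained (for example with the LBG algorithm). All names and numbers are hypothetical.

```python
def nearest_codeword_error(frame, codebook):
    """Squared Euclidean distance from a frame to its closest codeword."""
    return min(
        sum((f - c) ** 2 for f, c in zip(frame, codeword))
        for codeword in codebook
    )


def template_distortion(frames, state_codebooks):
    """Smallest total squared error over left-to-right state alignments.

    The path starts in the first state, ends in the last state, and at each
    frame either stays in the current state or advances by one state.
    """
    inf = float("inf")
    n_states = len(state_codebooks)
    best = [inf] * n_states
    best[0] = nearest_codeword_error(frames[0], state_codebooks[0])
    for frame in frames[1:]:
        new = [inf] * n_states
        for s in range(n_states):
            prev = min(best[s], best[s - 1] if s > 0 else inf)
            if prev < inf:
                new[s] = prev + nearest_codeword_error(frame, state_codebooks[s])
        best = new
    return best[-1]


def recognize_vq(frames, templates):
    """Return the word whose template gives the smallest squared error."""
    return min(templates, key=lambda w: template_distortion(frames, templates[w]))


# Toy templates: two states per word, one 2-D codeword per state.
templates = {
    "one": [[(0.0, 0.0)], [(1.0, 1.0)]],
    "two": [[(1.0, 1.0)], [(0.0, 0.0)]],
}
frames = [(0.1, 0.0), (0.0, 0.1), (0.9, 1.0)]

print(recognize_vq(frames, templates))
```

The dynamic-programming loop plays the role that a Viterbi-style alignment plays in the cited method: it decides which frames belong to which state so that the accumulated distortion is as small as possible.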
By adding the spectral derivative to the speech parameters, a more accurate, two-dimensional representation of the time-varying speech signal is obtained (frequency and time). According to Rabiner et al. supra, this enhances the recognition accuracy of an HMM that uses continuous mixture densities. However, the recognition accuracy with both of these methods is inadequate.
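Appending a spectral derivative to each cepstrum vector can be sketched as follows. The regression formula over a symmetric window used here is a common way to estimate the delta cepstrum; it is an assumption for illustration, and the window length and edge handling in Rabiner et al. may differ. The function names and the `half_window` parameter are hypothetical.

```python
def delta_cepstrum(cepstra, half_window=2):
    """Estimate the time derivative of each cepstral coefficient.

    Uses a linear-regression slope over frames t-K..t+K (K = half_window).
    Frames beyond the ends of the utterance are replaced by the edge frame.
    """
    n = len(cepstra)
    norm = 2 * sum(k * k for k in range(1, half_window + 1))
    deltas = []
    for t in range(n):
        delta = []
        for d in range(len(cepstra[t])):
            acc = 0.0
            for k in range(1, half_window + 1):
                ahead = cepstra[min(t + k, n - 1)][d]
                behind = cepstra[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            delta.append(acc / norm)
        deltas.append(delta)
    return deltas


def add_deltas(cepstra, half_window=2):
    """Concatenate each cepstrum vector with its delta cepstrum."""
    deltas = delta_cepstrum(cepstra, half_window)
    return [list(c) + d for c, d in zip(cepstra, deltas)]


# Toy input: the first coefficient rises by 1 per frame, the second is constant.
ramp = [[float(t), 2.0] for t in range(6)]
print(add_deltas(ramp)[3])
```

Each output vector then describes both the spectrum at one instant and how it is changing across neighbouring frames, which is the frequency-and-time representation referred to above.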
One known algorithm that is used for speaker verification gives a 1% false recognition and false rejection rate when the numbers from zero to nine are used to perform verification, as reported in High Accuracy Speaker Verification System Specification Sheet, April 1992, Ensigma Ltd., Turing House, Station Road, Chepstow, Gwent, NP6 5PB, United Kingdom, incorporated herein by reference. (The reference does not mention how many numbers the user has to speak during the verification process.)