This invention relates to speech recognition and, more particularly, to speech recognition systems based on a hybrid of hidden Markov models (HMMs) and multilayer perceptrons (MLPs).
By way of background, an instructive tutorial on hidden Markov modeling processes is found in a paper by Rabiner et al., "An Introduction to Hidden Markov Models," IEEE ASSP Magazine, January, 1986, pp. 4-16. A tutorial treatment of neural networks, including multilayer perceptron-type neural networks, is found in a paper by Richard P. Lippmann, entitled "An Introduction to Computing with Neural Nets," IEEE ASSP Magazine, April, 1987, pp. 4-23.
Various hidden-Markov-model-based speech recognition systems are known and need not be detailed herein. Such systems typically use realizations of phonemes which are statistical models of phonetic segments (including allophones or, more generically, phones) having parameters that are estimated from a set of training examples.
Models of words are made by concatenating appropriate phone models, a phone being an acoustic realization of a phoneme, a phoneme being the minimum unit of speech capable of use in distinguishing words. Recognition consists of finding the most-likely path through the set of word models for the input speech signal.
Known hidden Markov model speech recognition systems are based on a model of speech production as a Markov source. The speech units being modeled are represented by finite state machines. Probability distributions are associated with the transitions leaving each node, specifying the probability of taking each transition when visiting the node. A probability distribution over output symbols is associated with each node. The transition probability distributions implicitly model duration. The output symbol distributions are typically used to model speech characteristics such as spectra.
The probability distributions for transitions and output symbols are estimated using labeled examples of speech. Recognition consists of determining the path through the Markov chain that has the highest probability of generating the observed sequence. For continuous speech, this path will correspond to a sequence of word models.
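The search for the highest-probability path described above is commonly carried out with the Viterbi algorithm. The following is a minimal sketch of that search, not the recognizer of the present system; the two states, the two output symbols, and all probability values are hypothetical and chosen only for illustration. Log-probabilities are used to avoid numerical underflow on long observation sequences.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for a discrete-output HMM."""
    # best[s] holds the log-probability of the best path ending in state s.
    best = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for s in states:
            # Most probable predecessor of state s at this time step.
            prev = max(states, key=lambda p: best[p] + log_trans[p][s])
            new_best[s] = best[prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace the best final state back to the start.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


def logs(probabilities):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in probabilities.items()}


# Hypothetical toy model: two states, two output symbols.
STATES = ["a", "b"]
LOG_START = logs({"a": 0.9, "b": 0.1})
LOG_TRANS = {"a": logs({"a": 0.6, "b": 0.4}), "b": logs({"a": 0.1, "b": 0.9})}
LOG_EMIT = {"a": logs({"x": 0.8, "y": 0.2}), "b": logs({"x": 0.1, "y": 0.9})}

best_log_prob, best_path = viterbi(
    ["x", "x", "y", "y"], STATES, LOG_START, LOG_TRANS, LOG_EMIT
)
```

In a continuous speech recognizer the states would belong to concatenated phone models, so the recovered state path would map onto a sequence of word models.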
The specific hidden Markov model recognition system employed in conjunction with the present invention is the Decipher speech recognizer, which is available from SRI International of Menlo Park, Calif. The Decipher system incorporates probabilistic phonological information, a trainer capable of training phonetic models with different levels of context dependence, multiple pronunciations for words, and a recognizer. The recognizer can process speaker-independent continuous speech input by dealing with specific models developed through training.
Speech recognition using multilayer perceptrons as a source of the state-dependent observation likelihoods in a hidden Markov model has recently been proposed. A multilayer perceptron is a specific type of neural network composed of simple processing elements, which in this context are limited to those that compute a dot product of incoming activations and weights, plus a bias term, the result of which is mapped through a differentiable sigmoid nonlinearity. The hybridization of HMMs with MLPs was first suggested by groups led by Hervé Bourlard in papers entitled "Links between Markov Models and Multi-layer Perceptrons," Advances in Neural Information Processing Systems, Vol. 2, pp. 502-510, 1989 (Morgan Kaufmann, San Mateo, Calif.) (later reported in IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 12, pp. 1167-1178, December, 1990). Using the suggestions of Bourlard et al., the present inventors have built hybrid hidden Markov model/neural network speech recognition systems and verified that multilayer perceptrons can be used in them, and the inventors have identified areas of needed improvement in the state of the art of fully computerized speech recognition systems.
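The processing element described above (a dot product of incoming activations and weights, plus a bias term, mapped through a sigmoid) can be sketched as follows. This is an illustrative sketch only; the function names and the toy weights are hypothetical and do not come from any particular system.

```python
import math


def sigmoid(x):
    """Differentiable sigmoid nonlinearity mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(activations, weights, biases):
    """Compute one layer: each unit is sigmoid(dot(weights_row, activations) + bias)."""
    return [
        sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Feed inputs through a list of (weights, biases) layers in order."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical toy network: 2 inputs, 2 hidden units, 1 output unit.
TOY_LAYERS = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),  # hidden layer
    ([[2.0, -2.0]], [0.0]),                    # output layer
]
toy_output = mlp_forward([2.0, 2.0], TOY_LAYERS)
```

Each layer is fully determined by one weight row and one bias per unit, which is why the parameter count of such a network grows with the product of adjacent layer sizes.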
The hybrid HMM/MLP-based context-independent approaches have heretofore not been successfully extended to estimate context-dependent probabilities of phonetic classes of speech information. One context-dependent HMM/MLP approach has been suggested, but it remains unproven (Bourlard et al., "CDNN: A Context Dependent Neural Network For Continuous Speech Recognition," Proc. of International Conference on Acoustics, Speech and Signal Processing 92, published March, 1992).
Context-dependent modeling has been difficult with previous MLPs because a straightforward extension of the simpler context-independent approach greatly increases the number of parameters. Moreover, the inherently discriminative nature of conventional MLP training algorithms appears to make it difficult to model phonetic classes with multiple distributions, as is done in a conventional HMM speech recognition processor.
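The scale of the parameter growth can be illustrated with simple arithmetic. The sketch below assumes one plausible reading of a straightforward extension, namely a single-hidden-layer MLP given one output unit per combination of left phone, phone, and right phone. The layer sizes and the phone inventory are hypothetical values chosen for illustration, not figures from any actual system.

```python
def mlp_parameter_count(n_inputs, n_hidden, n_outputs):
    """Weights plus biases of a fully connected MLP with one hidden layer."""
    hidden_params = n_hidden * (n_inputs + 1)   # +1 for each unit's bias
    output_params = n_outputs * (n_hidden + 1)  # +1 for each unit's bias
    return hidden_params + output_params


# Hypothetical sizes, for illustration only.
N_INPUTS = 200   # acoustic feature values presented to the network
N_HIDDEN = 500   # hidden units
N_PHONES = 50    # context-independent phone classes

# One output per phone.
context_independent = mlp_parameter_count(N_INPUTS, N_HIDDEN, N_PHONES)

# Assumed naive extension: one output per (left, phone, right) combination.
context_dependent = mlp_parameter_count(N_INPUTS, N_HIDDEN, N_PHONES ** 3)

print(context_independent)  # 125550
print(context_dependent)    # 62725500
```

Under these assumed sizes the output layer dominates the total, and the network grows by a factor of roughly 500, with a correspondingly larger demand for training data per output class.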
What is needed is a practical context-dependent hybrid HMM/MLP approach to the speech recognition problem.