Speech is typically input to a speech recognition system using an analog transducer, such as a microphone, and converted to digital form. Signal pre-processing consists of computing a frame sequence of acoustic feature vectors by processing the speech samples in successive time intervals. In some systems, a clustering technique is used to convert these continuous-valued features to a sequence of discrete code words drawn from a code book of acoustic prototypes. Recognition of an unknown exemplar or speech utterance involves transforming the extracted frame sequence into an appropriate message. The recognition process is typically constrained by a set of acoustic models which correspond to the basic units of speech or speech signal classes employed in the recognizer, a lexicon which defines the vocabulary of the recognizer in terms of these basic units, and a language model which specifies allowable sequences of vocabulary items. The acoustic models, and in some cases the language model and lexicon, are learned from a set of representative training data or training exemplars.
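The pre-processing steps described above can be illustrated with a minimal sketch. It assumes a toy two-dimensional feature vector (log energy and zero-crossing rate) and a hand-picked code book; the function names, frame sizes, and prototype values are hypothetical and are not taken from any particular recognizer.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split digitized speech samples into successive (overlapping) frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def features(frame):
    """Toy acoustic feature vector: (log energy, zero-crossing rate)."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return (math.log(energy + 1e-10), crossings / (len(frame) - 1))

def quantize(vec, codebook):
    """Return the index of the nearest acoustic prototype (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])))

# Hypothetical input: 400 samples of a sine wave standing in for speech.
samples = [math.sin(2 * math.pi * 5 * n / 400) for n in range(400)]
frames = frame_signal(samples, frame_len=80, hop=40)

# Hypothetical code book of three prototypes in the same feature space.
codebook = [(-20.0, 0.0), (-1.0, 0.05), (-1.0, 0.5)]

# Frame sequence of continuous features, then the discrete code word sequence.
feature_sequence = [features(f) for f in frames]
code_words = [quantize(v, codebook) for v in feature_sequence]
```

In practice the code book would be learned by clustering training features rather than written by hand, and the features would be spectral rather than the two toy measurements used here.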
One recognition paradigm frequently employed in speech recognition is the neural network. A neural network is a parallel, distributed information processing structure consisting of processing elements, or neurons, interconnected via unidirectional signal channels called connections. Each processing element may possess a local memory and carry out localized information processing operations. Each processing element has many inputs and a single output that fans out into as many collateral connections as desired. Each input to a processing element has an associated connection weight. A neural network learns a given task, such as recognizing a frame sequence to classify a speech signal, through weight adaptation, in which a connection weight changes as a non-linear function of the current connection weight, the internal excitation state of the neuron, and the current input to the neuron at that connection. The output of the neuron is a non-linear function of its internal excitation, such as the sigmoid function.
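A single processing element of the kind described above might be sketched as follows. The weight update shown is one common choice (a gradient step on squared error for a sigmoid unit) and is an assumption made for illustration; the text does not specify a particular adaptation rule, and the names `neuron_output` and `adapt_weights` are hypothetical.

```python
import math

def sigmoid(x):
    """Non-linear output function of the neuron."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights):
    """Weighted sum of inputs (internal excitation) passed through a sigmoid."""
    excitation = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(excitation)

def adapt_weights(inputs, weights, target, rate=0.5):
    """One illustrative adaptation step (assumed delta rule, not from the source).

    Each new weight depends on the current weight, the excitation state
    (through the output y), and the input on that connection.
    """
    y = neuron_output(inputs, weights)
    delta = (target - y) * y * (1.0 - y)
    return [w + rate * delta * x for w, x in zip(weights, inputs)]

inputs = [1.0, 0.5]
weights = [0.2, -0.4]
before = neuron_output(inputs, weights)
weights = adapt_weights(inputs, weights, target=1.0)
after = neuron_output(inputs, weights)   # moves toward the target of 1.0
```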
Many neural network architectures can be trained for strong interclass discriminative properties. However, neural networks often lack the time normalization characteristics desired for speech signal processing. Because of speaker variability, different exemplars from the same speech signal class may vary in temporal scale. Such time dilations and compressions among exemplars of the same class greatly reduce the reliability of a neural network that lacks time normalization.
Time-delay neural network architectures, which are somewhat capable of time normalization, do exist. However, these architectures are very complex and have not found wide acceptance in the art of speech recognition. Thus, using a time-delay neural network for speech recognition is not very practical.
Another recognition paradigm frequently employed in speech recognition is the hidden Markov model. Hidden Markov modeling is a probabilistic pattern matching technique which is more robust than neural networks at modeling durational and acoustic variability among exemplars of a speech signal class. A hidden Markov model is a stochastic model which uses state transition and output probabilities to generate state sequences. Hidden Markov models represent speech as a sequence of states, which are assumed to model frames of speech with roughly stationary acoustic features. Each state is characterized by an output probability distribution which models acoustic variability in the spectral features or observations associated with that state. Transition probabilities between states model evolutionary characteristics and durational variabilities in the speech signal. The probabilities, or parameters, of a hidden Markov model are trained using frames extracted from a representative sample of speech data. Recognition of an unknown exemplar is based on the probability that the exemplar was generated by the hidden Markov model.
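The probability that an exemplar was generated by a given hidden Markov model can be computed with the standard forward recursion. The sketch below assumes a discrete-output model over code words; the two-state, left-to-right parameter values are hypothetical and chosen only for illustration.

```python
import itertools

def forward_probability(obs, init, trans, emit):
    """Probability of the observation sequence under a discrete-output HMM.

    init[s]     : probability of starting in state s
    trans[r][s] : transition probability from state r to state s
    emit[s][k]  : probability that state s outputs code word k
    """
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][o]
                 for s in range(n)]
    return sum(alpha)

# Hypothetical two-state left-to-right model over two code words (0 and 1).
init = [1.0, 0.0]
trans = [[0.6, 0.4],
         [0.0, 1.0]]
emit = [[0.9, 0.1],
        [0.2, 0.8]]

p = forward_probability([0, 0, 1, 1], init, trans, emit)

# Sanity check: probabilities of all length-3 sequences should sum to one.
total = sum(forward_probability(list(seq), init, trans, emit)
            for seq in itertools.product([0, 1], repeat=3))
```

For long utterances the raw probabilities underflow, so practical implementations usually work with scaled or logarithmic quantities; that detail is omitted here for brevity.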
One hidden Markov model based speech recognition technique involves determining an optimal state sequence through a hidden Markov model to represent an exemplar, using the Viterbi algorithm. The optimal state sequence is defined as the state sequence which maximizes the probability of the given exemplar in a particular hidden Markov model. During speech recognition, an optimal state sequence is determined for each of a plurality of hidden Markov models. Each hidden Markov model represents a particular speech signal class of the speech recognition system vocabulary. A likely hidden Markov model is selected from the plurality of hidden Markov models to determine the likely speech signal class.
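The Viterbi search and the selection among several models might be sketched as follows, assuming discrete-output models. The class labels "A" and "B" and all parameter values are hypothetical; the code works with raw probabilities for readability, which is only adequate for short sequences.

```python
def viterbi(obs, init, trans, emit):
    """Return (probability, state sequence) of the optimal path through the model."""
    n = len(init)
    delta = [init[s] * emit[s][obs[0]] for s in range(n)]
    back = []  # back[t][s]: best predecessor of state s at frame t + 1
    for o in obs[1:]:
        prev = [max(range(n), key=lambda r: delta[r] * trans[r][s])
                for s in range(n)]
        delta = [delta[prev[s]] * trans[prev[s]][s] * emit[s][o]
                 for s in range(n)]
        back.append(prev)
    last = max(range(n), key=lambda s: delta[s])
    path = [last]
    for prev in reversed(back):      # trace the best path backwards
        path.append(prev[path[-1]])
    path.reverse()
    return delta[last], path

def classify(obs, models):
    """Pick the speech signal class whose model gives the best Viterbi score."""
    return max(models, key=lambda name: viterbi(obs, *models[name])[0])

# Two hypothetical left-to-right models sharing topology but not output statistics.
trans = [[0.6, 0.4], [0.0, 1.0]]
models = {
    "A": ([1.0, 0.0], trans, [[0.9, 0.1], [0.2, 0.8]]),  # code word 0, then 1
    "B": ([1.0, 0.0], trans, [[0.1, 0.9], [0.8, 0.2]]),  # code word 1, then 0
}

obs = [0, 0, 1, 1]
score, path = viterbi(obs, *models["A"])
label = classify(obs, models)
```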
Training hidden Markov model based recognizers involves estimating the parameters for the word models used in the system. Parameters for the models are chosen based on a maximum likelihood criterion. That is, the parameters maximize the likelihood of the training data being produced by the model. This maximization is performed using the Baum-Welch algorithm, a re-estimation technique based on first aligning the training data with the current models, and then updating the parameters of the models based on this alignment. Because the hidden Markov models are trained on a class-by-class basis, interclass distinction may be rather poor.
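One Baum-Welch re-estimation step for a single training sequence might look like the sketch below. It assumes a discrete-output model in which every parameter is strictly positive (so no state has zero occupancy), and it omits the scaling and multi-sequence pooling a practical trainer would need. The function name and the starting parameter values are hypothetical.

```python
def baum_welch_step(obs, init, trans, emit):
    """One re-estimation pass; returns (init, trans, emit, likelihood before update)."""
    n, m, T = len(init), len(emit[0]), len(obs)
    # Forward pass: al[t][s] = P(o_1..o_t, state s at frame t)
    al = [[init[s] * emit[s][obs[0]] for s in range(n)]]
    for o in obs[1:]:
        a = al[-1]
        al.append([sum(a[r] * trans[r][s] for r in range(n)) * emit[s][o]
                   for s in range(n)])
    # Backward pass: be[t][s] = P(o_{t+1}..o_T | state s at frame t)
    be = [[1.0] * n]
    for o in reversed(obs[1:]):
        b = be[0]
        be.insert(0, [sum(trans[s][r] * emit[r][o] * b[r] for r in range(n))
                      for s in range(n)])
    like = sum(al[-1])
    # Soft alignment of frames to states and to transitions.
    gamma = [[al[t][s] * be[t][s] / like for s in range(n)] for t in range(T)]
    xi = [[[al[t][r] * trans[r][s] * emit[s][obs[t + 1]] * be[t + 1][s] / like
            for s in range(n)] for r in range(n)] for t in range(T - 1)]
    # Parameter update from the expected counts given by the alignment.
    new_init = gamma[0][:]
    new_trans = [[sum(x[r][s] for x in xi) / sum(g[r] for g in gamma[:-1])
                  for s in range(n)] for r in range(n)]
    new_emit = [[sum(g[s] for g, o in zip(gamma, obs) if o == k) /
                 sum(g[s] for g in gamma) for k in range(m)] for s in range(n)]
    return new_init, new_trans, new_emit, like

# Hypothetical training sequence of code words and starting parameters.
obs = [0, 0, 1, 0, 1, 1, 1, 0]
init, trans, emit = [0.5, 0.5], [[0.7, 0.3], [0.4, 0.6]], [[0.6, 0.4], [0.3, 0.7]]

likelihoods = []
for _ in range(5):
    init, trans, emit, like = baum_welch_step(obs, init, trans, emit)
    likelihoods.append(like)  # should never decrease from one pass to the next
```

Each pass uses only the exemplars of one class, which is why a model trained this way learns nothing about how its class differs from the others.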
Attempts have been made to train all classes simultaneously based on a maximum mutual information criterion. However, the mathematical manipulations are complicated, the resulting algorithms are not very practical, and many assumptions must be made. Thus, training hidden Markov models for strong interclass distinction is not very practical.