The function of automatic speech recognition (ASR) systems is to determine the lexical identity of spoken utterances. The recognition process, also referred to as classification, typically begins with the conversion of an analog acoustical signal into a stream of digitally represented spectral vectors or frames which describe important characteristics of the signal at successive time intervals. The classification or recognition process is based upon the availability of reference models which describe aspects of the behavior of spectral frames corresponding to different words. A wide variety of models have been developed, but they all share the property that they describe the temporal characteristics of spectra typical of particular words or sub-word segments. The sequence of spectral vectors arising from an input utterance is compared with the models, and the success with which models of different words predict the behavior of the input frames determines the putative identity of the utterance.
Currently most systems utilize some variant of a statistical model called the Hidden Markov Model (HMM). Such models consist of sequences of states connected by arcs, and a probability density function (pdf) associated with each state describes the likelihood of observing any given spectral vector at that state. A separate set of probabilities may be provided which determine transitions between states.
The process of computing the probability that an unknown input utterance corresponds to a given model, also known as decoding, is usually done in one of two standard ways. The first approach is known as the Forward-Backward algorithm, and uses an efficient recursion to compute the match probability as the sum of the probabilities of all possible alignments of the input sequence and the model states permitted by the model topology. An alternative, called the Viterbi algorithm, approximates the summed match probability by finding the single sequence of model states with the maximum probability. The Viterbi algorithm can be viewed as simultaneously performing an alignment of the input utterance and the model and computing the probability of that alignment.
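The two computations described above can be sketched as follows. This is a minimal illustration rather than any particular system's decoder: the two-state left-to-right model, its transition values, and the per-frame state likelihoods are all invented for the example, and only the forward pass of the Forward-Backward algorithm is shown, since that pass alone yields the summed match probability.

```python
import math

NEG_INF = -math.inf

def safe_log(p):
    """Natural log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    peak = max(values)
    if peak == NEG_INF:
        return NEG_INF
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def forward_log_prob(log_init, log_trans, log_obs):
    """Summed match probability over all alignments (forward recursion)."""
    n = len(log_init)
    alpha = [log_init[s] + log_obs[0][s] for s in range(n)]
    for t in range(1, len(log_obs)):
        alpha = [
            log_sum_exp([alpha[r] + log_trans[r][s] for r in range(n)]) + log_obs[t][s]
            for s in range(n)
        ]
    return log_sum_exp(alpha)

def viterbi(log_init, log_trans, log_obs):
    """Log probability and state sequence of the single best alignment."""
    n = len(log_init)
    delta = [log_init[s] + log_obs[0][s] for s in range(n)]
    backptrs = []
    for t in range(1, len(log_obs)):
        new_delta, ptrs = [], []
        for s in range(n):
            best_r = max(range(n), key=lambda r: delta[r] + log_trans[r][s])
            ptrs.append(best_r)
            new_delta.append(delta[best_r] + log_trans[best_r][s] + log_obs[t][s])
        delta = new_delta
        backptrs.append(ptrs)
    last = max(range(n), key=lambda s: delta[s])
    path = [last]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    path.reverse()
    return delta[last], path

# Hypothetical two-state left-to-right model and three input frames.
log_init = [safe_log(1.0), safe_log(0.0)]
log_trans = [[safe_log(0.6), safe_log(0.4)],
             [safe_log(0.0), safe_log(1.0)]]
# log_obs[t][s] = log likelihood of frame t under state s (toy values).
log_obs = [[safe_log(0.9), safe_log(0.1)],
           [safe_log(0.5), safe_log(0.5)],
           [safe_log(0.2), safe_log(0.8)]]

total_log_prob = forward_log_prob(log_init, log_trans, log_obs)
best_log_prob, best_path = viterbi(log_init, log_trans, log_obs)
print(math.exp(total_log_prob), math.exp(best_log_prob), best_path)
```

Because the Viterbi score covers only one of the alignments that the forward recursion sums over, it can never exceed the summed probability; the returned path is the alignment of frames to states.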
HMMs can be created to model entire words, or alternatively, a variety of sub-word linguistic units, such as phonemes or syllables. Phone-level HMMs have the advantage that a relatively compact set of models can be used to build arbitrary new words, given that their phonetic transcription is known. More sophisticated versions reflect the fact that contextual effects can cause large variations in the way different phones are realized. Such models are known as allophonic or context-dependent. A common approach is to initiate the search with relatively inexpensive context-independent models and re-evaluate a small number of promising candidates with context-dependent phonetic models.
As in the case of the phonetic models, various levels of modeling power are available in the case of the probability densities describing the observed spectra associated with the states of the HMM. There are two major approaches: the discrete pdf and the continuous pdf. In the former, the spectral vectors corresponding to the input speech are first quantized with a vector quantizer which assigns each input frame an index corresponding to the closest vector from a codebook of prototypes. Given this encoding of the input, the pdfs take on the form of vectors of probabilities, where each component represents the probability of observing a particular prototype vector given a particular HMM state. One of the advantages of this approach is that it makes no assumptions about the nature of such pdfs, but this is offset by the information loss incurred in the quantization stage.
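As a rough sketch of the discrete pdf approach, the fragment below quantizes a frame to its nearest codebook prototype and then reads the state probability from a table. The three-entry codebook, the state names, and the probability values are hypothetical, and squared Euclidean distance is assumed as the quantizer's distortion measure.

```python
def quantize(frame, codebook):
    """Return the index of the codebook prototype closest to the frame."""
    def sq_dist(k):
        return sum((x - c) ** 2 for x, c in zip(frame, codebook[k]))
    return min(range(len(codebook)), key=sq_dist)

# Hypothetical codebook of two-dimensional prototype vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]

# Hypothetical discrete pdfs: one probability per codebook index, per state.
state_pdf = {
    "s1": [0.7, 0.2, 0.1],
    "s2": [0.1, 0.3, 0.6],
}

def discrete_prob(frame, state):
    """Probability of the frame given the state: a single table lookup."""
    return state_pdf[state][quantize(frame, codebook)]

print(quantize((0.9, 1.2), codebook))     # index of the nearest prototype
print(discrete_prob((3.5, 0.2), "s2"))
```

Once a frame has been quantized, the same index serves every state, so the per-state cost reduces to one array access.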
The use of continuous pdfs eliminates the quantization step, and the probability vectors are replaced by parametric functions which specify the probability of any arbitrary input spectral vector given a state. The most common class of functions used for this purpose is the mixture of Gaussians, where arbitrary pdfs are modeled by a weighted sum of Normal distributions. One drawback of using continuous pdfs is that, unlike in the case of the discrete pdf, the designer must make explicit assumptions about the nature of the pdf being modeled, something which can be quite difficult since the true distribution form for the speech signal is not known. In addition, continuous pdf models are computationally far more expensive than discrete pdf models since, following vector quantization, the computation of a discrete probability involves no more than a single table lookup.
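For comparison with the table lookup of the discrete case, a mixture-of-Gaussians density might be evaluated along the following lines. Diagonal covariance matrices are an assumption made here for brevity, and the function names and parameter values are illustrative only.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumed form)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm(x, weights, means, variances):
    """Log of a weighted sum of Gaussian densities, computed stably."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical two-component mixture over two-dimensional frames.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 1.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]

print(log_gmm([0.5, -0.2], weights, means, variances))
```

Every component requires a pass over all vector dimensions, which is the source of the cost difference relative to the single lookup of the discrete pdf.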
The probability values in the discrete pdf case and the parameter values of the continuous pdf are most commonly trained using the Maximum Likelihood method. In this manner, the model parameters are adjusted so that the likelihood of observing the training data given the model is maximized. However, it is known that this approach does not necessarily lead to the best recognition performance. This realization has led to the development of new training criteria, known as discriminative, the objective of which is to adjust model parameters so as to minimize the number of recognition errors rather than to fit the distributions to the data.
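For the discrete pdf, the Maximum Likelihood estimate has a particularly simple form when the assignment of training frames to a state is taken as fixed: each probability is the relative frequency of the corresponding codebook index. The sketch below assumes such a fixed alignment (an assumption of this example, not a statement about any specific training procedure) and uses a hypothetical function name.

```python
from collections import Counter

def ml_discrete_pdf(aligned_codes, codebook_size):
    """Relative-frequency (maximum likelihood) estimate of a state's discrete pdf.

    aligned_codes: codebook indices of the training frames assigned to the state.
    """
    if not aligned_codes:
        raise ValueError("no training frames aligned to this state")
    counts = Counter(aligned_codes)
    total = len(aligned_codes)
    return [counts[k] / total for k in range(codebook_size)]

# Four hypothetical training frames aligned to one state.
print(ml_discrete_pdf([0, 0, 1, 2], codebook_size=3))
```

A discriminative criterion would instead adjust these values according to their effect on recognition errors, which this relative-frequency estimate does not take into account.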
As used heretofore, discriminative training has been applied most successfully to small-vocabulary tasks. In addition, it presents a number of new problems, such as how to appropriately smooth the discriminatively-trained pdfs and how to adapt these systems to a new user with a relatively small amount of training data.
To achieve high recognition accuracies, a recognition system should use high-resolution models which are computationally expensive (e.g., context-dependent, discriminatively-trained continuous density models). In order to achieve real-time recognition, a variety of speedup techniques are usually used.
In one typical approach, the vocabulary search is performed in multiple stages or passes, where each successive pass makes use of increasingly detailed and expensive models, applied to increasingly small lists of candidate models. For example, context-independent, discrete models can be used first, followed by context-dependent, continuous density models. When multiple sets of models are used sequentially during the search, a separate simultaneous alignment and pdf evaluation is essentially carried out for each set.
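The multi-pass idea can be outlined as a chain of scoring and pruning steps. In the sketch below, the scorers are stand-in dictionary lookups with invented scores; in a real system each scorer would carry out its own alignment and pdf evaluation with the corresponding model set.

```python
def multi_pass_search(candidates, passes):
    """Prune a candidate list through successive passes.

    passes: list of (scorer, keep) pairs ordered from cheapest to most
    expensive; each pass keeps only its `keep` best-scoring candidates.
    """
    survivors = list(candidates)
    for scorer, keep in passes:
        survivors = sorted(survivors, key=scorer, reverse=True)[:keep]
    return survivors

# Hypothetical log scores from a cheap model and from a detailed model.
cheap_scores = {"cat": -10.0, "cap": -11.0, "dog": -30.0, "cut": -12.0}
detailed_scores = {"cat": -8.0, "cap": -7.5, "cut": -9.0}

result = multi_pass_search(
    cheap_scores.keys(),
    [(cheap_scores.get, 3), (detailed_scores.get, 1)],
)
print(result)
```

Note that the detailed scorer is only ever called on the three survivors of the first pass, which is where the saving arises.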
In other prior art approaches, computational speedups are applied to the evaluation of the high-resolution pdfs. For example, Gaussian-mixture models are evaluated by a fast but approximate identification of those mixture components which are most likely to make a significant contribution to the probability and a subsequent evaluation of those components in full. Another approach speeds up the evaluation of Gaussian-mixture models by exploiting a geometric approximation of the computation. However, even with speedups, the evaluation can be slow enough that only a small number of such evaluations can be carried out.
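One way to picture the first of these speedups is a shortlist evaluation: a cheap test ranks the mixture components, and only the top few are evaluated in full. The ranking rule used below (squared distance from the frame to each component mean) and the diagonal covariance form are assumptions chosen for simplicity, not the selection methods of any specific prior art system.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumed form)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm_shortlist(x, weights, means, variances, k):
    """Approximate mixture log density using only the k nearest components."""
    # Cheap stage: rank components by squared distance to their means.
    order = sorted(
        range(len(weights)),
        key=lambda i: sum((xi - m) ** 2 for xi, m in zip(x, means[i])),
    )
    # Expensive stage: full evaluation of the shortlisted components only.
    terms = [
        math.log(weights[i]) + log_gaussian_diag(x, means[i], variances[i])
        for i in order[:k]
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical three-component mixture with well separated means.
weights = [0.5, 0.3, 0.2]
means = [[0.0, 0.0], [5.0, 5.0], [-4.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
frame = [0.2, -0.1]

exact = log_gmm_shortlist(frame, weights, means, variances, k=3)
approx = log_gmm_shortlist(frame, weights, means, variances, k=1)
print(exact, approx)
```

Dropping components can only lower the computed density, so the shortlist value is a lower bound on the exact one; the approximation is close when the omitted components lie far from the frame.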
In another scheme, approximate models are first used to compute the state probabilities given the input speech. All state probabilities which exceed some threshold are then recomputed using the detailed model, while the rest are retained as they are. Given the new, composite set of probabilities, a new Viterbi search is performed to determine the optimal alignment and overall probability. In this method, the alignment has to be repeated and, in addition, the approximate and detailed probabilities must be similar, compatible quantities. If the detailed model generates probabilities which are significantly higher than those from the approximate models, the combination of the two will most likely not lead to satisfactory performance. This requirement constrains this method to use approximate and detailed models which are fairly closely related and thus generate probabilities of comparable magnitude. It should also be noted that in this method there is no guarantee that all of the individual state probabilities that make up the final alignment probability come from detailed models.
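The recomputation step of this scheme can be sketched as below. The function name, the threshold, and the toy values are hypothetical; the subsequent repeated Viterbi search over the composite table is omitted.

```python
def composite_log_probs(approx_log_probs, detailed_log_prob, threshold):
    """Replace above-threshold approximate scores with detailed ones.

    approx_log_probs[t][s]: approximate log probability of frame t in state s.
    detailed_log_prob(t, s): callable returning the detailed log probability.
    Returns the composite table and the number of recomputed entries.
    """
    composite = []
    recomputed = 0
    for t, row in enumerate(approx_log_probs):
        new_row = []
        for s, lp in enumerate(row):
            if lp > threshold:
                new_row.append(detailed_log_prob(t, s))
                recomputed += 1
            else:
                new_row.append(lp)  # retained approximate value
        composite.append(new_row)
    return composite, recomputed

# Toy values: two frames, two states, and a constant stand-in detailed model.
approx = [[-1.0, -9.0], [-8.0, -2.0]]
table, n_recomputed = composite_log_probs(approx, lambda t, s: -0.5, threshold=-5.0)
print(table, n_recomputed)
```

The resulting table mixes values from two different models in the same search, which is why the two sets of probabilities must be of comparable magnitude for the alignment to be meaningful.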
The present invention represents a novel approach to the efficient use of high-resolution models in large vocabulary recognition. The proposed method benefits from the use of a continuous density model and a discriminative training criterion which leads to a high recognition performance on a large vocabulary task at the cost of only a marginal increase of computation over a simple discrete pdf system. Another novel feature of the new approach is its ability to make use of limited quantities of new data for rapid adaptation to a particular speaker.
As was mentioned above, the probability that an input utterance corresponds to a given HMM can be computed by the Viterbi algorithm, which finds the sequence of model states which maximizes this probability. This optimization can be viewed as a simultaneous probability computation and alignment of the input utterance and the model.
In accordance with one aspect of the present invention, it has been determined that the alignment paths obtained with relatively computationally inexpensive discrete pdf models can be of comparable quality to those obtained with computationally costly continuous density pdf models, even though the match probabilities or metrics generated by the discrete pdf alignment do not lead to sufficiently high accuracy for large vocabulary recognition.
In accordance with another aspect of the invention, there is provided a decoupling of the alignment and final probability computation tasks. A discrete-pdf system is used to establish alignment paths of an input utterance and a reference model, while the final probability metric is obtained by post-processing frame-state pairs with more powerful, discriminatively trained continuous-density pdfs, but using the same alignment path.
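The decoupling described above may be illustrated with a small sketch in which a Viterbi pass over discrete-pdf scores fixes the alignment path and a second function sums high-resolution log densities over the frame-state pairs on that path. Everything concrete here is an assumption made for illustration: the scalar frames, the two-entry codebook, the single-Gaussian stand-in for the continuous-density pdf, and the choice to form the final metric as a plain sum of log densities without transition terms.

```python
import math

NEG_INF = -math.inf

def safe_log(p):
    return math.log(p) if p > 0 else NEG_INF

def viterbi_path(log_init, log_trans, log_obs):
    """Best state sequence under the (cheap) observation scores."""
    n = len(log_init)
    delta = [log_init[s] + log_obs[0][s] for s in range(n)]
    backptrs = []
    for t in range(1, len(log_obs)):
        new_delta, ptrs = [], []
        for s in range(n):
            best_r = max(range(n), key=lambda r: delta[r] + log_trans[r][s])
            ptrs.append(best_r)
            new_delta.append(delta[best_r] + log_trans[best_r][s] + log_obs[t][s])
        delta = new_delta
        backptrs.append(ptrs)
    path = [max(range(n), key=lambda s: delta[s])]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    path.reverse()
    return path

def rescore_along_path(frames, path, hi_res_log_pdf):
    """Sum high-resolution log densities over the fixed frame-state pairs."""
    return sum(hi_res_log_pdf(x, s) for x, s in zip(frames, path))

# Toy scalar frames and a two-entry scalar codebook (both hypothetical).
frames = [0.1, 0.4, 2.1, 1.9]
codebook = [0.0, 2.0]
codes = [min(range(len(codebook)), key=lambda k: (x - codebook[k]) ** 2) for x in frames]

# Low-resolution model: discrete pdf per state, left-to-right topology.
discrete_pdf = [[0.8, 0.2], [0.2, 0.8]]
log_init = [safe_log(1.0), safe_log(0.0)]
log_trans = [[safe_log(0.5), safe_log(0.5)], [safe_log(0.0), safe_log(1.0)]]
log_obs = [[safe_log(discrete_pdf[s][c]) for s in range(2)] for c in codes]

# High-resolution model: one Gaussian per state as a stand-in pdf.
hi_means, hi_var = [0.25, 2.0], 0.25

def hi_res_log_pdf(x, state):
    return -0.5 * (math.log(2.0 * math.pi * hi_var) + (x - hi_means[state]) ** 2 / hi_var)

path = viterbi_path(log_init, log_trans, log_obs)          # alignment: cheap model
final_score = rescore_along_path(frames, path, hi_res_log_pdf)  # metric: detailed model
print(path, final_score)
```

In this arrangement the expensive pdf is evaluated once per frame on the path, rather than once per frame for every state considered during the search, and every term of the final metric comes from the high-resolution model.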
Unlike conventional systems, where model states are characterized by one particular type of observed pdf, the state models in the present system are thus associated with both a discrete (low-resolution) pdf and a discriminatively trained, continuous-density (high-resolution) pdf. The high-resolution pdfs are trained using alignments of models and speech data obtained using the low-resolution pdfs, and thus the discriminative training incorporates knowledge of the characteristics of the discrete pdf system.