Technical Field
The invention relates to speech recognition. More particularly, the invention relates to efficient empirical determination, computation, and use of an acoustic confusability measure.
Description of the Prior Art
In United States Patent Application Publication No. 20020032549, it is stated:
In the operation of a speech recognition system, some acoustic information is acquired, and the system determines a word or word sequence that corresponds to the acoustic information. The acoustic information is generally some representation of a speech signal, such as the variations in voltage generated by a microphone. The output of the system is the best guess that the system has of the text corresponding to the given utterance, according to its principles of operation.
The principles applied to determine the best guess are those of probability theory. Specifically, the system produces as output the most likely word or word sequence corresponding to the given acoustic signal. Here, “most likely” is determined relative to two probability models embedded in the system: an acoustic model and a language model. Thus, if A represents the acoustic information acquired by the system, and W represents a guess at the word sequence corresponding to this acoustic information, then the system's best guess W* at the true word sequence is given by the solution of the following equation:

$$W^{*} = \operatorname*{arg\,max}_{W} \; P(A \mid W)\, P(W).$$

Here P(A|W) is a number determined by the acoustic model for the system, and P(W) is a number determined by the language model for the system. A general discussion of the nature of acoustic models and language models can be found in “Statistical Methods for Speech Recognition,” Jelinek, The MIT Press, Cambridge, Mass. 1999, the disclosure of which is incorporated herein by reference. This general approach to speech recognition is discussed in the paper by Bahl et al., “A Maximum Likelihood Approach to Continuous Speech Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume PAMI-5, pp. 179-190, March 1983, the disclosure of which is incorporated herein by reference.
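By way of illustration only, the decision rule stated above may be sketched as follows. The sketch selects, from an explicitly enumerated set of candidates, the word sequence that maximizes P(A|W)P(W), comparing sums of logarithms rather than products to avoid numerical underflow. The function name `best_guess`, the candidate word sequences, and all probability values are hypothetical and are not taken from the cited publication; a practical recognizer searches a far larger hypothesis space rather than enumerating it.

```python
import math

def best_guess(candidates):
    """Return the word sequence W that maximizes P(A|W) * P(W).

    `candidates` maps each hypothesized word sequence to a pair
    (P(A|W), P(W)). Both probabilities are assumed to be strictly
    positive, since math.log is undefined at zero.
    """
    def log_score(item):
        _, (p_acoustic, p_language) = item
        # log P(A|W) + log P(W) has the same argmax as the product.
        return math.log(p_acoustic) + math.log(p_language)

    best_w, _ = max(candidates.items(), key=log_score)
    return best_w

# Hypothetical scores for a single utterance (illustrative values only).
hypotheses = {
    "recognize speech": (1e-5, 1e-3),
    "wreck a nice beach": (2e-5, 1e-5),
}
print(best_guess(hypotheses))
```

In this hypothetical example the second candidate has the higher acoustic score, yet the first is selected because its language model probability is much larger, which shows how the two models jointly determine the output.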
The acoustic and language models play a central role in the operation of a speech recognition system: the higher the quality of each model, the more accurate the recognition system. A frequently-used measure of quality of a language model is a statistic known as the perplexity, as discussed in section 8.3 of Jelinek. For clarity, this statistic will hereafter be referred to as “lexical perplexity.” It is a general operating assumption within the field that the lower the value of the lexical perplexity, on a given fixed test corpus of words, the better the quality of the language model.
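For concreteness, the following minimal sketch computes perplexity under its common definition, namely the exponential of the negative average log probability that the language model assigns to each word of the test corpus given its history. That this definition coincides in every detail with the one in section 8.3 of Jelinek is an assumption, and the function name `lexical_perplexity` and the probability values are hypothetical.

```python
import math

def lexical_perplexity(word_probs):
    """Perplexity of a test corpus.

    `word_probs` holds, for each word w_i of the corpus, the probability
    P(w_i | history) assigned by the language model. Each value is
    assumed to be strictly positive.
    """
    n = len(word_probs)
    if n == 0:
        raise ValueError("the test corpus must contain at least one word")
    avg_log_prob = sum(math.log(p) for p in word_probs) / n
    return math.exp(-avg_log_prob)

# A model that assigns probability 0.25 to every word behaves like a
# uniform choice among four words, so its perplexity is approximately 4.
uniform_model = [0.25] * 8
# A sharper model that assigns 0.5 to every word scores lower.
sharper_model = [0.5] * 8
print(lexical_perplexity(uniform_model), lexical_perplexity(sharper_model))
```

Under this definition, a lower value means that the model on average assigns higher probability to the words actually observed in the test corpus.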
However, experience shows that lexical perplexity can decrease while errors in decoding words increase. For instance, see Clarkson et al., “The Applicability of Adaptive Language Modeling for the Broadcast News Task,” Proceedings of the Fifth International Conference on Spoken Language Processing, Sydney, Australia, November 1998, the disclosure of which is incorporated herein by reference. Thus, lexical perplexity is actually a poor indicator of language model effectiveness.
Nevertheless, lexical perplexity continues to be used as the objective function for the training of language models, when such models are determined by varying the values of sets of adjustable parameters. What is needed is a better statistic for measuring the quality of language models, and hence for use as the objective function during training.
United States Patent Application Publication No. 20020032549 teaches an invention that attempts to solve these problems by:
Providing two statistics that are better than lexical perplexity for determining the quality of language models. These statistics, called acoustic perplexity and the synthetic acoustic word error rate (SAWER), in turn depend upon methods for computing the acoustic confusability of words. Some methods and apparatuses disclosed herein substitute models of acoustic data in place of real acoustic data in order to determine confusability.
In a first aspect of the invention taught in United States Patent Application Publication No. 20020032549, two word pronunciations l(w) and l(x) are chosen from all pronunciations of all words in fixed vocabulary V of the speech recognition system. It is the confusability of these pronunciations that is desired. To compute it, an evaluation model (also called valuation model) of l(w) is created, a synthesizer model of l(x) is created, and a matrix is determined from the evaluation and synthesizer models. Each of the evaluation and synthesizer models is preferably a hidden Markov model. The synthesizer model preferably replaces real acoustic data. Once the matrix is determined, a confusability calculation may be performed. This confusability calculation is preferably performed by reducing an infinite series of multiplications and additions to a finite matrix inversion calculation. In this manner, an exact confusability calculation may be determined for the evaluation and synthesizer models.
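One standard identity by which an infinite series of matrix multiplications and additions reduces to a single inversion is the Neumann series, I + M + M^2 + ... = (I - M)^(-1), which holds when the powers of M decay to zero. The following sketch checks that identity numerically on a toy matrix. It is offered only as an illustration of the general idea: the matrix `M` below is hypothetical and is not built from evaluation or synthesizer models, and it is an assumption that the cited publication relies on this particular identity.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def invert(m):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    ident = identity(n)
    aug = [[float(v) for v in m[i]] + ident[i] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def series_sum_closed_form(m):
    """Return (I - M)^(-1), the limit of I + M + M^2 + ... when it converges."""
    n = len(m)
    i_minus_m = [[(1.0 if i == j else 0.0) - m[i][j] for j in range(n)]
                 for i in range(n)]
    return invert(i_minus_m)

def series_sum_truncated(m, terms):
    """Return the first `terms` terms of I + M + M^2 + ... by direct summation."""
    n = len(m)
    total = identity(n)
    power = identity(n)
    for _ in range(terms - 1):
        power = mat_mul(power, m)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total

# Hypothetical toy matrix whose eigenvalues (0.5 and 0.1) lie inside the
# unit circle, so the series converges.
M = [[0.2, 0.3], [0.1, 0.4]]
closed = series_sum_closed_form(M)
approx = series_sum_truncated(M, 200)
print(closed)
print(approx)
```

The closed form requires one inversion of a matrix of fixed size, whereas direct summation must be truncated and is therefore only approximate; this is the sense in which a reduction of this kind can yield an exact result.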
In additional aspects of the invention taught in United States Patent Application Publication No. 20020032549, different methods are used to determine certain numerical quantities, defined below, called synthetic likelihoods. In other aspects of the invention, (i) the confusability may be normalized and smoothed to better deal with very small probabilities and the sharpness of the distribution, and (ii) methods are disclosed that increase the speed of performing the matrix inversion and the confusability calculation. Moreover, a method for caching and reusing computations for similar words is disclosed.
Such teachings are nonetheless limited and subject to improvement.