As is known, automatic speech recognition systems (ASRs) are designed to convert a digital representation of a voice signal, which conveys the speech, into a textual sequence of words, which hypothesizes the lexical content of the voice signal itself. The automatic recognition process uses stochastic acoustic models; hence, the result produced, in terms of the sequence of recognized words, may be affected by a non-zero residual error rate. Furthermore, the domain of the formulations recognized by an automatic speech recognition system is in any case linked to a limited vocabulary, formalized by means of a statistical language model or context-free grammars, which can be reduced to finite-state automata (this is the case, for example, of a grammar that describes the way to pronounce a date or a time). The pronunciation of words outside the vocabulary or of formulations that are not envisaged consequently generates recognition errors. It is therefore desirable for an automatic speech recognition system to have available a measure of the reliability of the recognized words.
For the above purpose, automatic speech recognition systems provide what is known in the literature as a confidence measure, which is a reliability indicator ranging between 0 and 1 and which can be applied to the individual recognized words and/or to their sequence. In the event of a recognition error, the confidence measure should assume low values, and in any case values lower than those assumed in the absence of errors. A threshold on the measured confidence values can be fixed so as to prevent insufficiently reliable results from being proposed.
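By way of a non-limiting illustration, the thresholding of confidence values may be sketched as follows. The function name, the data layout (pairs of word and confidence) and the default threshold of 0.5 are assumptions made for the example only and are not taken from any particular recognition system.

```python
def filter_by_confidence(hypotheses, threshold=0.5):
    """Keep only (word, confidence) pairs whose confidence reaches the threshold.

    `hypotheses` is a hypothetical list of (word, confidence) pairs, with
    each confidence assumed to lie between 0 and 1.
    """
    return [(word, conf) for word, conf in hypotheses if conf >= threshold]


# Illustrative values only: the second word falls below the threshold
# and is therefore not proposed.
recognized = [("call", 0.92), ("Bob", 0.31)]
accepted = filter_by_confidence(recognized, threshold=0.5)
```

In practice the threshold would be tuned on held-out data to balance false acceptances against false rejections.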
The most advanced automatic speech recognition systems enable recognition within flexible vocabularies, which are defined by the user and described by means of appropriate formalisms. To achieve this result, the voice models used for recognition are made up of elementary acoustic-phonetic units (APUs) or sub-words, the sequential composition of which enables representation of any word of a given language. The mathematical tools used for describing the temporal evolution of the voice are the so-called hidden Markov models (HMMs), and each elementary acoustic-phonetic unit is represented by a hidden Markov model, which is formed by states that describe the temporal evolution thereof. The words to be recognized, which are described as sequences of elementary acoustic-phonetic units, are obtained by concatenating individual constituent hidden Markov models.
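Purely as an illustrative sketch, the composition of a word model by concatenation of the models of its constituent units may be represented as follows. The unit names, the number of states per unit and the representation of a state as a pair (unit, state index) are hypothetical; transition probabilities are omitted for brevity.

```python
def build_word_model(unit_sequence, states_per_unit):
    """Concatenate the states of the units that compose a word.

    `unit_sequence` lists the acoustic-phonetic units of the word in order;
    `states_per_unit` maps each unit to its (assumed) number of HMM states.
    The result is the ordered list of (unit, state_index) pairs of the word.
    """
    word_states = []
    for unit in unit_sequence:
        for state_index in range(states_per_unit[unit]):
            word_states.append((unit, state_index))
    return word_states


# Hypothetical three-unit word in which every unit has three states.
model = build_word_model(["k", "ae", "t"], {"k": 3, "ae": 3, "t": 3})
```

Because every word is assembled from the same inventory of units, adding a word to the vocabulary requires only its unit sequence and no new acoustic training.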
In addition to describing the temporal evolution of the voice, hidden Markov models enable generation of the likelihoods of emission, also known as output likelihoods, of the acoustic states that form them, given the observation vectors that convey the information of the voice signal. The sequence of these likelihoods, together with their temporal evolution, enables the recognition result to be obtained.
The likelihood of emission of the acoustic states of the hidden Markov models can be obtained using a characterizing statistical model, in which the distributions of the observation vectors, which summarize the information content of the voice signal at discrete time quanta, hereinafter referred to as frames, are, for example, represented by mixtures of multivariate Gaussian distributions. By means of training algorithms, which are known in the literature as Segmental K-Means and Forward Backward, it is possible to estimate mean value and variance of the Gaussian distributions of the mixtures, starting from pre-recorded and annotated voice databases. For a more detailed description of hidden Markov model theory, algorithms and implementation, reference may be made to Huang X., Acero A., and Hon H. W., Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall, Chapter 8, pages 377-413, 2001.
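For illustration only, the emission likelihood of one acoustic state for a single frame, modelled by a mixture of multivariate Gaussian distributions, may be computed as sketched below. Diagonal covariance matrices are assumed here for simplicity, and the function and parameter names are hypothetical; the computation is carried out in the logarithmic domain to avoid numerical underflow.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log emission likelihood of observation x under a Gaussian mixture."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp: subtract the largest term before exponentiating.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

The mixture weights are assumed to be strictly positive and to sum to one; the means and variances are those that the training algorithms mentioned above would estimate.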
Another method for obtaining the likelihood of emission of hidden Markov models is to use discriminative models, which tend to highlight the peculiarities of the individual models as compared to others. A technique that is known in the literature and is widely used is that of artificial neural networks (ANNs), which represent nonlinear systems capable of producing, following upon training, the likelihoods of emission of the acoustic states, given the acoustic observation vector. Recognition systems of this type are generally referred to as Hybrid HMM-NNs.
The detail of the elementary acoustic-phonetic units used as components for composition of the word models may depend upon the type of modelling of the likelihoods of emission of the Markov states. In general, when characterizing models (mixtures of multivariate Gaussian distributions) are used, contextual elementary acoustic-phonetic units are adopted. A typical example is that of triphones, which represent the basic phonemes of a given language specialized within the words according to their left-hand and right-hand contexts (adjacent phonemes). In the case of elementary acoustic-phonetic units trained using discriminative training (for example, with the ANN technique), the need for contextualization may be less marked; in this case, use may be made of context-independent phonemes or of composition of stationary units (i.e., the stationary part of the context-independent phonemes) and transition units (i.e., transition biphones between phonemes), as described in L. Fissore, F. Ravera, P. Laface, Acoustic-Phonetic Modelling for Flexible Vocabulary Speech Recognition, Proc. of EUROSPEECH, pp. I 799-802, Madrid, Spain, 1995.
A method that is widely employed in the prior art to compute the confidence consists in the use of the so-called a posteriori likelihoods, which are quantities derived from the emission likelihoods of the hidden Markov models. The logarithm of the a posteriori likelihoods, calculated for each frame, can be averaged by weighting all the frames equally or else weighting all the phonemes equally, as described in Rivlin, Z. et al., A Phone-Dependent Confidence Measure for Utterance Rejection, Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, Atlanta, Ga., pp. 515-517 (May 1996). Similar criteria can be used also in Hybrid HMM-NN systems, which are able to produce directly the a posteriori likelihoods of the states/phonemes. In Bernardis G. et al., Improving Posterior Based Confidence Measures in Hybrid HMM/ANN Speech Recognition System, Proceedings of the International Conference on Spoken Language Processing, pp. 775-778, Sydney, Australia (December 1998), a comparison is provided of different ways of averaging a posteriori likelihoods to obtain confidence measures on a phoneme basis or on a word basis.
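The two averaging schemes mentioned above, namely weighting all the frames equally or weighting all the phonemes equally, may be sketched as follows. This is a simplified illustration and not the exact formulation of the cited works; the input layout (one list of per-frame a posteriori likelihoods for each phoneme of the word) and the function names are assumptions.

```python
import math


def frame_weighted_score(phone_posteriors):
    """Average log posterior with every frame weighted equally."""
    logs = [math.log(p) for frames in phone_posteriors for p in frames]
    return sum(logs) / len(logs)


def phone_weighted_score(phone_posteriors):
    """Average log posterior with every phoneme weighted equally.

    The log posteriors are first averaged within each phoneme, and the
    per-phoneme averages are then averaged across phonemes.
    """
    per_phone = [
        sum(math.log(p) for p in frames) / len(frames)
        for frames in phone_posteriors
    ]
    return sum(per_phone) / len(per_phone)


# Hypothetical word made of a long, well-matched phoneme (4 frames)
# and a short, poorly matched one (1 frame).
posteriors = [[0.5, 0.5, 0.5, 0.5], [0.25]]
by_frame = frame_weighted_score(posteriors)
by_phone = phone_weighted_score(posteriors)
```

With equal frame weights, long phonemes dominate the average; with equal phoneme weights, a short but badly matched phoneme lowers the score more, as the example values show.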
Another widely used technique envisages normalizing the a posteriori likelihoods, or directly the emission likelihoods, that concur in the confidence computation by means of a factor that does not take into account the lexical and grammatical recognition constraints. The comparison of the two quantities, i.e., the result obtained applying the constraints and the result obtained relaxing the constraints, provides information useful for determining the confidence. In fact, if the two quantities have comparable values, it means that the introduction of the recognition constraints has not produced any particular distortion with respect to what would have happened without recognition constraints. The recognition result may therefore be considered reliable, and its confidence should have high values, close to its upper limit. When, instead, the constrained result is considerably worse than the unconstrained result, it may be inferred that the recognition is not reliable in so far as the automatic speech recognition system would have produced a result different from the one obtained as a consequence of the application of the constraints. In this case, the confidence measure ought to produce low values, close to its lower limit.
Various embodiments of this technique have been proposed in the literature. In Gillick M. et al., A Probabilistic Approach to Confidence Estimation and Evaluation, Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, Munich, Germany, pp. 879-882 (May 1997), the difference between quantities known as acoustic score and best score is adopted, where the two terms are obtained respectively by averaging the acoustic score (with constraints) and the best score (without constraints), produced for each frame by the acoustic hidden Markov models on the time interval corresponding to the words. Likewise, in U.S. Pat. No. 5,710,866 to Alleva et al., the confidence is computed as the difference between a constrained acoustic score and an unconstrained acoustic score, and this difference is calculated for each frame so as to be usable for adjusting the constrained acoustic score employed during recognition. Weintraub, M. et al., Neural Network Based Measures of confidence for Word Recognition, Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, Munich, Germany, pp. 887-890 (May 1997), proposes a family of confidence measures based upon acoustic features that differ for the type of models used as normalization factor and for the level at which the logarithms of the likelihoods of the hidden Markov models for the various frames are combined (word level, phone level, phone-state level). The models used as normalization may be context-independent phonemes or Gaussian-mixture models (GMMs) and enable the result to be obtained in the case of absence of recognition constraints.
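A difference-based measure of the general kind described above may be sketched as follows. The sketch is a generic illustration under stated assumptions and does not reproduce the formulation of any of the cited documents: the scores are assumed to be logarithmic, the unconstrained score of a frame is taken to be the best score among all the states, and all the names are hypothetical.

```python
def score_difference(constrained_log_scores, frame_log_scores):
    """Mean constrained score minus mean best (unconstrained) score.

    `constrained_log_scores[t]` is the log score of the state selected
    under the recognition constraints at frame t; `frame_log_scores[t]`
    lists the log scores of all the states at frame t.
    """
    if len(constrained_log_scores) != len(frame_log_scores):
        raise ValueError("both inputs must cover the same frames")
    if not constrained_log_scores:
        raise ValueError("at least one frame is required")
    n = len(constrained_log_scores)
    constrained = sum(constrained_log_scores) / n
    best = sum(max(frame) for frame in frame_log_scores) / n
    return constrained - best
```

A value of zero indicates that the constrained path coincides with the best unconstrained choice at every frame; increasingly negative values indicate that the constraints have forced a progressively worse acoustic match.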
A further formulation, which refers to a hybrid HMM-NN system, is proposed in Andorno M. et al., Experiments in Confidence Scoring for Word and Sentence Verification, Proc. of the International Conference on Spoken Language Processing, pp. 1377-1380, Denver, Colo. (September 2002). In this case, the confidence is obtained as a ratio between the unconstrained acoustic score and the constrained acoustic score. The numerator is calculated as the average, over the number of frames of the word, of the logarithms of the best a posteriori likelihood among all the states of the acoustic models, whereas the denominator is represented by the average of the a posteriori likelihoods over the sequence of states that is produced by the so-called Viterbi alignment. For a more detailed description of the Viterbi alignment, reference may be made to the above-referenced Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, chapter 8.
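A ratio-based measure of the kind just described may be sketched as follows. The sketch assumes that logarithms are taken in both the numerator and the denominator, so that the ratio lies between 0 and 1; this reading, the data layout and the names used are assumptions made for the example and are not confirmed by the cited paper.

```python
import math


def ratio_confidence(frame_posteriors, aligned_states):
    """Ratio of the unconstrained to the constrained average log posterior.

    `frame_posteriors[t]` lists the a posteriori likelihoods of all the
    states at frame t; `aligned_states[t]` is the index of the state that
    the Viterbi alignment assigns to frame t.
    """
    n = len(frame_posteriors)
    unconstrained = sum(math.log(max(frame)) for frame in frame_posteriors) / n
    constrained = sum(
        math.log(frame[state])
        for frame, state in zip(frame_posteriors, aligned_states)
    ) / n
    if constrained == 0.0:
        # Every aligned posterior equals 1, so the aligned states are
        # also the best states: full confidence.
        return 1.0
    return unconstrained / constrained


# Hypothetical two-frame word with three states per frame.
frames = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
matching = ratio_confidence(frames, [0, 1])     # alignment picks the best states
mismatched = ratio_confidence(frames, [1, 1])   # first frame aligned to a worse state
```

Since both averages are non-positive and the best posterior is never smaller than the aligned one, the ratio equals 1 when the alignment follows the best states and decreases towards 0 as the aligned states become less likely.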
Likewise known in the prior art are techniques for improving the confidence measure. In U.S. Pat. No. 6,421,640, for example, the confidence is adapted via an offset specific for the user or the phrase pronounced, prior to being compared with the threshold for deciding whether to propose the recognition result to the user.
In U.S. Pat. No. 6,539,353 the improvement in the quality of the confidence measure is obtained by applying a specific weighting to sub-word confidence measures, which are then combined to obtain the word confidence measure.
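The combination of weighted sub-word confidence measures into a word confidence measure may be illustrated, in its simplest form, by a normalized weighted average. The choice of a weighted average, the normalization by the sum of the weights and the names used are assumptions made for the example; they are not taken from the cited patent.

```python
def word_confidence(subword_confidences, weights):
    """Combine sub-word confidences into a word confidence.

    Each sub-word confidence is multiplied by its (hypothetical) weight,
    and the sum is normalized by the total weight so that the result
    stays within the range of the inputs.
    """
    if len(subword_confidences) != len(weights) or not weights:
        raise ValueError("one weight is required per sub-word confidence")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must have a positive sum")
    weighted_sum = sum(c * w for c, w in zip(subword_confidences, weights))
    return weighted_sum / total_weight


# Illustrative values: the second sub-word is given three times the weight
# of the first, so it pulls the word confidence towards its own value.
combined = word_confidence([0.9, 0.5], [1.0, 3.0])
```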
The variety of elementary acoustic-phonetic units, also within the same automatic speech recognition system, causes a lack of homogeneity in the emission likelihoods (a posteriori likelihoods), which are affected by the detail of the elementary acoustic-phonetic units, by their occurrence in the words of a given language, by the characteristics of the sounds that they represent, by the amount of training material available for their estimation, and so forth. Since the confidence measures known in the literature are derived from the emission likelihoods, variability of the latter produces instability and lack of homogeneity in the confidence itself.