Speech recognition may be defined as the process of converting a spoken waveform into a textual string of words, such as a sentence expressed in the English language. In a front-end phase, “raw” speech signals are spectrally analyzed and converted into a sequence of feature vectors (observations). In an acoustic modeling phase, the sequence of feature vectors is examined to extract phone sequences (e.g., simple vowel or consonant sounds) using knowledge about acoustic environments, gender and dialect differences, and phonetics. In a language modeling phase, the phone sequences are converted into corresponding word sequences using knowledge of what constitutes a possible word, what words are likely to occur, and in what sequence. A spoken language processing system makes use of the word sequences from the speech recognition system and produces different levels of meaning representations. Examples of such spoken language processing systems include spoken language understanding, information extraction, information retrieval, and dialogue systems.
Due to the complexity and intricacies of language combined with varied acoustic environments, realizing a truly human-like speech recognition system poses significant challenges. For example, a speech recognition system must contend with the lexical and grammatical complexity and variations of spoken language as well as the acoustic uncertainties of different accents and speaking styles. A speech recognition system's determination, from the spoken waveforms, of a speech element such as a word or sentence is therefore often incorrect.
Therefore, the speech recognition system calculates a degree of confidence, referred to herein as a confidence score, for the determined speech elements. If the calculated score is low, a spoken dialogue system that uses the speech recognition system may discard the determined speech elements and, for example, request new input. For example, the system may output a message requesting the speaker to repeat a word or sentence.
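The accept-or-reprompt behavior described above can be sketched as follows. This is only an illustrative sketch: the function name `handle_recognition`, the threshold value of 0.5, and the wording of the prompt are assumptions made for the example and are not taken from any particular system.

```python
def handle_recognition(speech_element, confidence_score, threshold=0.5):
    """Accept a recognized speech element or ask the speaker to repeat it.

    The threshold of 0.5 is an arbitrary, hypothetical value; a real system
    would tune it for its task and recognizer.
    """
    if confidence_score < threshold:
        # Low confidence: discard the hypothesis and request new input.
        return ("reprompt", "Sorry, could you please repeat that?")
    # Sufficient confidence: pass the speech element on for further processing.
    return ("accept", speech_element)


# Example usage with made-up confidence scores.
print(handle_recognition("new york", 0.92))
print(handle_recognition("new york", 0.18))
```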
Indeed, there has been considerable interest in the speech recognition community in obtaining confidence scores for recognized words (see, e.g., Weintraub et al., Neural Network Based Measures of Confidence for Word Recognition, Proc. ICASSP-97, Vol. 2, pages 887-890 (1997); Zhang et al., Word Level Confidence Annotation Using Combinations of Features, Proc. Eurospeech, Aalborg, pages 2105-2108 (2001)) or utterances (see, e.g., San-Segundo et al., Confidence Measures for Spoken Dialogue Systems, ICASSP (2001); Wang et al., Error-Tolerant Spoken Language Understanding with Confidence Measuring, ICSLP-2002). Computing confidence scores at the concept level has gained more attention due to increased research activities and real-world applications in dialogue and in information retrieval and extraction (see, e.g., Ammicht et al., Ambiguity Representation and Resolution in Spoken Dialogue Systems, Proc. Eurospeech (2001); Guillevic et al., Robust Semantic Confidence Scoring, Proc. ICSLP, pages 853-856 (2002)).
To calculate the confidence score, the speech recognition system inputs a set of data into a statistical model, e.g., a maximum entropy model (see, e.g., Berger et al., A Maximum Entropy Approach to Natural Language Processing, Computational Linguistics, 22 (1): 39-71 (1996); Zhou et al., A Fast Algorithm for Feature Selection in Conditional Maximum Entropy Modeling, Proceedings of Empirical Methods in Natural Language Processing, Sapporo, Japan (Jul. 11-12, 2003)), which outputs the confidence score. The input data set may include numerous features that bear upon the organization of a speech element. A different weighting may be applied to each feature, so that certain features bear more strongly on the calculation than others.
For example, the maximum entropy model outputs a probability that the input signals represent a particular speech element y given an observation x, subject to constraints imposed by a set of selected features fj(x, y), where fj(x, y) is a feature function (or feature for short) that describes a certain acoustic, linguistic, or other event (x, y). For a particular observation, particular ones of the features are either present (1) or absent (0). The maximum entropy model may take the form of
$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{j} \lambda_j f_j(x, y) \right),$$
where λj is a weight assigned to a feature fj indicating how important the feature fj is for the model, Z(x) is a normalization factor, and p(y|x) is the resulting conditional probability.
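As an illustration of how the conditional probability above might be evaluated, the following minimal sketch computes p(y|x) for a small set of candidate labels, with Z(x) obtained by summing the exponentiated scores over all candidates. The feature functions, the field names `acoustic_score` and `num_phones`, the labels "correct" and "incorrect", and the weight values are all hypothetical and chosen only for the example; in practice the weights would be estimated from training data.

```python
import math


def maxent_probabilities(weights, features, x, labels):
    """Return p(y | x) for every candidate label y under a maximum entropy model.

    weights  -- list of lambda_j values, one per feature
    features -- list of binary feature functions f_j(x, y) returning 0 or 1
    """
    # Weighted feature sum for each candidate label y.
    scores = {
        y: sum(w * f(x, y) for w, f in zip(weights, features))
        for y in labels
    }
    # Subtracting the maximum score leaves the ratios unchanged and
    # avoids overflow in exp().
    top = max(scores.values())
    exp_scores = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exp_scores.values())  # normalization factor Z(x), up to the shift
    return {y: e / z for y, e in exp_scores.items()}


# Hypothetical binary features: each fires (1) or does not fire (0).
def f_high_acoustic(x, y):
    return 1 if x["acoustic_score"] > 0.5 and y == "correct" else 0


def f_short_word(x, y):
    return 1 if x["num_phones"] <= 2 and y == "incorrect" else 0


observation = {"acoustic_score": 0.9, "num_phones": 5}
probs = maxent_probabilities(
    weights=[1.0, 1.0],  # illustrative values, not trained weights
    features=[f_high_acoustic, f_short_word],
    x=observation,
    labels=["correct", "incorrect"],
)
print(probs)
```

Here the probability assigned to the label "correct" could serve as a confidence score for the hypothesized speech element.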
Conventional spoken language processing systems, driven mostly by relatively simple dialogue, information retrieval, and information extraction applications, are limited to computing confidence scores at only three levels: a word level; a domain-dependent semantic slot or concept level, e.g., where “New York” is a single semantic slot in a travel-related domain; and a sentence/utterance level that focuses on special phrases for a task context of the speech recognition system.
The confidence score computation algorithms at the three levels, i.e., word, concept, and sentence, may achieve good results when applied to relatively simple spoken language processing systems, including simple dialogue systems, such as command-and-control or slot-based dialogue systems. For more sophisticated dialogue systems (see, e.g., Weng et al., CHAT: A Conversational Helper for Automotive Tasks, Proceedings of the 9th International Conference on Spoken Language Processing (Interspeech/ICSLP), pages 1061-1064, Pittsburgh, Pa. (September 2006)), however, use of the three-level confidence score paradigm may result in ineffective and annoying dialogues. For example, the system might repeatedly ask a user to repeat a request, because the system can identify only individual words or whole sentences in which it lacks confidence and therefore requires the user to repeat the entire sentence.