Generally speaking, a spoken dialogue system comprises a speech recognition module and a language understanding module. The speech recognition module converts the speaker's utterance into a list of possible sentences; the language understanding module then uses language and domain knowledge, i.e., grammar, to analyze the sentence list and to form a machine-readable representation of the speaker's utterance for further processing. FIG. 1 illustrates the flow of the recognition and understanding process in a spoken dialogue system. As shown in FIG. 1, the input utterance signal “tell me forecast in Taipei tonight” 1, after being recognized by the speech recognition module 2, is represented as the sentence list 3. In this example, the sentence list 3 comprises at least one possible word sequence, such as “tell me forecast in Taipei tonight” 311, “Miami forecast in Taipei tonight” 312 and “tell me forecast in Taipei at night” 313. The sentence list is then analyzed by the language understanding module 4, which generates the semantic frame 5, the machine-readable representation of the speaker's utterance.
Speech recognition errors are inevitable in spoken dialogue systems. Users' utterances in real-world applications are usually disfluent, noisy and ungrammatical. These characteristics seriously degrade the accuracy of the speech recognizer and consequently lower the performance of the dialogue system. An error-tolerant language understanding method is therefore highly desirable for dialogue systems.
Recently, the concept-based approach to language understanding (Please refer to U.S. Pat. No. 5,754,736, “System and method for outputting spoken information in response to input speech signals,” May 19, 1998; Kellner, A., B. Rueber, F. Seide and B.-H. Tran, “PADIS—An automatic telephone switchboard and directory information system,” Speech Communication, Vol. 23, 1997, pp. 95-111; Yang, Y.-J. and L.-S. Lee, “A Syllable-Based Chinese Spoken Dialogue System for Telephone Directory Services Primarily Trained with a Corpus,” in Proceedings of the 5th International Conference on Spoken Language Processing, 1998, pp. 1247-1250; Nagai, A. and Y. Ishikawa, “Concept-driven speech understanding incorporated with a statistic language model,” in Proceedings of the 5th International Conference on Spoken Language Processing, 1998, pp. 2235-2238.) has been widely adopted in dialogue systems to understand users' utterances because it can handle the erroneous and ungrammatical sentence hypotheses generated by the speech recognizer. In this approach, the output of the speech recognizer is first parsed into a concept graph according to a predefined concept grammar. Each path in the concept graph represents one possible concept sequence for the input utterance. A stochastic language model, such as a concept-bigram model, is then used to find the most probable concept sequence. Since the concept-based approach does not require the input sentences to be fully grammatical, it is robust in handling sentence hypotheses that contain speech recognition errors.
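By way of illustration only, the selection of the most probable concept sequence by a concept-bigram model may be sketched as follows. The probability table, the floor value and the function names are hypothetical and are not taken from the cited references; they merely show how competing paths of the concept graph could be ranked against one another.

```python
import math

# Hypothetical bigram probabilities P(next concept | previous concept).
# The values are invented for illustration, not estimated from any corpus.
BIGRAM = {
    ("<s>", "Query"): 0.6, ("<s>", "Location"): 0.1,
    ("Query", "Topic"): 0.7, ("Location", "Topic"): 0.2,
    ("Topic", "Location"): 0.5,
    ("Location", "Date"): 0.4, ("Location", "Time"): 0.1,
    ("Date", "</s>"): 0.6, ("Time", "</s>"): 0.5,
}
FLOOR = 1e-4  # assumed probability for bigrams missing from the table


def log_score(concepts):
    """Sum of log bigram probabilities over the padded concept sequence."""
    seq = ["<s>", *concepts, "</s>"]
    return sum(math.log(BIGRAM.get(pair, FLOOR)) for pair in zip(seq, seq[1:]))


def most_probable(paths):
    """Return the path of the concept graph with the highest bigram score."""
    return max(paths, key=log_score)


# Candidate concept sequences for the three sentences of the sentence list.
PATHS = [
    ["Query", "Topic", "Location", "Date"],
    ["Location", "Topic", "Location", "Date"],
    ["Query", "Topic", "Location", "Time"],
]
best = most_probable(PATHS)
```

Such scores only rank the candidates relative to one another; they do not indicate whether the winning sequence is itself free of recognition errors.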
FIG. 2 illustrates an embodiment of the concept-based approach. As shown in FIG. 2, the word sequence “tell me forecast in Taipei tonight” is parsed into a forest of concept parses comprising four concept parse trees 11; that is, the word sequence is parsed into four concepts: Query 7, Topic 8, Location 9 and Date 10. In this embodiment, the words “tell me” are parsed as Query 7, the word “forecast” is parsed as Topic 8, the words “in” and “Taipei” (city) are parsed as Location 9, and the word “tonight” is parsed as Date 10. The concept sequence 6 corresponding to the forest is Query Topic Location Date.
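The mapping from a word sequence to a concept sequence may be sketched, under simplifying assumptions, as a longest-match lookup in a phrase lexicon. The lexicon below and the function names are hypothetical; an actual concept grammar would build parse trees rather than flat phrase matches.

```python
# Hypothetical phrase-to-concept lexicon standing in for a concept grammar.
LEXICON = {
    ("tell", "me"): "Query",
    ("forecast",): "Topic",
    ("in", "taipei"): "Location",
    ("miami",): "Location",
    ("tonight",): "Date",
    ("at", "night"): "Time",
}
MAX_LEN = max(len(phrase) for phrase in LEXICON)


def parse_concepts(sentence):
    """Greedy longest-match parse; returns a list of (concept, phrase) pairs."""
    words = sentence.lower().split()
    parses, i = [], 0
    while i < len(words):
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in LEXICON:
                parses.append((LEXICON[phrase], " ".join(phrase)))
                i += n
                break
        else:
            i += 1  # word not covered by any concept: skip it
    return parses


def concept_sequence(sentence):
    """Concept labels only, in the order in which they occur."""
    return [concept for concept, _ in parse_concepts(sentence)]
```

For the word sequence of FIG. 2, `concept_sequence("tell me forecast in Taipei tonight")` yields the four labels Query, Topic, Location and Date in that order.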
Although the concept-based approach may select the sentence hypothesis with the fewest recognition errors, it cannot determine whether the selected hypothesis is erroneous, let alone recover from the errors. The stochastic language model in the concept-based approach is used to assess the relative likelihoods of all possible concept sequences and to select the most probable one. The scores obtained from the language model can be used to compare competing concept sequences, but not to assess the probability that the selected concept sequence is correct. Because speech recognition is imperfect, there is always a possibility that the selected concept sequence contains errors, even if it is the most probable concept sequence. For example, if the sentence list output for the utterance “tell me forecast in Taipei tonight” comprises only the two erroneous sentences “Miami forecast in Taipei tonight” and “tell me forecast in Taipei at night”, the language understanding module will be forced to pick one of the erroneous concept sequences Location (Miami) Topic (forecast) Location (in Taipei) Date (tonight) and Query (tell me) Topic (forecast) Location (in Taipei) Time (at night).
The major problem of the concept-based approach is that the definition of a concept sequence is not specific, so the approach cannot detect whether an error has occurred in the concept sequence, let alone correct the error. To address this, we proposed an error-tolerant language understanding method (Please refer to Lin, Y.-C. and H.-M. Wang, “Error-tolerant Language Understanding for Spoken Dialogue Systems,” in Proceedings of the 5th International Conference on Spoken Language Processing, Vol. 1, 2000, pp. 242-245.) to improve the robustness of dialogue systems by detecting and recovering from the errors arising from speech recognition. The basic idea of the error-tolerant model is to use exemplary concept sequences as clues for detecting and recovering from errors. In this approach, a concept parser first parses the output of the speech recognizer into concept sequences. Then, a dynamic programming procedure is used to find the best-matching pair of concept sequences, one from the parsed concept sequences and one from the exemplary concept sequences. The corresponding edit operations (i.e., insertion, deletion, or substitution) are used to determine which concepts in the most probable concept sequence are incorrect and how to recover from the errors.
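As a minimal sketch of such a dynamic programming procedure, the following computes a standard edit-distance alignment between a parsed concept sequence and each exemplary concept sequence, and reports the edit operations of the best match. The uniform unit costs, the exemplar list and the function names are assumptions made for illustration; they are not the penalties used in the cited work.

```python
def align(hyp, ref, ins=1.0, dele=1.0, sub=1.0):
    """Return (cost, ops) for the cheapest edit of hyp into ref."""
    n, m = len(hyp), len(ref)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + dele
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = 0.0 if hyp[i - 1] == ref[j - 1] else sub
            cost[i][j] = min(cost[i - 1][j - 1] + diag,
                             cost[i - 1][j] + dele,
                             cost[i][j - 1] + ins)
    # Trace back from the bottom-right cell to recover the edit operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and hyp[i - 1] == ref[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (0.0 if same else sub):
            ops.append(("keep" if same else "substitute", hyp[i - 1], ref[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + dele:
            ops.append(("delete", hyp[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, ref[j - 1]))
            j -= 1
    ops.reverse()
    return cost[n][m], ops


def best_match(hypotheses, exemplars):
    """Cheapest (cost, ops, hypothesis, exemplar) over all pairs."""
    candidates = ((*align(h, e), h, e) for h in hypotheses for e in exemplars)
    return min(candidates, key=lambda c: c[0])


# Hypothetical exemplary concept sequences.
EXEMPLARS = [["Query", "Topic", "Location", "Date"], ["Query", "Topic", "Date"]]
```

Aligning Location Topic Location Date against these exemplars yields a single substitution of the first Location by Query, which marks that concept as the likely recognition error and indicates how it could be repaired.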
In our previous work, the penalty of an edit operation is obtained from the prior probability of the edit operation. This method is simple to implement but not very accurate. To improve robustness, the utterance verification technique (Please refer to Sukkar, R. A. and Lee, C.-H., “Vocabulary Independent Discriminative Utterance Verification for Nonkeyword Rejection in Subword Based Speech Recognition,” in IEEE Trans. Speech and Audio Proc., 1996, 4(6):420-429; Rueber, B., “Obtaining Confidence Measures from Sentence Probabilities,” in Proceedings of Eurospeech, 1997, pp. 739-742.) provides a confidence level for the result of the speech recognition. The confidence level thus gives a strong hint as to how a concept should be edited. If the confidence level of a concept is high, the concept tends to be retained; otherwise, it tends to be deleted or substituted. In this invention, we incorporate the confidence measurement into the error-tolerant language understanding model. In our design, the penalty of an edit operation is assessed according to the confidence measure of the corresponding concept. The experimental results show that the new model achieves a greater improvement than the previous one in understanding the utterances of cellular phone calls. Compared to the concept-based approach, the enhanced error-tolerant model improves the precision rate of concepts from 65.09% to 76.32% and the recall rate from 64.21% to 69.13%.
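The effect of confidence-dependent penalties may be illustrated by the following sketch. The particular penalty functions are assumptions chosen only to reproduce the stated tendency: deleting or substituting a concept costs its confidence c, retaining it costs 1 - c, and inserting a missing concept costs a fixed, hypothetical amount. The actual penalty assessment of the model is not reproduced here.

```python
def weighted_distance(hyp, conf, ref, insert_cost=1.0):
    """Edit distance between hyp and ref with confidence-based penalties.

    conf[i] is the confidence (between 0 and 1) of concept hyp[i].
    Assumed penalties: delete or substitute = c, keep = 1 - c.
    """
    n, m = len(hyp), len(ref)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + conf[i - 1]      # delete hyp[i-1]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + insert_cost      # insert ref[j-1]
    for i in range(1, n + 1):
        c = conf[i - 1]
        for j in range(1, m + 1):
            diag = (1.0 - c) if hyp[i - 1] == ref[j - 1] else c
            d[i][j] = min(d[i - 1][j - 1] + diag,    # keep or substitute
                          d[i - 1][j] + c,           # delete
                          d[i][j - 1] + insert_cost)  # insert
    return d[n][m]


hyp = ["Location", "Topic", "Location", "Date"]
exemplar = ["Query", "Topic", "Location", "Date"]
# A poorly recognized first concept is cheap to substitute ...
low = weighted_distance(hyp, [0.2, 0.9, 0.9, 0.8], exemplar)
# ... whereas a confidently recognized one is expensive to change.
high = weighted_distance(hyp, [0.9, 0.9, 0.9, 0.8], exemplar)
```

Under these assumed penalties, `low` is smaller than `high`, so a low-confidence concept is readily replaced by the concept the exemplar expects, while a high-confidence concept tends to be retained.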