Key-word spotting and rule-based approaches have been widely used in the area of spoken language understanding, partly because of their robustness to speech recognition and human errors and partly because they perform adequately in relatively simple application domains dealing with relatively simple language subsets. See, for example, Jackson et al., “A template matcher for robust natural language interpretation”, In Proceedings of DARPA Speech and Natural Language Workshop, 1991, Seneff, “TINA: a natural language system for spoken language applications”, Computational Linguistics, Vol. 18, No. 1, pp. 61-86, 1992, and Dowding et al., “Interleaving Syntax and Semantics in an Efficient Bottom-Up Parser”, In the Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, N.Mex., 1994.
Commercial speech companies have adopted key-word spotting and rule-based techniques in their products. See, for example, products by Nuance Communications, Inc. 2001, including the Nuance Speech Recognition System, 8.0: Grammar Developer's Guide, and by Scansoft, Inc. 2004, including the ScanSoft VoCon 3200: Software Development Kit, Developer's Guide. Version 2.0. February, 2004.
As more applications are included in spoken dialog systems, the limited usage of spoken language imposed by the systems, shallow understanding, and/or miscommunication with the system may annoy users and delay the adoption of such systems in various environments.
In this regard, data sparseness may be a common problem for statistical approaches when there is not enough data for training. This may be particularly important when a new application needs to be developed quickly to respond to a market need. However, where a model uses too many parameters to fit a training data set, the resulting model may become overfit to that particular data set, with little robustness to unforeseen data.
In the past, smoothing methods for maximum entropy modeling have attracted attention in language processing and information retrieval research. Among these methods, smoothing with Gaussian priors has been successful, especially when data is sparse. However, when applying the priors, past research work has simply used a cut-off technique for feature selection, so that the Gaussian prior is applied only during parameter computation. In other words, past work addresses only the data sparseness issue without considering the data overfitting issue.
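By way of illustration only, the effect of a Gaussian prior on parameter computation may be sketched as follows. The sketch assumes a small conditional maximum entropy classifier trained by plain gradient ascent, in which a zero-mean Gaussian prior with variance `sigma2` contributes a penalty term `-w / sigma2` to the gradient of each weight. The function names, the toy utterances, the labels, and the learning settings are hypothetical and are not taken from any of the cited works.

```python
import math

def predict(weights, feats, labels):
    """Return P(label | feats) under a conditional maximum entropy model."""
    scores = {y: sum(weights.get((f, y), 0.0) for f in feats) for y in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train_maxent(data, labels, sigma2, lr=0.1, epochs=500):
    """Gradient ascent on the log-likelihood plus a Gaussian log-prior.

    sigma2 is the prior variance (an assumed hyperparameter): a smaller
    value pulls every weight more strongly toward zero.
    """
    weights = {}
    step = lr / len(data)
    for _ in range(epochs):
        grad = {}
        for feats, gold in data:
            probs = predict(weights, feats, labels)
            for f in feats:
                for y in labels:
                    # observed feature count minus expected feature count
                    observed = 1.0 if y == gold else 0.0
                    grad[(f, y)] = grad.get((f, y), 0.0) + observed - probs[y]
        for key, g in grad.items():
            w = weights.get(key, 0.0)
            # the Gaussian prior contributes -w / sigma2 to the gradient
            weights[key] = w + step * (g - w / sigma2)
    return weights

def l2_norm(weights):
    return math.sqrt(sum(w * w for w in weights.values()))

# Hypothetical toy data: bag-of-words features with a topic label.
LABELS = ["music", "navigation"]
DATA = [
    ({"play", "music"}, "music"),
    ({"play", "song"}, "music"),
    ({"find", "restaurant"}, "navigation"),
    ({"navigate", "home"}, "navigation"),
]

weak_prior = train_maxent(DATA, LABELS, sigma2=100.0)
strong_prior = train_maxent(DATA, LABELS, sigma2=0.5)
print(l2_norm(weak_prior), l2_norm(strong_prior))
```

In this sketch the prior acts only while the weights are being estimated; which features enter the model is fixed in advance by the data, which mirrors the limitation of the past work noted above.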
Before mature speech recognition technologies were available, understanding spoken language was primarily investigated under the subject of dealing with extra-grammaticality, then an important topic in computational linguistics, as discussed, for example by J. Carbonell & P. Hayes in “Recovery Strategies for Parsing Extragrammatical Language,” American Journal of Computational Linguistics, Vol. 9 (3-4), 1983, D. Hindle, in “Deterministic Parsing of Syntactic Non-fluencies,” Proceedings of 21st Annual Meeting of Association for Computational Linguistics, pp. 123-128, 1983, and W. Levelt in “Monitoring and Self-repair in Speech,” Cognition, 14:41-104, 1983. With the push from DARPA HLT programs for more than a decade, research in this area has advanced to a new level.
Among others, key-word spotting and rule-based approaches have been widely used techniques in the area of spoken language understanding, partly because of their robustness to speech recognition and human errors, and partly because of their adequate applications in relatively simple application domains dealing with relatively simple language subsets, as discussed, for example, by Jackson et al. in “A template matcher for robust natural language interpretation,” Proceedings of DARPA Speech and Natural Language Workshop, 1991, Seneff in “TINA: a natural language system for spoken language applications,” Computational Linguistics, Vol. 18, No. 1, pp. 61 to 86, 1992, Dowding et al. in “Interleaving Syntax and Semantics in an Efficient Bottom-Up Parser,” Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, N.Mex., 1994, and in U.S. Pat. No. 6,292,767 entitled “Method and system for building and running natural language understanding systems,” which was filed on Dec. 21, 1995 and granted in 2001. Because of their success, even commercial speech companies, such as Nuance and Scansoft, have adopted these techniques in their products. See, for example, the Nuance Speech Recognition System, 8.0: Grammar Developer's Guide, published 2001 by Nuance Communications, Inc., and the ScanSoft VoCon 3200: Software Development Kit, Developer's Guide, Version 2.0, published February 2004 by Scansoft, Inc. As more applications are added to dialog systems, the limited usage of spoken language imposed by the systems may annoy the users and delay the adoption of such systems in various fields.
In the area of parsing written texts, a field related to spoken language understanding, statistical methods have dominated the performance on the Wall Street Journal (WSJ) portion of Penn Treebank. See, for example, Michael Collins in “Three Generative, Lexicalised Models for Statistical Parsing,” Proceedings of the 35th Annual Meeting of the ACL (jointly with the 8th Conference of the EACL), Madrid, 1997, Ratnaparkhi in “Maximum Entropy Models for Natural Language Ambiguity Resolution,” Ph.D. thesis, University of Pennsylvania, 1998, Collins in “Head-Driven Statistical Models for Natural Language Parsing,” Computational Linguistics, 2003, Klein et al. in “Accurate Unlexicalized Parsing,” Proceedings of 41st Annual Meeting of Association for Computational Linguistics, 2003, and Bod in “An Efficient Implementation of a New DOP Model,” Proceedings EACL'03, Budapest, Hungary, 2003. In named entity (NE) recognition, as driven by DARPA information extraction programs, seven named entity categories have been proposed, i.e., person, organization, location, time, date, money, and percent. See Chinchor, “Overview of MUC7/MET-2”, In Proceedings of the Seventh Message Understanding Conference (MUC7), 1998. Other researchers continued this direction of work but with only four named entity (NE) types, i.e., person, organization, location, and miscellaneous. See, for example, De Meulder, “Memory-based Named Entity Recognition using Unannotated Data”, In Proceedings of CoNLL-2003, Edmonton, Canada.
For the past decade, understanding aspects of spoken languages, such as using prosody for disfluencies and sentence boundaries, has also received a great deal of attention. See, for example, Shriberg, “Preliminaries to a Theory of Speech Disfluencies”, PhD thesis, University of California at Berkeley, 1994, Heeman, “Detecting and Correcting Speech Repairs”, Association for Computational Linguistics, Las Cruces, N. Mex., 1994, Shriberg et al., “Prosody modeling for automatic speech understanding: an overview of recent research at SRI,” in Proc. ISCA Workshop on Speech Recognition and Understanding, pp. 13 to 16, 2001, and Charniak, “Edit Detection and Parsing for Transcribed Speech”, Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics, pp. 118 to 126, 2001.
Recently, statistical approaches have started to obtain more attention in the study of spoken language understanding. See, for example, He, “Robustness Issues in a Data-Driven Spoken Language Understanding System”, HLT-NAACL 2004 Workshop on Spoken Language Understanding for Conversational Systems and Higher Level Linguistic Information for Speech Processing, Boston, USA, and Wutiwiwatchai et al., “Hybrid Statistical and Structural Semantic Modeling for Thai Multi-Stage Spoken Language Understanding”, HLT-NAACL 2004 Workshop on Spoken Language Understanding for Conversational Systems and Higher Level Linguistic Information for Speech Processing, Boston, USA. Here, the approaches advocated by He and Wutiwiwatchai et al. use a multi-stage understanding strategy, which is a strategy computational linguists attempted before, such as, for example, the strategies discussed by Frazier et al., “The Sausage Machine: A New Two-Stage Parsing Model”, Cognition, Volume 6, pp. 291 to 325, 1978, and Abney, “Parsing By Chunks”, In: Robert Berwick, Steven Abney and Carol Tenny (eds.), Principle-Based Parsing, Kluwer Academic Publishers, Dordrecht, 1991. The approaches proposed by He and Wutiwiwatchai et al., however, put an emphasis on modeling spoken language with statistical processes. In particular, such approaches decompose the spoken language processing into speech recognition, semantic parsing and dialog act decoding. The semantic parser models and identifies domain semantic chunks using push-down automata with probabilistic moves, and the dialog decoder finds a dialog act based on the set of semantic concepts identified by the semantic parser with an extended naïve Bayesian algorithm.
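For illustration only, the dialog act decoding stage described above may be sketched as a classifier over the set of semantic concepts produced by a semantic parser. The sketch below uses an ordinary multinomial naïve Bayes model with add-one smoothing, not the extended naïve Bayesian algorithm of the cited work, and it takes the concept sets as given rather than parsing them. The class name, the concept labels (e.g., `SONG`, `POI`), and the dialog acts are hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesDialogActDecoder:
    """Plain multinomial naive Bayes over semantic concept labels."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # add-alpha smoothing constant
        self.act_counts = Counter()              # how often each dialog act occurs
        self.concept_counts = defaultdict(Counter)  # concept counts per dialog act
        self.vocab = set()                       # all concept labels seen in training

    def fit(self, examples):
        """examples: iterable of (set_of_concepts, dialog_act) pairs."""
        for concepts, act in examples:
            self.act_counts[act] += 1
            for concept in concepts:
                self.concept_counts[act][concept] += 1
                self.vocab.add(concept)
        return self

    def decode(self, concepts):
        """Return the dialog act with the highest posterior log-score."""
        total = sum(self.act_counts.values())
        best_act, best_score = None, -math.inf
        for act, count in self.act_counts.items():
            score = math.log(count / total)      # log prior of the dialog act
            denom = (sum(self.concept_counts[act].values())
                     + self.alpha * len(self.vocab))
            for concept in concepts:
                if concept in self.vocab:        # ignore concepts never seen in training
                    num = self.concept_counts[act][concept] + self.alpha
                    score += math.log(num / denom)
            if score > best_score:
                best_act, best_score = act, score
        return best_act

# Hypothetical training data: concept sets paired with dialog acts.
TRAINING = [
    ({"SONG", "ARTIST"}, "play_music"),
    ({"SONG"}, "play_music"),
    ({"POI", "CITY"}, "find_location"),
    ({"POI"}, "find_location"),
]

decoder = NaiveBayesDialogActDecoder().fit(TRAINING)
print(decoder.decode({"ARTIST"}))
print(decoder.decode({"POI", "CITY"}))
```

Because the decoder sees only concept labels, any error made by the upstream semantic parser propagates directly into the dialog act decision, which is one consequence of a staged design of this kind.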
Wutiwiwatchai et al. discuss a three-stage approach, in which the first stage extracts predefined concepts from an input utterance using a weighted finite state transducer. See also, Riley et al., “Transducer composition for context-dependent network expansion,” Proc. Eurospeech '97, Rhodes, Greece, September 1997. This three-stage approach is similar to the statistical parsing approach Nuance has taken, and the primary difference is in weighting different hypotheses. In the second stage, the goal or dialog act of the utterance is identified by a multi-layer neural network. The third stage converts the identified concept word strings into concept values.
Another line of research in spoken language understanding is directed at single level semantic classification. Pioneering work in this direction includes the AT&T HMIHY system, as discussed, for example, in Gorin, et al., “HOW MAY I HELP YOU?”, Speech Communication, vol. 23, pp. 113 to 127, 1997, and Haffner, et al., “Optimizing SVMs for complex Call Classification”, In ICASSP '2003. The single level semantic classification approach is considered to be robust against noise at acoustic and semantic levels. However, the approach only provides flat semantic categories for an input, which may be adequate for limited applications, such as directory assistance, but may be too constrained for general in-depth dialog applications.