Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to the prior art by inclusion in this section.
Spoken Language Understanding (SLU) systems convert language expressed in human speech into a semantic representation that machines can process. SLU is a key component of conversational AI systems. The general tasks of SLU are intent determination and slot filling for an utterance. Intent determination can be treated as a semantic utterance classification problem, while slot filling can be treated as a sequence labeling problem over contiguous words. Previous approaches typically addressed these two related tasks with two separate systems, such as Support Vector Machines (SVMs) for intent determination and Conditional Random Fields (CRFs) for slot filling.
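The two tasks can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that slot labels follow the common IOB tagging convention and that the intent is a single label per utterance. The function name `extract_slots`, the example utterance, and all label names are hypothetical and do not come from this application.

```python
from typing import Dict, List, Optional


def extract_slots(words: List[str], tags: List[str]) -> Dict[str, str]:
    """Group contiguous words that share a slot label into slot values.

    Assumes IOB tags: "B-x" begins slot x, "I-x" continues it, "O" is no slot.
    """
    slots: Dict[str, str] = {}
    current_name: Optional[str] = None
    current_words: List[str] = []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new slot begins; store any slot that was still open.
            if current_name is not None:
                slots[current_name] = " ".join(current_words)
            current_name, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_name == tag[2:]:
            # The open slot continues with this word.
            current_words.append(word)
        else:
            # "O" or an inconsistent tag closes any open slot.
            if current_name is not None:
                slots[current_name] = " ".join(current_words)
            current_name, current_words = None, []
    if current_name is not None:
        slots[current_name] = " ".join(current_words)
    return slots


# Hypothetical utterance: one intent label for the whole utterance,
# one slot tag per word.
words = "show flights from boston to new york".split()
tags = ["O", "O", "O", "B-from_city", "O", "B-to_city", "I-to_city"]
intent = "find_flight"
slots = extract_slots(words, tags)
print(intent, slots)
```

In this sketch the intent is a property of the whole utterance, whereas each slot value is recovered from a run of contiguous labeled words, which is why the first task is framed as classification and the second as sequence labeling.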
Recent advances in neural networks, especially recurrent neural networks (RNNs), allow a single model to be trained jointly for both intent determination and slot filling. This framework has shown advantages over the previous state-of-the-art techniques and has gained much attention in the research community. The success of joint models is attributed to the attention mechanism and the encoder-decoder model. The attention mechanism allows optimized selection of the parts of the input sequence used for decoding, based on both content and location information.
In general, an SLU system is deployed as a downstream component of a spoken dialogue system, where its inputs are the outputs of the front-end Automatic Speech Recognition (ASR) engine. Errors in the word sequences generated by the ASR engine degrade the performance of intent detection and slot filling. In most real-world applications (e.g., far-field speech with noise and reverberation), such errors remain unavoidable even when more robust ASR techniques are deployed.
The real-world performance of slot filling and intent detection generally degrades due to transcription errors generated by the speech recognition engine. Insertion, deletion, and misrecognition errors from the speech recognizer front-end cause misinterpretation and misalignment in the language understanding models. Various error sources including, but not limited to, noisy environments can increase the error rates of even state-of-the-art automated speech recognition systems, and these errors negatively affect the accuracy of SLU systems. Consequently, improvements to methods and systems that increase the accuracy of spoken language understanding systems would be beneficial.
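The three error types named above can be made concrete with a minimal Python sketch that aligns a reference transcript with a recognizer output using a standard word-level edit distance. This is a generic illustration and is not asserted to be a method of this application. The function name `count_asr_errors` and the example transcripts are hypothetical, and misrecognitions are counted as substitutions.

```python
from typing import Dict, List


def count_asr_errors(reference: List[str], hypothesis: List[str]) -> Dict[str, int]:
    """Count substitutions, insertions, and deletions along one minimum-cost
    word alignment of the reference transcript to the recognizer hypothesis."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = minimum edits turning reference[:i] into hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion (reference word dropped)
                dist[i][j - 1] + 1,         # insertion (extra hypothesis word)
            )
    # Trace back through the table to attribute each edit to an error type.
    counts = {"substitutions": 0, "insertions": 0, "deletions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["substitutions"] += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    return counts


# Hypothetical example: "boston" is misrecognized and "new" is dropped.
reference = "show flights from boston to new york".split()
hypothesis = "show flights from austin to york".split()
errors = count_asr_errors(reference, hypothesis)
print(errors)
```

In this hypothetical example the alignment contains one substitution and one deletion. The deleted word shifts every later word by one position, which illustrates how a word-level slot labeling can become misaligned with what the user actually said.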