Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
Spoken Language Understanding (SLU) systems convert language expressed in human speech into a semantic representation that machines can process. SLU is a key component of conversational AI systems. The general tasks of SLU involve intent determination and slot filling from an utterance. The intent determination task can be treated as a semantic utterance classification problem, while the slot filling task can be treated as a sequence labeling problem over contiguous words. Previous approaches typically solved these two related tasks with two separate systems, such as Support Vector Machines (SVMs) for intent determination and Conditional Random Fields (CRFs) for slot filling.
Recent advances in neural networks, especially recurrent neural networks (RNNs), allow a single model to be trained jointly for both intent determination and slot filling. This framework has shown advantages over the previous state-of-the-art techniques and has gained much attention in the research community. The success of joint models is attributed to the attention mechanism and the encoder-decoder model. The attention mechanism allows the decoder to select the most relevant parts of the input sequence at each decoding step, using both content and location information.
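The selection performed by an attention mechanism can be illustrated with a minimal sketch. The sketch assumes dot-product scoring between a decoder state and each encoder state, which is only one of several known scoring functions and is not stated in this section; it captures content-based selection only, not location information. The function names and the example vectors are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return attention weights and the weighted sum of encoder states.

    Assumption: scores are plain dot products, so the decoder state and
    every encoder state must have the same dimension.
    """
    scores = [sum(d * h for d, h in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Hypothetical encoder states, one per input word, and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
```

In this example the first input position receives the largest weight because its encoder state is most similar to the decoder state, so it contributes most to the context vector used for decoding.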
In general, an SLU system is deployed as a downstream component of a spoken dialogue system, where its inputs are the outputs of the front-end Automatic Speech Recognition (ASR) engine. One of the tasks of an SLU system is to assign words that the ASR engine recognizes in the input speech of a user to slots in a slot-filling operation. As used herein, the term “slot” refers to a machine-understandable data field that is filled with one or more input words from natural language input to the SLU system. For example, one set of spoken language input to a home automation system requests activation of a heater. The input includes multiple slots, including a command slot, a slot that indicates the type of device to be activated (e.g., a heater), and another slot that includes a setting for the device (e.g., set the temperature to 40° C.). Once the input words are assigned to slots, an automated system uses the words in each slot to perform additional operations, such as operating components in the home automation system of the example above.
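The heater example can be sketched as a small slot-collection routine. The sketch assumes that a sequence labeler has already assigned one tag per word using the common B-/I-/O tagging convention, which this section does not itself specify; the utterance, the tag sequence, and the slot names `command`, `device`, and `setting` are hypothetical.

```python
def collect_slots(words, tags):
    """Group contiguous tagged words into named slots.

    Assumption: "B-name" starts a slot, "I-name" continues the slot that
    was just started, and "O" marks a word outside any slot.
    """
    slots = {}
    current_name = None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            current_name = tag[2:]
            slots[current_name] = [word]
        elif tag.startswith("I-") and current_name == tag[2:]:
            slots[current_name].append(word)
        else:
            current_name = None  # "O" or an inconsistent tag ends the slot
    return {name: " ".join(parts) for name, parts in slots.items()}

# Hypothetical ASR output and per-word tags for the heater request.
words = ["turn", "on", "the", "heater", "at", "40", "degrees"]
tags = ["B-command", "I-command", "O", "B-device", "O",
        "B-setting", "I-setting"]
slots = collect_slots(words, tags)
# slots == {"command": "turn on", "device": "heater", "setting": "40 degrees"}
```

A downstream automation component could then read the filled fields directly, for example by looking up `slots["device"]` to decide which component to operate.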
The aforementioned example uses a set of enumerable slots in which there is a well-defined number of valid inputs for each slot in a given system (e.g., well-defined sets of valid commands, automation devices, and numeric temperature values), but not all SLU systems can operate with slots that have a well-defined set of enumerable values. Some prior-art SLU systems use machine learning classifiers that are trained using annotated training data to recognize the slots for different words in spoken language input. However, these prior-art SLU systems can have difficulty in performing the slot-filling operation when slots can be filled with words that are not well represented in, or are entirely absent from, the original training data. First, some types of slots may have a large or even unlimited number of possible values, so the classifiers may suffer from the data sparsity problem: the available set of training data is often limited, and even large sets of training data cannot cover a large portion of the correct inputs for some types of slots. Second, unknown slot values (e.g., restaurant and street names) produce out-of-vocabulary words, which are impractical to predefine in the training data and are very common in real-world spoken dialogue applications. Consequently, improvements to methods and systems that increase the accuracy of spoken language understanding systems would be beneficial.
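The out-of-vocabulary problem described above can be illustrated with a minimal sketch. It assumes a word-level vocabulary built only from training utterances and a single placeholder token for unseen words, which is a common practice but is not described in this section; the token `<unk>`, the function names, and all utterances and place names are hypothetical.

```python
UNK = "<unk>"  # hypothetical placeholder for any word unseen in training

def build_vocab(training_utterances):
    """Assign an integer id to every word seen in the training data."""
    vocab = {UNK: 0}
    for utterance in training_utterances:
        for word in utterance.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(utterance, vocab):
    """Map words to ids; every unseen word collapses to the UNK id."""
    return [vocab.get(word, vocab[UNK]) for word in utterance.lower().split()]

def oov_rate(utterance, vocab):
    """Fraction of words in the utterance that are absent from the vocabulary."""
    words = utterance.lower().split()
    return sum(word not in vocab for word in words) / len(words)

training = ["find a restaurant on main street", "book a table at luigi"]
vocab = build_vocab(training)

request = "find a table at tavolino on birchwood avenue"
ids = encode(request, vocab)
rate = oov_rate(request, vocab)
```

In this example the restaurant name and both words of the street name all map to the same placeholder id, so a classifier that sees only these ids cannot distinguish the slot values from one another.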