As shown in FIG. 1, during a dialog between a user 101 and a conventional spoken dialog system 140, an automatic speech recognizer (ASR) 110 processes user speech 102 to provide input 111 to a spoken language understanding (SLU) module 120. The input to the SLU can be in various forms, as is well known in the art. Typically, the input is a sequence of words. The words can have associated probabilities. The SLU extracts semantic information from the input. The semantic information represents intentions 121 of the user as expressed in the speech. The intentions can change as the sequence of words is progressively processed. However, when all the words in the sequence have been processed, a goal, which sums up the intentions, is determined. Based on the goal, a dialog manager (DM) 130 determines a next action 131 to be performed by the spoken dialog system.
Two key tasks in spoken dialog are user intention understanding and user goal estimation. The SLU module extracts the intended meaning (called “intention” hereafter) of the user's speech. The DM determines the next action based on the result of the intentions, i.e., the goal.
The dialog usually includes a sequence of speech from the user and corresponding utterances and actions by the system. Intention and goal estimation takes place over a longer time scale than word understanding. The estimate of the goal can change during the dialog as more information is acquired and the intentions are clarified. Goal estimation performance is important because it can facilitate the user achieving the correct action more quickly.
The goal 121, which represents the user's intended meaning as extracted from the user speech by the SLU module, is the input to the dialog manager 130. The spoken dialog system then determines which action to take next based on the result of the intention understanding. The aim is to complete the dialog, which, in a goal-oriented spoken dialog system, can include multiple user and system utterances and actions.
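The flow from recognized words to intentions, goal, and next action can be illustrated with a minimal sketch. The keyword rules, intention labels, action names, and function names below are hypothetical stand-ins for trained components and are not taken from any described system; the word list stands in for ASR output.

```python
# Hypothetical keyword rules standing in for a trained SLU classifier.
INTENT_KEYWORDS = {
    "book_flight": {"flight", "fly"},
    "book_hotel": {"hotel", "room"},
}

# Hypothetical mapping from an estimated goal to the next system action.
ACTIONS = {
    "book_flight": "ask_departure_city",
    "book_hotel": "ask_check_in_date",
}


def understand(words):
    """SLU stand-in: return the intention estimate after each word."""
    intentions = []
    current = "unknown"
    for word in words:
        for intent, keywords in INTENT_KEYWORDS.items():
            if word in keywords:
                current = intent
        intentions.append(current)
    return intentions


def estimate_goal(intentions):
    """Goal stand-in: the estimate once all words have been processed."""
    return intentions[-1] if intentions else "unknown"


def next_action(goal):
    """DM stand-in: choose the next action from the estimated goal."""
    return ACTIONS.get(goal, "ask_to_repeat")


words = "i want to book a flight".split()  # stand-in for ASR output
intentions = understand(words)
goal = estimate_goal(intentions)
action = next_action(goal)
```

In this sketch the intention estimate stays "unknown" until the word "flight" arrives, which mirrors how the intentions can change as the word sequence is progressively processed.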
Intention understanding is framed as a semantic utterance classification problem, while goal estimation is framed as a classification problem over an entire dialog. Conventional intention understanding and goal estimation can use bag-of-words (BoW) features, or bag-of-intentions features in goal estimation, as inputs to a classification method, such as boosting, support vector machines, and/or logistic regression.
However, one of the problems of applying the BoW features to SLU tasks is that the feature vector tends to be very sparse. Each utterance usually has only a relatively small number of words, unlike the much larger number of words that is typically available during document analysis. Therefore, a BoW feature vector sometimes lacks sufficient semantic information to accurately estimate the user intentions.
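The sparsity problem can be seen in a small sketch. The 16-word vocabulary and the utterance below are made-up examples; a practical vocabulary would contain thousands of words, which makes the fraction of zero entries far larger.

```python
from collections import Counter


def bow_vector(utterance, vocabulary):
    """Return a bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(utterance.lower().split())
    return [counts.get(word, 0) for word in vocabulary]


def sparsity(vector):
    """Return the fraction of entries in the vector that are zero."""
    return sum(1 for value in vector if value == 0) / len(vector)


# Hypothetical toy vocabulary, for illustration only.
VOCABULARY = ["i", "want", "to", "book", "a", "flight", "hotel", "cancel",
              "my", "reservation", "from", "boston", "denver", "tomorrow",
              "today", "please"]

utterance = "book a flight to denver"
vec = bow_vector(utterance, VOCABULARY)
zero_fraction = sparsity(vec)  # 11 of the 16 entries are zero
```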
One of the most successful neural network approaches is based on deep belief networks (DBNs), which can be viewed as a composition of simple, unsupervised networks, such as stacks of restricted Boltzmann machines (RBMs). Parameters for the RBM are used as initial values to estimate neural network parameters by a back propagation procedure. In the DBN context, the first step of determining initial parameters is called pretraining, and the second step of discriminative network training is called fine tuning.
Conventional neural network prediction and training systems are shown in FIGS. 6 and 7, respectively. As shown in FIG. 6 for prediction, a word sequence 610 is input to a network 620, and processed according to network parameters 630 to produce the user intentions and goal 621.
FIG. 7 shows the corresponding training of the network parameters 630 of the network 620 using pretrained network parameters 625 and training sequence 710.
Because of the success of deep neural network (DNN) and DBN training in ASR and image processing, other neural network architectures have been applied to SLU including Deep Convex Network, Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM) RNN.
However, in applying those techniques to SLU, one major difficulty is that often there is insufficient training data for a task, and annotating training data can be time consuming. The performance of a neural network trained in low resource conditions is usually inferior because of overtraining.
Word Embedding
Many natural language processing (NLP) systems use the BoW or a “one-hot word” vector as an input, which leads to feature vectors of extremely large dimension. An alternative is word embedding, which projects the large sparse word feature vector into a low-dimensional, dense vector representation.
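As a minimal sketch of the projection, multiplying a one-hot vector by an embedding matrix selects one row of that matrix. The vocabulary and the 2-dimensional embedding values below are invented for illustration; learned embeddings typically have tens to hundreds of dimensions.

```python
def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1.0 at index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def project(one_hot_vec, embedding_matrix):
    """Multiply a 1 x V one-hot vector by a V x D embedding matrix."""
    dim = len(embedding_matrix[0])
    return [
        sum(one_hot_vec[i] * embedding_matrix[i][d]
            for i in range(len(one_hot_vec)))
        for d in range(dim)
    ]


# Hypothetical vocabulary and made-up embedding values (V = 4, D = 2).
VOCAB = {"flight": 0, "hotel": 1, "book": 2, "cancel": 3}
EMBEDDINGS = [[0.2, -0.1], [0.3, 0.4], [-0.5, 0.1], [0.0, -0.7]]

sparse = one_hot(VOCAB["book"], len(VOCAB))
dense = project(sparse, EMBEDDINGS)  # same values as EMBEDDINGS[2]
```

Because the product only picks out one row, implementations usually replace the matrix multiplication with a direct table lookup.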
There are several model families for learning word vectors, including matrix factorization methods, such as latent semantic analysis (LSA) and Low Rank Multi-View Learning (LR-MVL); the log-bilinear regression model GloVe; and neural network language model (NNLM) based methods that model a local context window, such as Continuous Bag of Words (CBOW) and Skip-gram. Most word vector methods rely on the distance or angle between pairs of word vectors as the primary method for evaluating the intrinsic quality of word representations.
Mikolov et al. use an evaluation scheme based on word analogies, which favors models that produce dimensions of meaning; see Mikolov et al., “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013. “GloVe: Global Vectors for Word Representation” reports results competitive with those of CBOW and Skip-gram on the word analogy task.
Of the above methods, GloVe, CBOW, and Skip-gram are the current state of the art for the word analogy task. GloVe trains on global word-word co-occurrence counts and makes efficient use of global statistics. CBOW predicts the current word based on its context, and Skip-gram predicts the surrounding words given the current word. Mikolov's toolkit ‘word2vec,’ which implements Skip-gram and CBOW, can train on large-scale corpora very efficiently.
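The two pairwise measures mentioned above, angle and distance, can be sketched as follows. The function names are chosen here for illustration, and the angle is expressed through its cosine, which is the form commonly used in practice.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Return the straight-line distance between vectors u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Cosine similarity ignores vector length, so two vectors pointing in the same direction score close to 1 regardless of magnitude, whereas Euclidean distance is sensitive to both direction and length.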
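A word analogy query of the form "a is to b as c is to ?" is commonly answered by vector arithmetic followed by a nearest-neighbor search. The sketch below assumes that approach and uses hand-made 3-dimensional vectors whose dimensions (person, royal, gender) are invented so that the arithmetic is easy to follow; they are not learned embeddings.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates,
               key=lambda w: cosine_similarity(vectors[w], target))


# Hand-made toy vectors: (person, royal, gender). Illustrative only.
VECTORS = {
    "man": [1, 0, 1],
    "woman": [1, 0, -1],
    "king": [1, 1, 1],
    "queen": [1, 1, -1],
    "apple": [-1, 0, 0],
}

answer = solve_analogy("man", "king", "woman", VECTORS)
```

Here the offset king - man + woman equals the vector for "queen" exactly, which is what is meant by a model producing dimensions of meaning.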
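The difference between the two prediction tasks shows up in how training examples are drawn from a context window. The sketch below only generates the examples; it does not train a model and does not reproduce the word2vec implementation. The function names and the window size are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window):
    """Return (current word, surrounding word) pairs for Skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """Return (context words, current word) examples for CBOW."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            examples.append((context, target))
    return examples


tokens = ["book", "a", "flight"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
```

For the middle word "a", Skip-gram yields two separate pairs, ("a", "book") and ("a", "flight"), while CBOW yields a single example that predicts "a" from the combined context ["book", "flight"].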
Latent Topic Models
Latent topic models can discover semantic information from a collection of documents. Topic embedding, widely used in information retrieval, treats a document as a mixture of topics and uses a vector to represent the topic distribution. Conventional latent topic models that have been used for SLU include Probabilistic Latent Semantic Analysis (PLSA), latent Dirichlet allocation (LDA), Correlated Topic Model (CTM), and Pachinko Allocation Model (PAM), all of which use Bayesian inference to determine the distribution of latent topics. Most latent variable models are generative models, which can be used in unsupervised training.
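Treating a document as a mixture of topics means that the probability of a word in the document is the topic-weighted sum of per-topic word probabilities, p(w|d) = Σ_z p(w|z) p(z|d). The sketch below evaluates that sum for fixed, made-up distributions; the topic names and all probability values are hypothetical, and no inference of the distributions is performed.

```python
# Hypothetical per-topic word distributions p(w | z); each sums to 1.
TOPIC_WORD = {
    "travel": {"flight": 0.5, "hotel": 0.4, "price": 0.1},
    "billing": {"flight": 0.1, "hotel": 0.1, "price": 0.8},
}

# Hypothetical topic distribution p(z | d) for one document; this
# vector is the topic embedding of the document.
doc_topics = {"travel": 0.75, "billing": 0.25}


def word_distribution(doc_topics, topic_word):
    """Return p(w | d) as the topic-weighted mixture of p(w | z)."""
    mixture = {}
    for topic, topic_prob in doc_topics.items():
        for word, word_prob in topic_word[topic].items():
            mixture[word] = mixture.get(word, 0.0) + topic_prob * word_prob
    return mixture


p_word = word_distribution(doc_topics, TOPIC_WORD)
```

With these values, p("flight" | d) = 0.75 × 0.5 + 0.25 × 0.1 = 0.4, and the resulting word probabilities again sum to 1.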
LDA performs well on large-scale corpora and can be trained efficiently. However, because the LDA embedding is obtained with an iterative inference procedure, e.g., variational expectation maximization (EM) or a sampling method, it is hard to fine-tune the LDA embedding within a neural network framework.