Language models are indispensable for large-vocabulary continuous speech recognition. The models, which are usually based on n-gram statistics, provide prior probabilities of hypothesized sentences to disambiguate among acoustically similar hypotheses. To construct an n-gram model, text corpora are used to estimate the probability of a word's occurrence conditional on the preceding n−1 words, where n is typically 3 or 4.
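As an illustration of the estimate described above, the following is a minimal sketch of an unsmoothed n-gram model built from counts. The function names `train_ngram` and `ngram_prob`, the padding symbols `<s>` and `</s>`, and the toy corpus are assumptions made for this example; practical systems additionally apply smoothing or back-off, which is omitted here.

```python
from collections import Counter

def train_ngram(sentences, n=3):
    """Count n-grams and their (n-1)-word histories from tokenized sentences."""
    ngram_counts, history_counts = Counter(), Counter()
    for words in sentences:
        # Pad so that the first real word also has a full history (assumed convention).
        padded = ["<s>"] * (n - 1) + list(words) + ["</s>"]
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            ngram_counts[history + (padded[i],)] += 1
            history_counts[history] += 1
    return ngram_counts, history_counts

def ngram_prob(word, history, ngram_counts, history_counts):
    """Unsmoothed estimate P(word | history) = count(history, word) / count(history)."""
    history = tuple(history)
    if history_counts[history] == 0:
        return 0.0  # unseen history; a real model would back off or smooth
    return ngram_counts[history + (word,)] / history_counts[history]

# Toy corpus (hypothetical data, for illustration only).
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
ngrams, histories = train_ngram(corpus, n=3)
p_sat = ngram_prob("sat", ["the", "cat"], ngrams, histories)  # 1 of 2 occurrences
```

The zero returned for an unseen history shows why purely count-based estimates generalize poorly to word sequences that do not occur in the training text.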
On the other hand, continuous space language models based on neural networks have attracted increased attention in recent years. With that approach, word indexes are mapped to a continuous space and word probability distributions are estimated as smooth functions in that space. Consequently, that approach can generalize better to n-grams that were not observed in the training data.
A recurrent neural network language model (RNNLM) is an instance of such continuous space language models. The RNNLM has a hidden layer with re-entrant connections to itself with a one-word delay. Hence, the activations of the hidden units act as a memory that keeps a history from the beginning of the speech. Accordingly, the RNNLM can estimate word probability distributions by taking long-distance inter-word dependencies into account.
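The recurrence described above can be sketched as a single time step of a simple RNNLM. This is an illustrative sketch only: the sigmoid hidden activation, the softmax output, the function names `rnnlm_step` and `run`, and the toy weight matrices are all assumptions, not details taken from the text.

```python
import math

def rnnlm_step(word_id, hidden, W_in, W_rec, W_out):
    """One time step of a simple RNNLM (assumed sigmoid hidden layer, softmax output).

    word_id: index of the current word (treated as a one-hot input)
    hidden:  hidden activations from the previous word (the one-word delay)
    Returns the new hidden vector and a distribution over the next word.
    """
    size = len(hidden)
    new_hidden = []
    for j in range(size):
        # The one-hot input selects row word_id of W_in; W_rec feeds back the old hidden state.
        a = W_in[word_id][j] + sum(W_rec[k][j] * hidden[k] for k in range(size))
        new_hidden.append(1.0 / (1.0 + math.exp(-a)))
    vocab = len(W_out[0])
    logits = [sum(W_out[j][v] * new_hidden[j] for j in range(size)) for v in range(vocab)]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return new_hidden, [e / z for e in exps]

def run(word_ids, W_in, W_rec, W_out):
    """Feed a word sequence from a zero initial state; return the final hidden state and output."""
    hidden, probs = [0.0] * len(W_rec), None
    for w in word_ids:
        hidden, probs = rnnlm_step(w, hidden, W_in, W_rec, W_out)
    return hidden, probs

# Toy parameters (hypothetical): vocabulary of 3 words, 2 hidden units.
W_in = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
W_rec = [[0.4, -0.2], [0.3, 0.1]]
W_out = [[1.0, -1.0, 0.5], [-0.5, 0.7, 0.2]]

h_a, p_a = run([0, 1], W_in, W_rec, W_out)
h_b, p_b = run([2, 1], W_in, W_rec, W_out)
```

Both sequences end with the same word, yet `p_a` and `p_b` differ because the hidden state carries information about the earlier word, which is the memory effect the paragraph describes.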
In addition, Long Short-Term Memory (LSTM) RNNs, a more advanced form of RNNLM, are used in language modeling for speech recognition. By handling the memory with several gating functions, they can characterize longer contextual information than conventional RNNLMs, which improves recognition accuracy.
In most cases, RNNLMs are trained to minimize the cross entropy of the estimated word probabilities against the correct word sequence given a history, which corresponds to maximizing the likelihood of the given training data. However, this training does not necessarily maximize a performance measure in a target task, i.e., it does not explicitly minimize the word error rate (WER) in speech recognition. For n-gram-based language models, several discriminative training methods are known to solve this problem, but those for RNNLMs have been insufficiently investigated so far. In one approach, a hidden activation vector of the RNNLM is added to the feature vector for a log-linear language model. In another, the cross entropy criterion is modified based on a word confidence measure.
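To make the cross entropy criterion concrete, the following minimal sketch computes the average negative log-probability assigned to the reference words. The function name `cross_entropy` and the toy distributions are assumptions for illustration; the per-word averaging is one common convention and is not specified by the text.

```python
import math

def cross_entropy(predicted_dists, reference_ids):
    """Average negative log-probability that the model assigns to the reference words."""
    total = 0.0
    for dist, ref in zip(predicted_dists, reference_ids):
        total -= math.log(dist[ref])
    return total / len(reference_ids)

# Hypothetical model outputs for a two-word reference over a 3-word vocabulary.
dists = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
refs = [0, 1]
loss = cross_entropy(dists, refs)

# Minimizing the loss is the same as maximizing the likelihood of the reference:
likelihood = math.exp(-loss * len(refs))  # equals 0.7 * 0.6
```

The criterion looks only at the probability of the reference words. It contains no term for competing hypotheses or for the number of word errors, which is the gap that discriminative training addresses.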
Discriminative training methods are widely used in speech recognition, where acoustic or language models are trained to optimize their parameters based on a discriminative criterion. Unlike the maximum likelihood approach, those methods can improve discriminative performance of models by taking a set of competing hypotheses for each training sample into account.
In speech recognition, a hypothesis means a word sequence inferred by an ASR system for a given utterance. ASR systems find multiple hypotheses for an input utterance and select the best-scored hypothesis among them, where each hypothesis is scored with its probability obtained by the acoustic and language models. In discriminative training, the multiple hypotheses are usually used to train the models based on a discriminative criterion.
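The selection of the best-scored hypothesis described above can be sketched as follows. The tuple layout, the function name `best_hypothesis`, the `lm_weight` scaling factor, and all scores are assumptions made for this illustration; scores are treated as log-probabilities so that the acoustic and language model contributions add.

```python
def best_hypothesis(nbest, lm_weight=1.0):
    """Return the hypothesis with the highest combined log score.

    nbest: list of (words, acoustic_logprob, lm_logprob) tuples.
    lm_weight: hypothetical scaling factor applied to the language model score.
    """
    return max(nbest, key=lambda h: h[1] + lm_weight * h[2])

# Hypothetical N-best list for one utterance (all scores invented for the example).
nbest = [
    (["recognize", "speech"], -12.0, -3.0),
    (["wreck", "a", "nice", "beach"], -11.5, -7.5),
    (["recognize", "peach"], -12.5, -5.0),
]

with_lm = best_hypothesis(nbest)                    # combined scores: -15.0, -19.0, -17.5
acoustic_only = best_hypothesis(nbest, lm_weight=0.0)  # acoustic scores alone
```

In this toy list the acoustic scores alone favor the second hypothesis, while adding the language model score selects the first. The remaining entries are the competing hypotheses that discriminative training takes into account.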
In language modeling, n-gram probabilities are directly optimized with a minimum classification error criterion, and log-linear language models with n-gram features are trained with a perceptron procedure, reranking boosting, and minimum word error rate training. Because those methods are designed for n-gram models or n-gram-feature-based models, they cannot be used directly for neural network-based language models, including RNNLMs. Another method uses a hidden activation vector of an RNNLM as additional features for a log-linear language model. However, the RNNLM itself is not trained discriminatively.
One discriminative training method for RNNLMs uses, instead of the cross entropy, a likelihood ratio of each reference word to the corresponding hypothesized word. However, that method does not sufficiently exploit the potential of discriminative training, for the following reasons:
It considers only one competitor for each reference word, where the competitor is a hypothesized word in the 1-best ASR result. In general, it is better to consider multiple competitors in discriminative training.
It is not sequence training, because the word-to-word alignment is fixed during training. This means that the inter-dependence of word errors is ignored.
It does not directly minimize the word error rate, which is the ASR performance measure.
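For reference, the word error rate mentioned above is conventionally computed from the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference. The following is a minimal sketch of that computation using a standard edit-distance recursion; the function names `word_errors` and `wer` and the example sentences are assumptions for illustration.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, deletions and insertions (edit distance)."""
    prev = list(range(len(hypothesis) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]  # i deletions are needed against an empty hypothesis
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word errors divided by the reference length."""
    return word_errors(reference, hypothesis) / len(reference)

ref = "the cat sat on the mat".split()
hyp = "the cat sat mat".split()
errors = word_errors(ref, hyp)  # two reference words are missing from the hypothesis
rate = wer(ref, hyp)
```

Because the minimization runs over all alignments of the two word sequences, the error count depends on the hypothesis as a whole, in contrast to a criterion that compares words under a fixed word-to-word alignment.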