The task of language modeling is to estimate the likelihood of a word string. This is fundamental to a wide range of applications such as speech recognition and Asian language text input.
The traditional approach to language modeling uses a parametric model with maximum likelihood estimation (MLE), usually with smoothing methods to deal with data sparseness problems. This approach is optimal under the assumption that the true distribution of data on which the parametric model is based is known. Unfortunately, this assumption rarely holds in realistic applications.
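As an illustration of the traditional approach, the following is a minimal sketch of a bigram model estimated by maximum likelihood with add-k smoothing. The bigram order, the add-k scheme, the toy corpus, and all function names (`train_bigram`, `prob`, `log_likelihood`) are assumptions made for this example; the text above does not specify a particular model or smoothing method.

```python
from collections import Counter
import math

def train_bigram(sentences):
    """Collect history and bigram counts from tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = set()  # tokens that can be predicted (words and the end marker)
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        vocab.update(tokens[1:])
        for prev, cur in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts, vocab

def prob(prev, cur, model, k=1.0):
    """Add-k estimate of P(cur | prev). With k=0 this is the raw MLE,
    which assigns zero probability to any unseen bigram."""
    history_counts, bigram_counts, vocab = model
    return (bigram_counts[(prev, cur)] + k) / (history_counts[prev] + k * len(vocab))

def log_likelihood(words, model, k=1.0):
    """Log-likelihood of a word string under the smoothed bigram model."""
    tokens = ["<s>"] + words + ["</s>"]
    return sum(math.log(prob(p, c, model, k)) for p, c in zip(tokens, tokens[1:]))

# Hypothetical toy corpus, for illustration only.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = train_bigram(corpus)
```

With `k=0`, a bigram such as ("cat", "dog") that never occurs in the corpus receives probability zero, which is the data sparseness problem that smoothing is meant to mitigate; with `k=1` it receives a small positive probability instead.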
An alternative approach to language modeling is based on the framework of discriminative training, which relies on a much weaker assumption: that training and test data are generated from the same distribution, whose form is unknown. Unlike the traditional approach, which maximizes a function (i.e., the likelihood of the training data) that is only loosely associated with the error rate, discriminative training methods ideally aim to minimize the same performance measure used to evaluate the language model, namely the error rate on the training data.
However, this ideal has not been achieved because the error rate over a given finite set of training samples takes only discrete values and is therefore a step function (i.e., a piecewise constant function) of the model parameters, which cannot be easily minimized. To address this problem, previous research has concentrated on developing loss functions that provide a smooth curve approximating the error rate. Such loss functions have theoretically appealing properties, such as convergence and bounded generalization error. However, minimizing a loss function instead of the error rate means that such systems optimize a performance measure different from the one used to evaluate the system in which the language model is applied. As a result, training the language model to optimize the loss function does not guarantee that the language model will produce the minimum number of errors in realistic applications.
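The contrast between the step-shaped error rate and a smooth approximation can be sketched as follows. The example assumes a hypothetical one-parameter scoring function (a base score plus a weight `lam` times a single feature) and uses a sigmoid of the score margin as the smooth loss. The toy data, the function names, and the choice of sigmoid are all assumptions made for illustration and are not taken from the research referred to above.

```python
import math

# Hypothetical toy data: each training sample is a list of candidate
# word strings, given as (base_score, feature_value, is_correct).
SAMPLES = [
    [(1.0, 0.0, False), (0.5, 1.0, True)],
    [(2.0, 0.0, False), (0.5, 1.0, True)],
    [(0.0, 1.0, False), (1.0, 0.0, True)],
]

def score(cand, lam):
    """Linear score: base score plus lam times the feature value."""
    base, feat, _ = cand
    return base + lam * feat

def error_count(lam):
    """Number of samples whose top-scoring candidate is not the correct one.
    This is a piecewise constant (step) function of lam."""
    errors = 0
    for cands in SAMPLES:
        best = max(cands, key=lambda c: score(c, lam))
        if not best[2]:
            errors += 1
    return errors

def smooth_loss(lam, steepness=5.0):
    """Sigmoid-smoothed stand-in for the error count (illustrative choice).
    Each sample contributes a value between 0 and 1 that varies
    continuously with the margin between correct and incorrect candidates."""
    total = 0.0
    for cands in SAMPLES:
        correct = max(score(c, lam) for c in cands if c[2])
        wrong = max(score(c, lam) for c in cands if not c[2])
        margin = correct - wrong
        total += 1.0 / (1.0 + math.exp(steepness * margin))
    return total
```

In this sketch, `error_count` returns the same value for every `lam` between two adjacent jump points, so it gives a search procedure no local indication of which direction to move, whereas `smooth_loss` changes continuously with `lam`. The value of `lam` that minimizes `smooth_loss` is not guaranteed to minimize `error_count`, which is the mismatch described above.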