Natural language understanding (NLU) refers to technology that allows computers to understand, or derive meaning from, written human language. In general, an NLU system determines meaning from text. The meaning, and potentially other information extracted from the text, can be provided to other systems. For example, an NLU system used for an airline can be trained to recognize, from received text, user intentions such as making a reservation, cancelling a reservation, or checking the status of a flight. The text provided to the NLU system as input can be obtained from a speech recognition system, keyboard entry, or some other mechanism. The NLU system determines the meaning of the text and typically provides the meaning, or user intention, to one or more other applications. The meaning can drive business logic, effectively triggering a programmatic function that corresponds to the meaning. For example, responsive to a particular meaning, the business logic can initiate a function such as creating a reservation or cancelling a reservation.
A classifier functions as part of an NLU system. At runtime, the classifier receives a text input and determines the one class, of a plurality of classes, to which the text input belongs. The classifier utilizes a statistical classification model (statistical model) to classify the text input. Each class corresponds to, or indicates, a particular meaning. For example, a text input such as “I would like to book a flight” can be classified into a class for “making a reservation.” This class, and possibly other information extracted from the text input, can be passed along to another application that performs the corresponding action.
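The runtime flow described above, in which a text input is mapped to a class and the class then triggers business logic, can be sketched as follows. This is only an illustration under stated assumptions: the class labels, the keyword table standing in for a statistical model, and the handler functions are all hypothetical and are not taken from any particular NLU system.

```python
from typing import Callable, Dict, Set

# Hypothetical class labels; each class indicates one meaning.
MAKE_RESERVATION = "make_reservation"
CANCEL_RESERVATION = "cancel_reservation"
FLIGHT_STATUS = "flight_status"

# Assumption: a keyword table stands in for the statistical model.
KEYWORDS: Dict[str, Set[str]] = {
    MAKE_RESERVATION: {"book", "reserve"},
    CANCEL_RESERVATION: {"cancel"},
    FLIGHT_STATUS: {"status", "delayed"},
}


def classify(text: str) -> str:
    """Return the class whose keywords overlap most with the text input.

    Ties are resolved in favor of the class listed first in KEYWORDS.
    """
    tokens = set(text.lower().split())
    return max(KEYWORDS, key=lambda label: len(KEYWORDS[label] & tokens))


# Hypothetical business logic: one programmatic function per meaning.
HANDLERS: Dict[str, Callable[[], str]] = {
    MAKE_RESERVATION: lambda: "creating reservation",
    CANCEL_RESERVATION: lambda: "cancelling reservation",
    FLIGHT_STATUS: lambda: "looking up flight status",
}


def handle(text: str) -> str:
    """Classify the text input and run the function tied to its class."""
    return HANDLERS[classify(text)]()


if __name__ == "__main__":
    print(handle("I would like to book a flight"))
```

The dispatch table keeps the classifier and the business logic separate, so the stand-in `classify` function could be replaced by a model-backed classifier without changing the handlers.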
The statistical model used by the classifier is generated from a corpus of training data. The corpus of training data can be formed of text, feature vectors, sets of numbers, or the like. Typically, the training data is tagged or annotated to indicate meaning, and the statistical model is built from the annotated training data. Often, training data includes one or more outlier portions of text. “Outlier text,” or simply an “outlier,” refers to a portion of text that specifies a less common, or less orthodox, way of expressing an intention or meaning in a written human language.
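As a rough illustration of building a statistical model from annotated training data, the sketch below trains a small multinomial naive Bayes model in plain Python. Naive Bayes is chosen here only as a convenient example of a statistical model; the text above does not name a specific technique. The corpus, the class labels, and the function names `train`, `log_scores`, and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical annotated corpus: each text is tagged with its meaning.
TRAINING_DATA = [
    ("i would like to book a flight", "make_reservation"),
    ("book me a ticket to boston", "make_reservation"),
    ("please cancel my reservation", "cancel_reservation"),
    ("cancel my flight", "cancel_reservation"),
    ("is my flight on time", "flight_status"),
    ("what is the status of flight 12", "flight_status"),
]


def train(data):
    """Count classes and per-class words; the counts are the model."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in data:
        class_counts[label] += 1
        for token in text.lower().split():
            word_counts[label][token] += 1
            vocab.add(token)
    return {"class_counts": class_counts, "word_counts": word_counts, "vocab": vocab}


def log_scores(model, text):
    """Return an unnormalized log posterior for every class."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, doc_count in model["class_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + vocab_size  # add-one smoothing
        score = math.log(doc_count / total_docs)   # class prior
        for token in text.lower().split():
            if token in model["vocab"]:            # skip unseen words
                score += math.log((counts[token] + 1) / denom)
        scores[label] = score
    return scores


def classify(model, text):
    """Return the highest-scoring class for the text input."""
    scores = log_scores(model, text)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    model = train(TRAINING_DATA)
    print(classify(model, "I would like to book a flight"))
```

An outlier in this setting would be an annotated example whose wording differs markedly from the other examples of its class, which shifts the per-class word counts that the model relies on.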
Both outliers and non-outliers must be reliably processed by a classifier. Accordingly, outliers are commonly included within training data in an effort to adequately train the statistical model. Conventional techniques for generating statistical models, however, do not handle outliers efficiently or accurately. Often, the inclusion of outliers within training data does not lead to a statistical model that can reliably classify outliers. Moreover, the resulting statistical model, in many cases, classifies non-outlier text input with less certainty. For example, the confidence score associated with a classification result for a non-outlier is typically lower than otherwise expected. Generally, a confidence score indicates the likelihood that the class determined for a given text input by the classifier using the statistical model is correct.
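One common way to obtain a confidence score of the kind described above is to normalize the per-class scores produced by the statistical model so that they sum to one, and to report the share held by the winning class. The sketch below assumes the model emits log scores per class; the function name `confidence` and the example scores are hypothetical.

```python
import math
from typing import Dict, Tuple


def confidence(log_scores: Dict[str, float]) -> Tuple[str, float]:
    """Return (best_class, confidence) from per-class log scores.

    The scores are shifted by their maximum before exponentiation for
    numerical stability, then normalized so that they sum to one.
    """
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / total


if __name__ == "__main__":
    # Hypothetical scores: one clear winner versus a nearly flat spread.
    clear = {"a": math.log(0.6), "b": math.log(0.3), "c": math.log(0.1)}
    flat = {"a": math.log(0.4), "b": math.log(0.35), "c": math.log(0.25)}
    print(confidence(clear))
    print(confidence(flat))
```

When the scores are spread more evenly across classes, the winning class holds a smaller share of the total, which corresponds to the lower confidence scores described above.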