Various mechanisms are used to classify data, including linear classifiers, n-gram classifiers and maximum entropy (MaxEnt) models. In general, a linear classifier models input data as a vector of features, and computes the dot product of that vector with the feature-weight vector of each classification class. The class whose weight vector yields the highest dot product is selected as the target class. A vector space model uses a similarity measure to compare two vectors; often one represents a query and the other represents a document. The similarity measure is computed from the angular relationship (the normalized dot product, or cosine value) between the two vectors. The document vector having the smallest angle (that is, the highest cosine value) with respect to the query vector is considered the best match. If each document is viewed as a class, then the vector space model can be viewed as a linear classification model.
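The two scoring rules described above can be illustrated with a minimal sketch. The feature vectors, class names, and document names are hypothetical toy values chosen for illustration, and the helper names (`linear_classify`, `best_document`) are assumptions rather than part of any described system.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_classify(features, class_weights):
    """Return the class whose weight vector gives the highest dot product."""
    return max(class_weights, key=lambda c: dot(features, class_weights[c]))

def cosine(u, v):
    """Normalized dot product (cosine of the angle between u and v)."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def best_document(query, documents):
    """Return the document whose vector has the highest cosine with the query."""
    return max(documents, key=lambda d: cosine(query, documents[d]))

# Hypothetical toy data: three features, two classes.
class_weights = {"sports": [2.0, 0.0, 1.0], "finance": [0.0, 3.0, 0.5]}
features = [1.0, 0.0, 1.0]
predicted = linear_classify(features, class_weights)

# Hypothetical toy data: each document is treated as its own class.
documents = {"doc_a": [1.0, 1.0, 0.0], "doc_b": [0.0, 1.0, 1.0]}
query = [2.0, 2.0, 0.0]
match = best_document(query, documents)
```

Because the cosine value normalizes both vectors, the query above matches `doc_a` even though its magnitude differs from that of the document vector; only the direction matters.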
An n-gram (e.g., bigram, trigram and so forth) classifier is another type of linear classifier. Given a query, the n-gram language model of each classification class computes the probability of the query under that class, and the n-gram classifier selects the classification class whose language model assigns the query the highest probability.
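As a rough illustration of this selection rule, the following sketch builds one bigram model per class and picks the class whose model gives the query the highest log-probability. The add-one smoothing, the sentence boundary markers, the shared vocabulary size, and the toy training sentences are all assumptions made for the example; the text above does not specify them.

```python
import math
from collections import Counter

class BigramModel:
    """Bigram language model for one class (illustrative only)."""

    def __init__(self, sentences):
        self.bigrams = Counter()   # counts of (previous word, word) pairs
        self.contexts = Counter()  # counts of previous words
        self.vocab = {"</s>"}
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.vocab.update(tokens[1:])
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, query, vocab_size):
        """Log-probability of the query, with add-one smoothing (an assumption)."""
        tokens = ["<s>"] + query.split() + ["</s>"]
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.contexts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total

def classify(query, models):
    """Select the class whose bigram model gives the query the highest probability."""
    vocab = set().union(*(m.vocab for m in models.values()))
    vocab.update(query.split())
    size = len(vocab)  # shared vocabulary size keeps the class scores comparable
    return max(models, key=lambda c: models[c].log_prob(query, size))

# Hypothetical training sentences for two classes.
models = {
    "weather": BigramModel(["will it rain today", "is it sunny today"]),
    "music": BigramModel(["play some jazz", "play the next song"]),
}
weather_label = classify("will it rain tomorrow", models)
music_label = classify("play some jazz", models)
```

Working in log-probabilities avoids numerical underflow on longer queries and does not change which class is ranked highest.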
Maximum entropy models are generally more accurate than vector space or n-gram models with respect to classification. Maximum entropy models have been used in many spoken language tasks, and also may be used for other tasks such as query classification. The training of a maximum entropy model typically involves an iterative procedure that starts with a flat initialization (all parameters set to zero) or a random initialization of the model parameters, and uses training data to gradually update the parameters to optimize an objective function. Because the objective function for maximum entropy models is convex, the training procedure converges, in theory, to a global optimum.
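The iterative procedure can be sketched as follows, assuming plain batch gradient ascent on the conditional log-likelihood as the update rule; the text does not name a particular optimizer, and other iterative schemes are possible. The learning rate, iteration count, and toy data are hypothetical.

```python
import math

def softmax(scores):
    """Convert per-class scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_probs(weights, x):
    """Class probabilities: softmax of each class's dot product with x."""
    return softmax([sum(w * xi for w, xi in zip(wc, x)) for wc in weights])

def log_likelihood(weights, data):
    """Objective function: sum of log-probabilities of the correct classes."""
    return sum(math.log(class_probs(weights, x)[y]) for x, y in data)

def train_maxent(data, num_classes, num_features, iterations=200, lr=0.5):
    # Flat initialization: all parameters are set to zero.
    weights = [[0.0] * num_features for _ in range(num_classes)]
    for _ in range(iterations):
        grad = [[0.0] * num_features for _ in range(num_classes)]
        for x, y in data:
            probs = class_probs(weights, x)
            for c in range(num_classes):
                # Observed minus expected feature value for class c.
                err = (1.0 if c == y else 0.0) - probs[c]
                for j, xj in enumerate(x):
                    grad[c][j] += err * xj
        for c in range(num_classes):
            for j in range(num_features):
                weights[c][j] += lr * grad[c][j] / len(data)
    return weights

def predict(weights, x):
    """Index of the most probable class for feature vector x."""
    probs = class_probs(weights, x)
    return probs.index(max(probs))

# Hypothetical toy data: (features, class index); the first feature is a bias term.
data = [([1.0, 1.0, 0.0], 0), ([1.0, 1.0, 0.0], 0),
        ([1.0, 0.0, 1.0], 1), ([1.0, 0.0, 1.0], 1)]
flat = [[0.0] * 3 for _ in range(2)]
trained = train_maxent(data, num_classes=2, num_features=3)
```

Comparing `log_likelihood(trained, data)` with `log_likelihood(flat, data)` shows the objective improving from its value at the flat starting point.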
In practice, convergence is defined empirically; for example, training is deemed to have converged when the difference between the values of the objective function in two successive training iterations is smaller than a threshold. Therefore, it is not guaranteed that the model converges at the actual global optimum. Furthermore, the model's training often needs to end early, before convergence, to avoid over-training/over-fitting (e.g., giving too much weight to a mostly irrelevant term). As a result, different model parameter initializations result in different model parameterizations at the end of the training procedure, and hence in different classification accuracies. It is also common practice to add prior distributions with hyper-parameters to the objective function to prevent over-fitting.
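A minimal sketch of these practical points follows. It assumes the two-class case (where the maximum entropy model reduces to logistic regression), gradient ascent as the update rule, and a zero-mean Gaussian prior whose variance `sigma2` plays the role of the hyper-parameter. The tolerance, learning rate, and toy data are hypothetical values chosen for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def objective(w, data, sigma2):
    """Log-likelihood plus the log of a Gaussian prior (up to a constant)."""
    ll = 0.0
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        ll += math.log(p if y == 1 else 1.0 - p)
    return ll - sum(wi * wi for wi in w) / (2.0 * sigma2)

def train(data, w0, sigma2=10.0, lr=0.1, tol=1e-4, max_iters=5000):
    """Gradient ascent that stops when the objective changes by less than tol."""
    w = list(w0)
    prev = objective(w, data, sigma2)
    for iteration in range(1, max_iters + 1):
        grad = [-wi / sigma2 for wi in w]  # gradient of the prior term
        for x, y in data:
            err = y - sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi + lr * g for wi, g in zip(w, grad)]
        cur = objective(w, data, sigma2)
        if abs(cur - prev) < tol:  # empirical convergence test
            return w, iteration
        prev = cur
    return w, max_iters

# Hypothetical toy data: (features, label); the first feature is a bias term.
data = [([1.0, 1.0, 0.0], 1), ([1.0, 1.0, 0.0], 1),
        ([1.0, 0.0, 1.0], 0), ([1.0, 0.0, 1.0], 0)]

w_flat, iters_flat = train(data, [0.0, 0.0, 0.0])  # flat initialization
random.seed(0)
w_rand, iters_rand = train(data, [random.uniform(-1.0, 1.0) for _ in range(3)])
```

Because the stopping rule fires before the exact optimum is reached, `w_flat` and `w_rand` end up as different parameter vectors even though both runs optimize the same convex objective on the same data.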
When sufficient training data are available, maximum entropy models are more accurate; training a maximum entropy model, however, requires a considerable amount of labeled training data. When training data are sparse, vector space models are more robust.