Classification problems, in which a model learns discrete-valued outputs, occur in many applications. Classifying web pages into different classes is important for many web-based operations. For example, in a search-based operation, web page classification can significantly improve the relevance of the results.
An important aspect of a classification system or model is the training and refinement of the model itself. Often, the number of training examples is not uniform across classes, which yields an imbalanced training dataset. Imbalanced training sets make it very difficult to test or refine the model because the imbalance can mask model shortcomings.
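The masking effect can be illustrated with a minimal sketch. The 95/5 class split, the always-majority predictor, and the helper names `accuracy` and `recall` are hypothetical choices made for illustration; they are not taken from any existing system.

```python
# Hypothetical toy data: 95 examples of class 0 and 5 examples of class 1.
labels = [0] * 95 + [1] * 5

# A degenerate "classifier" that always predicts the majority class.
predictions = [0] * len(labels)

def accuracy(y_true, y_pred):
    # Fraction of examples whose predicted label matches the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    # Fraction of truly positive examples that were predicted positive.
    true_pos = sum(t == positive and p == positive
                   for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0

acc = accuracy(labels, predictions)  # high, despite a useless model
rec = recall(labels, predictions)    # zero: no minority example is found
print(f"accuracy={acc:.2f} recall={rec:.2f}")
```

Under these assumptions, the accuracy looks strong even though the model never identifies a single minority-class example, which is the kind of shortcoming an imbalanced dataset can hide.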
Existing techniques include using Gaussian process (GP) models, which are flexible, powerful and easy to implement. In a Bayesian GP setup, latent function values and hyperparameters involved in modeling are integrated based on prior calculations. However, the required integrals are often not analytically tractable, and closed-form analytic expressions are not available.
GP model selection is therefore a problem that typically takes the form of choosing the hyperparameters that define the model. In existing systems, the choice is made by optimizing a well-defined objective function over the hyperparameters. Two commonly used approaches are maximization of the marginal likelihood (or evidence) and minimization of the average negative logarithmic predictive probability (NLP) based on leave-one-out cross-validation (LOO-CV). In these approaches, the marginal likelihood is optimized with gradient information using Laplace or Expectation Propagation (EP) approximations. In one approximation technique, an Expectation-Maximization approach for determining hyperparameters, EP is utilized to estimate the joint density of the latent function values, and the hyperparameters are optimized by maximizing a variational lower bound on the marginal likelihood.
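As a rough illustration of the LOO-CV based average NLP objective mentioned above, the following sketch holds out each example in turn and averages the negative log of the probability assigned to its true label. The function names `loo_average_nlp` and `smoothed_frequency` are hypothetical, and the stand-in model is a Laplace-smoothed class frequency that ignores input features; it is not a GP and is used only to keep the example self-contained.

```python
import math

def loo_average_nlp(labels, predict_prob):
    """Average negative log predictive probability under leave-one-out CV.

    predict_prob(train_labels, held_out_label) is assumed to return the
    probability the model assigns to the held-out label after being
    trained without that example.
    """
    total = 0.0
    for i, y in enumerate(labels):
        train = labels[:i] + labels[i + 1:]  # leave example i out
        total += -math.log(predict_prob(train, y))
    return total / len(labels)

def smoothed_frequency(train, y):
    # Stand-in model (NOT a GP): Laplace-smoothed frequency over two classes.
    return (train.count(y) + 1) / (len(train) + 2)

# Hypothetical imbalanced toy labels: three of class 0, one of class 1.
labels = [0, 0, 0, 1]
nlp = loo_average_nlp(labels, smoothed_frequency)
print(f"average LOO NLP = {nlp:.4f}")
```

In a GP setting, the stand-in predictor would be replaced by the approximate LOO predictive distribution of the GP classifier, and the resulting value would be minimized over the hyperparameters.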
Existing techniques for generating classifier models focus on measures such as marginal likelihood and average negative logarithmic predictive probability. These techniques, in conjunction with LOO-CV, fail to utilize other existing measures, as these measures are not typically applied to the classification of web-based content. Moreover, the existing methods of classifier model selection are very indirect, and the existing solutions do not account for imbalanced problems. As such, there exists a need for a technique for selecting a classifier model using LOO-CV, whereby the classifier model can account for an imbalanced dataset and hence can classify web content with an improved degree of accuracy.