Large databases and sources such as the World Wide Web are making very large amounts of data available to many people. Usually this data is not explicitly labeled. For a task such as document categorization, it would be useful to take advantage of the large amount of unlabeled data that is available, in particular to help improve the performance of classification and categorization systems. For example, one may wish to classify documents found on the web into one of several categories, including ‘none of the above’.
Various models have been applied to data categorization. Graphical models, for instance, provide a powerful framework for approaching machine learning problems and can address many data classification problems. Two common examples are probabilistic graphical models and semi-supervised learning (SSL) on graphs, which can be referred to as Gaussian SSL. Graphs have also been used as a general representation of preference relations in ranking problems, and they play a role in various approaches to dimensionality reduction. Probabilistic graphical models such as Bayes nets write a probability distribution as a product of conditionals, which exist on nodes, where arcs between nodes encode conditional independence assumptions. Gaussian SSL is generally more closely related to random walks on networks: each arc encodes the similarity between the nodes at its endpoints, and the goal is to use neighborhood structure to guide the choice of classification (or regression, clustering, or ranking) function.
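As an illustration only, the neighborhood-averaging idea behind SSL on graphs can be sketched in a few lines of Python. The sketch assumes a small, symmetric similarity matrix and uses a simple iterative scheme in which each unlabeled node repeatedly takes the similarity-weighted average of its neighbors' scores while labeled nodes are held fixed. The function name `propagate_labels`, its parameters, and the toy chain graph are hypothetical and are not taken from any particular system described here.

```python
from typing import Dict, List


def propagate_labels(
    weights: List[List[float]],
    labeled: Dict[int, float],
    iterations: int = 1000,
) -> List[float]:
    """Iteratively average neighbor scores over a similarity graph.

    weights[i][j] is the (assumed symmetric, non-negative) similarity on the
    arc between nodes i and j. labeled maps a node index to its known score;
    those scores are held fixed throughout.
    """
    n = len(weights)
    # Start unlabeled nodes at a neutral score of 0.5 (an arbitrary choice).
    scores = [labeled.get(i, 0.5) for i in range(n)]
    for _ in range(iterations):
        for i in range(n):
            if i in labeled:
                continue  # labeled nodes keep their given score
            total = sum(weights[i])
            if total == 0.0:
                continue  # isolated node: no neighbors to learn from
            # Similarity-weighted average of the current neighbor scores.
            scores[i] = sum(w * s for w, s in zip(weights[i], scores)) / total
    return scores


# Hypothetical toy graph: a chain 0 - 1 - 2 - 3 - 4 with unit similarities.
chain = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
# Node 0 is labeled 1.0 and node 4 is labeled 0.0; nodes 1-3 are unlabeled.
scores = propagate_labels(chain, {0: 1.0, 4: 0.0})
```

On this chain the unlabeled scores settle close to 0.75, 0.5, and 0.25, so nodes nearer the positively labeled endpoint receive higher scores. A practical system would typically solve the corresponding linear system directly on a sparse graph instead of iterating over a dense matrix.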
For Gaussian SSL models, the probabilistic interpretation is somewhat indirect; for graphical models, it is central. With both current probabilistic graphical models and semi-supervised learning approaches, problems can arise in determining data classifications or categorizations accurately and efficiently. These can include problems processing asymmetrical data sets, problems dealing with small or reduced training sets, and computational complexity that results in processing inefficiency.