Automated categorization of entities into homogeneous classes based on their characteristics is a critical task encountered in numerous predictive and diagnostic applications such as web page analysis, behavioral ad targeting of users, image recognition, medical diagnosis, and weather forecasting. For example, classification of web objects (such as images and web pages) is a task that arises in many application domains of online service providers.
Classification of web pages is an important challenge. For example, classifying shopping-related web pages into classes like product or non-product is important. Such classification is very useful for applications like information extraction and search. Similarly, classification of images in an image corpus (such as the one maintained by the online “flickr” service, provided by Yahoo Inc. of Sunnyvale, Calif.) into various classes is very useful.
Due to its ubiquitous nature, classification has been an important topic of study in the statistics, pattern recognition, AI, and machine learning communities. The standard formulation of the classification problem involves identifying a mapping f: X → Y between the observed properties of an entity x ∈ X and its class y ∈ Y, given a collection of labeled entities following the same distribution. Over the last few decades, a number of techniques have been proposed to address this standard classification problem and variants arising due to factors such as the nature of classes (hierarchical/multi-label classification), the amount of supervision (semi-supervised/constrained classification), the space of functional mappings (decision trees, neural networks, linear classifiers, kernel classifiers), the quality of classification (misclassification error, area under ROC curve, log-likelihood, margin), and the mode of operation (batch/incremental/inductive/transductive).
Among the classification techniques described above, the most common ones are inductive classifiers that learn a “classification model” or “classifier”, which can be deployed for labeling new entities. In spite of the huge diversity in the specific details of the learning algorithms, these inductive classifiers share a high-level paradigm for building classifiers, or in other words, a classifier development life cycle, that involves collecting labeled entities, designing and selecting features (or kernels) that capture the salient properties of the entities, learning a classification model over these features, and validating the learned model over a held-out set. Unfortunately, most of the research so far has focused only on a single phase of this development life cycle, i.e., learning the classification model, mostly because it is well-defined and does not involve significant human interaction. The other aspects of classifier development, i.e., data collection, feature design and validation, are relatively underexplored in academic research, and in practice, are predominantly human driven and executed in a trial-and-error manner that can be fairly tedious. For instance, creating a document (Y!Shopping) or image (Flickr) classifier typically involves multiple human participants, each focusing on one of the sub-tasks involved in classifier development, i.e., creating the document/image corpus to be labeled, providing labels on the document/image corpus, designing relevant features, and training, fine-tuning and evaluating the classifier. Each of these sub-tasks is repeated a large number of times in an ad hoc fashion with hardly any transfer of information across tasks, resulting in a lot of redundancy and requiring a long time to build a satisfactory classifier.
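As an illustration only, the four phases of the life cycle described above (collecting labeled entities, designing features, learning a model, and validating on a held-out set) can be sketched in Python. The toy corpus, the bag-of-words features, the word-overlap scoring rule, and all function names are assumptions chosen for brevity; none of them comes from the source.

```python
from collections import Counter, defaultdict

# Phase 1: data collection. Hypothetical labeled web-page snippets.
LABELED = [
    ("buy now add to cart price $19.99", "product"),
    ("add to cart free shipping price $5", "product"),
    ("price $250 in stock buy now", "product"),
    ("our company history and mission", "non-product"),
    ("contact us for press inquiries", "non-product"),
    ("read our blog about company news", "non-product"),
    ("buy this item price $42 add to cart", "product"),
    ("about us careers and company news", "non-product"),
]


def extract_features(text):
    """Phase 2: feature design. Here, simple bag-of-words counts."""
    return Counter(text.lower().split())


def train(examples):
    """Phase 3: learning. Accumulate per-class word counts."""
    totals = defaultdict(Counter)
    for text, label in examples:
        totals[label].update(extract_features(text))
    return dict(totals)


def predict(model, text):
    """Return the class whose word counts overlap most with the text."""
    feats = extract_features(text)

    def score(label):
        return sum(n * model[label][word] for word, n in feats.items())

    # sorted() makes tie-breaking deterministic.
    return max(sorted(model), key=score)


def evaluate(model, examples):
    """Phase 4: validation. Fraction of held-out examples labeled correctly."""
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)


# Hold out the last two examples for validation.
train_set, holdout_set = LABELED[:6], LABELED[6:]
model = train(train_set)
accuracy = evaluate(model, holdout_set)
print(f"held-out accuracy: {accuracy:.2f}")
```

In practice, each phase in this sketch would be revisited many times (new labels, new features, retraining, re-evaluation), which is the iteration cost this section describes.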
Furthermore, the traditional form of input to the learning process, consisting of labeled instances each represented as a vector of pre-defined features, is quite inefficient and ends up requiring a large number of iterations with different training samples and feature configurations. As a consequence, the overall classifier development process is quite time-consuming in spite of the availability of fast learning algorithms.