Machine learning defines models that can be used to predict occurrence of an event, for example, from sensor data or signal data, or recognize/classify an object, for example, in an image, in text, in a web page, in voice data, in sensor data, etc. Machine learning algorithms can be classified into three categories: unsupervised learning, supervised learning, and semi-supervised learning. Unsupervised learning does not require that a target (dependent) variable y be labeled in training data to indicate occurrence or non-occurrence of the event or to recognize/classify the object. An unsupervised learning system predicts the label, target variable y, in training data by defining a model that describes the hidden structure in the training data. Supervised learning requires that the target (dependent) variable y be labeled in training data so that a model can be built to predict the label of new unlabeled data. A supervised learning system discards observations in the training data that are not labeled. While supervised learning algorithms are typically better predictors/classifiers, labeling training data often requires a physical experiment or a statistical trial, and human labor is usually required. As a result, it may be very complex and expensive to fully label an entire training dataset. A semi-supervised learning system only requires that the target (dependent) variable y be labeled in a small portion of the training data and uses the unlabeled training data in the training dataset to define the prediction/classification (data labeling) model.
Traditional active learning methods focus on querying and selecting individual samples while ignoring a structure of the data and interactions between portions of the data. In the real world, classification problems are associated with a need to label data that is structured or organized in a hierarchical way. For example, in web page classification, photos and statements associated with the photos may be grouped together. In social media, people are grouped together by certain relationships or interests. In business, multiple people and events may be involved in an issue. Using traditional active learning methods, it is very difficult to query an isolated sample while ignoring the interactions and structures of the whole group. Thus, it is more desirable to query and select based on a performance measure of a group structure.