The following relates to the information management arts, information classification and retrieval arts, data mining arts, prediction arts, and related arts.
Multi-task machine learning entails learning to predict multiple related tasks based on a common training set annotated with the multiple tasks. The resulting trained multi-task classifier or regression model finds application in numerous fields, ranging from the prediction of tests scores in social sciences, the classification of protein functions in systems biology, the categorisation of scenes in computer vision, database (e.g., Internet) search and ranking, and so forth. In many such applications, multiple related target variables (i.e., tasks) are to be predicted from a common set of input features.
To further illustrate, consider a multi-task learning problem in which it is desired to assign an image to one or more classes based on image features extracted from the image. Typically, the image features are arranged as a features vector which serves as input to a trained multi-task classifier. In this case each task corresponds to a class and decides whether the image belongs to that class, so that (by way of example) if there are twenty classes then the classifier has twenty tasks each outputting a binary value indicating whether the image should be assigned to the corresponding class. Advantageously, the multi-task classifier enables a single image to be assigned to multiple classes, where appropriate. In this case, one could envision independently training twenty single-task classifiers and applying them in parallel to perform the multi-task classification. However, this approach would lose any information that might be mined from correlations between tasks.
A closely related but more difficult multi-task learning problem is to label an image with textual keywords based on the image features extracted from the image. The problem can be made equivalent to the image classification problem by defining each available textual keyword as a “class”. However, this problem is more challenging because the number of “classes” is now equal to the vocabulary size which is typically quite large. This problem highlights the value of capturing correlations between the tasks, since keywords that are synonyms or otherwise positively related are likely to have high positive labeling correlation. For example, an image that is labeled with the keyword “flower” is more likely to be also appropriately labeled with the keyword “nature scene”. Negative correlations are also useful. For example, the image labeled “flower” is less likely to also be appropriately labeled “industrial scene”.
Thus, in multitask learning problems it would be useful to leverage task correlations (both positive and negative) in learning the multi-task classifier or regression model. However, attempting to simultaneously learn multiple tasks in a way that leverages correlations between tasks is difficult, and typically becomes problematically or prohibitively computationally intensive.
Additionally, in any learning problem it would be useful to integrate feature selection into the modeling, that is, to emphasize features that are highly discriminative while limiting or eliminating less descriminative features from consideration. A common approach is to apply feature reduction in which the discriminativeness of features is quantified and less discriminative features are discarded. In the multi-task setting, however, it is difficult to quantify the discriminativeness of a feature on a global scale. For example, a feature may be largely irrelevant for most tasks but highly discriminative for a few tasks.
One way to formulate a multitask problem is a matrix formulation in which the model is applied to the feature vector to generate the task predictions. Mathematically, this can be written as:yn=f(Wxn+μ)+εn  (1)where xnεD is an input feature vector having D features, ynεP is the vector of P task predictions, εn˜(0,Σ), f( . . . ) is a (possibly nonlinear) function, and where WεP×D is the matrix of weights, μεP is the task offsets, and εnεP is the vector residual errors with covariance ΣεP×P. In this setting, the output of all tasks, i.e. yn, is observed for every input xn. In general, it is understood that feature selection can be achieved by making the model matrix W sparse, for example in the context of a (relaxed) convex optimization framework or a Bayesian framework. An advantage of the Bayesian approach is that it enables the degree of sparsity to be learned from the data, and does not require a priori specification of the type of penalization. However, the dimensionality of the model matrix is large (WεP×D). Enforcing sparsity over this large matrix, while also leveraging correlations (both positive and negative) between tasks, is a difficult problem.