The present disclosure generally relates to the use of regression in machine learning and data mining, and particularly to “variable selection,” in which constructs are based upon a training data set.
The broad goal of supervised learning is to effectively learn unknown functional dependencies between a set of input variables and a set of output variables, given a collection of training examples. The present application recognizes a potential synergism between two topics that arise in this context. The mention of these two topics here is made for the purpose of US law relating to information disclosure and does not constitute an admission that combination of any documents listed herein was known or obvious prior to such recognition by Applicants.
The first topic is Multivariate Regression (Ildiko E. Frank and Jerome H. Friedman, “A statistical view of some chemometrics regression tools,” Technometrics, 35(2):109-135, 1993; Leo Breiman and Jerome H. Friedman, “Predicting multivariate responses in multiple linear regression,” Journal of the Royal Statistical Society: Series B, (1):1369-7412, 1997; Ming Yuan, Ali Ekici, Zhaosong Lu, and Renato Monteiro, “Dimension reduction and coefficient estimation in multivariate linear regression,” Journal of the Royal Statistical Society: Series B, 2007), which generalizes basic single-output regression to settings involving multiple output variables with potentially significant correlations between them. Applications of multivariate regression models include chemometrics, econometrics and computational biology.
Multivariate Regression may be viewed as a basis for many techniques in machine learning, such as multi-task learning (Charles A. Micchelli and Massimiliano Pontil, “Kernels for multi-task learning,” NIPS, 2004; Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil, “Convex multi-task feature learning,” Machine Learning, 73(3):243-272, 2008) and structured output prediction (Elisa Ricci, Tijl De Bie, and Nello Cristianini, “Magic moments for structured output prediction,” Journal of Machine Learning Research, 9:2803-2846, December 2008; T. Joachims, “Structured output prediction with support vector machines,” Joint IAPR International Workshops on Structural and Syntactic Pattern Recognition (SSPR) and Statistical Techniques in Pattern Recognition (SPR), pages 1-7, 2006; I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, “Support vector machine learning for interdependent and structured output spaces,” International Conference on Machine Learning (ICML), pages 104-112, 2004).
These techniques are output-centric in the sense that they attempt to exploit dependencies between output variables to jointly learn models that generalize better than those learned by treating outputs independently.
The second topic encompasses sparsity, variable selection and the broader notion of regularization. The view here is input-centric in the following specific sense. In very high dimensional problems, where the number of input variables may exceed the number of examples, the only hope for avoiding overfitting is via some form of “capacity control” over the family of dependencies being explored by the learning algorithm. This capacity control may be implemented in various ways, e.g., via dimensionality reduction, input variable selection or regularized risk minimization. Estimation of sparse models that are supported on a small set of input variables is a strand of research in machine learning. It encompasses ℓ1 regularization (e.g., the lasso algorithm of R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, Series B, 58:267-288, 1994) and matching pursuit techniques (S. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Transactions on Signal Processing, 1993), which come with theoretical guarantees on the recovery of the exact support under certain conditions. Particularly pertinent to this invention is the notion of structured sparsity. In many problems involving very high-dimensional datasets, it is desirable to enforce the prior knowledge that the support of the model should be a union over domain-specific groups of features. Several methods have been recently proposed for this setting. For instance, M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” Journal of the Royal Statistical Society, Series B, 68:49-67, 2006, and P. Zhao, G. Rocha, and B. Yu, “Grouped and hierarchical model selection through composite absolute penalties,” Technical report, 2006, extend the Lasso formulation to this context, while the methods of A. C. Lozano, G. Swirszcz, and N. Abe, “Grouped orthogonal matching pursuit for variable selection and prediction,” Advances in Neural Information Processing Systems 22, 2009, and J. Huang, T. Zhang, and D. Metaxas, “Learning with structured sparsity,” Proceedings of the 26th Annual International Conference on Machine Learning, 2009, extend matching pursuit techniques.
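By way of illustration only, the greedy selection idea underlying the matching pursuit family may be sketched as follows. This is a minimal sketch of plain, single-output matching pursuit. It is not the grouped or structured variants cited above, and it is not the method of the present disclosure. The function name `matching_pursuit` and its parameters `columns`, `y` and `n_steps` are hypothetical and chosen for this example.

```python
def matching_pursuit(columns, y, n_steps):
    """Greedy sparse approximation of y by a few of the given columns.

    columns: list of input variables, each a list of n sample values.
    y: list of n output values.
    Returns a dict {column index: coefficient} for the selected columns.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    residual = list(y)
    coef = {}
    for _ in range(n_steps):
        # Pick the column with the largest normalized correlation
        # with the current residual.
        best_j, best_score = None, 0.0
        for j, col in enumerate(columns):
            norm_sq = dot(col, col)
            if norm_sq == 0.0:
                continue  # skip all-zero columns
            score = abs(dot(col, residual)) / norm_sq ** 0.5
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            break  # residual is orthogonal to every column
        col = columns[best_j]
        # Project the residual onto the chosen column and subtract.
        step = dot(col, residual) / dot(col, col)
        coef[best_j] = coef.get(best_j, 0.0) + step
        residual = [r - step * c for r, c in zip(residual, col)]
    return coef
```

The keys of the returned dictionary form the estimated support, i.e., the selected input variables. Each step adds at most one variable, so `n_steps` acts as a simple form of capacity control.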
The present disclosure treats very high dimensional problems involving a large number of output variables. It is desirable to address sparsity via input variable selection in multivariate linear models with regularization, since the number of parameters grows not only with the data dimensionality but also with the number of outputs.
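The growth in the number of parameters may be made concrete with a generic multivariate linear model. The notation below (n examples, p input variables, K output variables, coefficient matrix B) is assumed here for illustration and is not taken from the text above.

```latex
Y = X B + E, \qquad
Y \in \mathbb{R}^{n \times K}, \quad
X \in \mathbb{R}^{n \times p}, \quad
B \in \mathbb{R}^{p \times K}, \quad
E \in \mathbb{R}^{n \times K}
```

Under this notation the model has pK free coefficients. One natural reading of input variable selection in this setting is that an input variable j is excluded when the entire j-th row of B is zero, so that selecting s of the p input variables leaves at most sK non-zero coefficients.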