Supervised machine learning techniques are employed in connection with learning parameters of respective predictive models, wherein a predictive model is configured to perform a prediction upon receipt of input data as a function of its learned parameters. Exemplary predictive models include binary classifiers (such as spam detectors that receive an e-mail and indicate whether or not such e-mail is spam), multi-class classifiers, search engine ranking algorithms, clickthrough-rate predictors, etc.
Supervised machine learning techniques are increasingly being used as black-box systems by engineers, who expect such systems to output predictive models that produce high-accuracy predictions in an automated fashion. As noted above, in general, supervised machine learning techniques include the use of a learning algorithm, which utilizes a training/validation data set to learn parameters of a predictive model that cause performance of the predictive model to be optimized. The learning algorithm itself has parameters, which are referred to herein as hyper-parameters. Exemplary hyper-parameters can include a learning rate of the learning algorithm, a regularization coefficient of the learning algorithm, preprocessing options, structural properties of a predictive model that is to be learned (e.g., a maximum number of leaves in a regression tree), etc.
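The distinction between learned parameters and hyper-parameters can be illustrated with a minimal sketch. The following is not a method described herein; it assumes a one-dimensional linear model trained by gradient descent, and the names HyperParams, train, and validation_error, as well as the data values, are hypothetical and chosen only for illustration. The hyper-parameters (learning rate, regularization coefficient, number of passes) are fixed before training, while the parameters (w, b) are learned from the training data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperParams:
    """Settings of the learning algorithm itself (hypothetical names)."""
    learning_rate: float    # step size used by gradient descent
    l2_coefficient: float   # regularization coefficient applied to w
    epochs: int = 2000      # number of passes over the training data


def train(xs, ys, hp):
    """Learn the parameters (w, b) of the model y = w*x + b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(hp.epochs):
        # Gradient of mean squared error plus the L2 penalty on w.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_w += 2 * hp.l2_coefficient * w
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= hp.learning_rate * grad_w
        b -= hp.learning_rate * grad_b
    return w, b


def validation_error(params, xs, ys):
    """Mean squared error of the learned model on held-out data."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Illustrative data generated from y = 2x + 1 (an assumption of this sketch).
train_x, train_y = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
valid_x, valid_y = [4.0, 5.0], [9.0, 11.0]

# Same learning algorithm and data, two different hyper-parameter settings.
params_plain = train(train_x, train_y, HyperParams(0.05, 0.0))
params_heavy = train(train_x, train_y, HyperParams(0.05, 10.0))
error_plain = validation_error(params_plain, valid_x, valid_y)
error_heavy = validation_error(params_heavy, valid_x, valid_y)
```

In this sketch, changing only the regularization coefficient changes which parameters are learned and, in turn, the error on the validation data, which is why the choice of hyper-parameter values matters.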
Conventionally, values for hyper-parameters of a learning algorithm have been selected manually by an engineer as a function of experience of the engineer with the learning algorithm (e.g., the engineer may understand that a predictive model will operate satisfactorily if certain hyper-parameter values are used when training the predictive model) and/or repeated experiments on validation sets. In some instances, however, the hyper-parameter values chosen by the engineer may be significantly sub-optimal. Further, time constraints may prevent performance of a sufficient number of experiments over validation sets.
Other approaches have also been contemplated for determining hyper-parameter values for learning algorithms. One exemplary approach is a global optimization algorithm (e.g., a direct search derivative-free optimization algorithm), which can be employed in connection with identifying hyper-parameter values that cause a resultant predictive model to perform relatively well. Such direct search derivative-free optimization algorithms, however, fail to take into account that training data and validation data (and therefore values output by evaluation functions) may be stochastic (noisy) in nature. Such noise can be smoothed by averaging the results of multiple experiments by way of cross-validation or bootstrapping frameworks. Such averaging, however, results in much slower execution, since many experiments must be performed at each evaluation point.
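The trade-off between noise and cost noted above can be sketched as follows. This is an illustration under stated assumptions rather than any particular cross-validation or bootstrapping framework: the function noisy_validation_loss is a hypothetical stand-in for a single train/validate experiment, its underlying loss curve and Gaussian noise level are synthetic, and the names averaged_loss and spread are likewise hypothetical.

```python
import random
import statistics


def noisy_validation_loss(h, rng, noise_sd=0.5):
    """Hypothetical stand-in for one experiment at hyper-parameter value h."""
    true_loss = (h - 0.3) ** 2  # synthetic noise-free loss, minimized at h = 0.3
    return true_loss + rng.gauss(0.0, noise_sd)


def averaged_loss(h, rng, repeats):
    """Smooth the noise by averaging `repeats` independent experiments."""
    return statistics.fmean(
        [noisy_validation_loss(h, rng) for _ in range(repeats)]
    )


def spread(estimator, trials=2000):
    """Standard deviation of an estimator over repeated trials."""
    return statistics.pstdev([estimator() for _ in range(trials)])


rng = random.Random(0)  # fixed seed so the sketch is reproducible
h = 0.5                 # one evaluation point in hyper-parameter space
repeats = 25

single_sd = spread(lambda: noisy_validation_loss(h, rng))
averaged_sd = spread(lambda: averaged_loss(h, rng, repeats))

# Averaging reduces the spread by roughly sqrt(repeats), but each
# evaluation point now costs `repeats` experiments instead of one.
experiments_per_point_single = 1
experiments_per_point_averaged = repeats
```

Under these assumptions, averaging 25 experiments narrows the spread of the estimate by about a factor of five, at 25 times the cost per evaluation point.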