Deep neural network models may contain millions of parameters that extract hierarchies of features from data, enabling them to learn from a large amount of data compared to earlier shallow networks. However, deep neural networks often suffer from overfitting and a lack of generalization due to their large capacity. This may result from learning stages throughout the model training process. Due to the nature of deep neural networks, models may learn based on (i) connecting input and output labels by extracting predictive features from the input; (ii) statistics associated with output labels (e.g., likelihood of the output itself); and (iii) connecting non-predictive features in the input to output labels. It is desirable that models focus on the predictive features of (i) and avoid learning from non-predictive aspects (ii) and (iii). Structuring model training processes so the model learns in this way has proven difficult, as deep neural networks typically maximize the conditional probability P(y|x) of the output (y) given input features (x), instead of maximizing mutual information, P(y|x)/P(y) between the output and input.
Stochastic gradient boosting has been used in machine learning to combine the capacity of multiple shallow or weak learners to form a deep or strong learner. A data set may be split among multiple weak learners, and weak models may specialize on fractions of the data set. Application of stochastic gradient boosting to an ensemble of decision trees is described in J. Friedman, “Greedy Function Approximation: A Gradient Boosting Machine,” The Annals of Statistics, Vol. 29, No. 5, 2011, which is incorporated herein by reference. But stochastic gradient boosting has been considered infeasible for application to training deep neural networks. It has been observed that application of Friedman's stochastic gradient boosting to deep neural network training often led to training instability. See, e.g., Philip M. Long, et al, “Random Classification Noise Defeats All Convex Potential Boosters,” in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Since deep neural networks are strong learners by design, model gradients are generally not boosted during computation as it has been seen as computationally prohibitive. And other gradient descent-based boosting algorithms suffer from a labelling noise problem that hinders model training.
Aspects described herein may address these and other problems, and generally improve the quality, efficiency, and speed of machine learning systems by offering improved model training through regularizing model training, improving network generalization, and abating the deleterious effect of class imbalance on model performance.