Rapid advances in hardware and software are the catalysts for tracking and storing large amounts of data. The commercial and economic benefits of gathering this data are clear; however, extracting useful information from storage systems that now hold terabytes of data can be problematic.
Machine learning, which allows a computer to learn using computational and statistical methods, is becoming an integral part of many database systems. Log-linear models, including the special cases of Markov random fields and logistic regression, are used in a variety of forms in machine learning. The parameters of such models are typically trained to minimize an objective function. To achieve high generalization accuracy, the parameters of a maximum-entropy classifier are regularized by optimizing an objective that is the sum of a loss term and a penalty term that favors lower-complexity models.
It is well known that regularization is necessary to achieve a model that generalizes well to unseen data, particularly if the number of parameters is very high relative to the amount of training data. One increasingly popular penalty function is the L1-norm of the parameters. The L1-regularizer has the added benefit of producing “sparse” models, in which most of the parameters are assigned the value zero and can therefore be removed from the model entirely.
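As an illustration of the objective described above, the following minimal sketch writes it as a loss term plus an L1 penalty term. It assumes binary logistic regression as the log-linear model; the function names and the `strength` parameter are hypothetical and chosen only for this example.

```python
import math

def logistic_loss(weights, examples):
    """Negative log-likelihood of binary logistic regression.

    `examples` is a list of (features, label) pairs with label in {0, 1}.
    This choice of loss is an assumption made for illustration.
    """
    total = 0.0
    for features, label in examples:
        score = sum(w * x for w, x in zip(weights, features))
        # z is the margin: positive when the model favors the true label.
        z = score if label == 1 else -score
        # log(1 + exp(-z)), evaluated in a numerically stable way.
        if z > 0:
            total += math.log1p(math.exp(-z))
        else:
            total += -z + math.log1p(math.exp(z))
    return total

def l1_penalty(weights, strength):
    """L1-norm of the parameters, scaled by a regularization strength."""
    return strength * sum(abs(w) for w in weights)

def objective(weights, examples, strength):
    """Regularized objective: loss term plus penalty term."""
    return logistic_loss(weights, examples) + l1_penalty(weights, strength)
```

With all weights at zero the penalty vanishes and each example contributes log 2 to the loss, so the penalty only begins to matter once parameters move away from zero.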
The L1-regularizer has several favorable properties compared to other regularizers, such as L2. In particular, it has been shown to be capable of learning good models when most features are irrelevant.
The sparsity produced by the L1-regularizer is a consequence of the fact that the first partial derivative of the penalty with respect to each variable is constant in magnitude as the variable moves toward zero, “pushing” the value all the way to zero, if possible. The L2-regularizer, by contrast, “pushes” a value less and less as it moves toward zero, producing parameters that are close to, but not exactly, zero. This same fact means that the L1-regularizer is not differentiable at zero, so the gradient of the objective does not exist there. Many algorithms for optimizing the objective depend on the existence of the gradient. Thus, it is more difficult to train models with the L1-regularizer than with the more typical L2-regularizer. In particular, the objective function cannot be minimized with general-purpose gradient-based optimization algorithms such as the L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno) quasi-Newton method, which has been shown to be superior at training large-scale L2-regularized log-linear models.
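The difference in how the two regularizers “push” a parameter can be seen in a one-dimensional sketch. The example below assumes a simple quadratic loss 0.5 * (w - target)^2 in place of the log-linear loss, because its minimizer has a closed form under either penalty; all function names and the `strength` parameter are hypothetical.

```python
def l1_push(w, strength):
    """Derivative of strength * |w|: constant in magnitude for w != 0."""
    if w == 0:
        raise ValueError("the L1 penalty is not differentiable at zero")
    return strength if w > 0 else -strength

def l2_push(w, strength):
    """Derivative of strength * w**2: shrinks as w approaches zero."""
    return 2.0 * strength * w

def minimize_with_l1(target, strength):
    """Exact minimizer of 0.5*(w - target)**2 + strength*|w|.

    This is the soft-thresholding operator: any target whose magnitude
    is at most `strength` is mapped to exactly zero.
    """
    magnitude = max(abs(target) - strength, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if target > 0 else -magnitude

def minimize_with_l2(target, strength):
    """Exact minimizer of 0.5*(w - target)**2 + strength*w**2.

    The result is a rescaled target, nonzero whenever target is nonzero.
    """
    return target / (1.0 + 2.0 * strength)

if __name__ == "__main__":
    for target in (2.0, 0.3, -0.1):
        print(target, minimize_with_l1(target, 0.5), minimize_with_l2(target, 0.5))
```

Under these assumptions the L1 solution is exactly zero whenever the unregularized optimum is small relative to the penalty strength, whereas the L2 solution only shrinks toward zero. The error raised by `l1_push` at zero marks the point where gradient-based methods lose the derivative they rely on.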