Generalized linear models (GLMs), and in particular the GLM subclasses linear regression and logistic regression, are an important family of statistical models. A GLM is constructed for a dataset that includes input attributes and a target attribute that is the subject of the modeling process. GLM extends the methods of ordinary linear regression to target attributes that are not necessarily normally distributed with constant variance over their range, such as counts or membership in a category. The expected value of the target attribute is connected to a linear response via a link function, and the variance can be specified as a function of the predicted mean. The datasets can be large and include many input attributes. In addition, candidate features can be constructed from the input attributes and used alongside them by the modeling process to predict the target attribute. Features are functions of the input attributes, such as products and powers of input attributes. GLM has broad application as both a descriptive and a predictive tool across many fields, including epidemiology, finance, economics, marketing, and environmental science. The wide applicability of GLM is due to its simplicity and interpretability, supported by a well-studied and widely used set of diagnostics.
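The relationship between the linear response, the link function, and the variance function can be illustrated with a minimal Python sketch. The names FAMILIES and predict_mean are hypothetical and chosen only for illustration; the three families shown (identity link for linear regression, logit link for logistic regression, log link for count targets) are standard textbook pairings, not a description of any particular implementation.

```python
import math

# Hypothetical table: each entry pairs a link g(mu) with its inverse and the
# variance function V(mu) of the matching distribution family.
FAMILIES = {
    # Linear regression: identity link, constant variance.
    "gaussian": {
        "link": lambda mu: mu,
        "inverse_link": lambda eta: eta,
        "variance": lambda mu: 1.0,
    },
    # Logistic regression: logit link, variance mu * (1 - mu).
    "binomial": {
        "link": lambda mu: math.log(mu / (1.0 - mu)),
        "inverse_link": lambda eta: 1.0 / (1.0 + math.exp(-eta)),
        "variance": lambda mu: mu * (1.0 - mu),
    },
    # Count targets: log link, variance equal to the mean.
    "poisson": {
        "link": lambda mu: math.log(mu),
        "inverse_link": lambda eta: math.exp(eta),
        "variance": lambda mu: mu,
    },
}


def predict_mean(family, coefficients, intercept, attributes):
    """Map input attributes to a predicted mean via the linear response."""
    # Linear response: intercept plus the weighted sum of the attributes.
    eta = intercept + sum(b * x for b, x in zip(coefficients, attributes))
    # The inverse link converts the linear response into a predicted mean.
    return FAMILIES[family]["inverse_link"](eta)
```

In this sketch, only the inverse link and the variance function change between families; the linear response is computed the same way in every case.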
As the size of the dataset being modeled increases, GLM suffers significant drawbacks. GLM, in its standard form, is computationally intensive, with approximately cubic scaling in the number of attributes. The number of possible multi-attribute combinations explodes as the number of attributes increases. For example, two hundred input attributes yield on the order of 40,000 (200 squared) pair-wise candidate features and eight million (200 cubed) triplet combination features. Multicollinearity, in which two or more attributes are highly correlated with one another, causes numerical instability in the GLM. In the absence of specific efforts to avoid it, the likelihood of encountering multicollinearity increases with the number of attributes. Furthermore, the interpretability of the GLM declines as the number of attributes increases.
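The arithmetic behind these counts, and the way correlation degrades numerical conditioning, can be checked with a short Python sketch. The function names are hypothetical. The sketch assumes that the figures quoted above count ordered combinations with repetition (n to the power of the order); it also reports the smaller count of distinct unordered combinations for comparison. The conditioning example is limited to the simplest case of two standardized attributes with correlation r.

```python
import math


def candidate_feature_counts(n_attributes, order):
    """Count candidate product features of a given order.

    Returns (ordered, unordered): ordered tuples with repetition
    (n ** order) and distinct unordered combinations without
    repetition (n choose order).
    """
    return n_attributes ** order, math.comb(n_attributes, order)


def correlation_condition_number(r):
    """Condition number of the 2x2 correlation matrix [[1, r], [r, 1]].

    Its eigenvalues are 1 + r and 1 - r, so the ratio of largest to
    smallest grows without bound as |r| approaches 1.
    """
    return (1.0 + abs(r)) / (1.0 - abs(r))


pairs = candidate_feature_counts(200, 2)     # (40000, 19900)
triplets = candidate_feature_counts(200, 3)  # (8000000, 1313400)
```

Under either counting convention the number of candidates grows polynomially in the number of attributes with the order as the exponent, and a correlation of 0.99 between two attributes already gives a condition number of roughly 199, compared with 1 for uncorrelated attributes.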