Business analysts often want to know which factors (e.g., categorical predictors) impact a target variable of interest and by how much. A target variable may be described as a field that is predicted or influenced by one or more of the factors in a model. A categorical predictor may be described as a field that has a finite number of nominal or ordinal categories as values.
A linear regression model may be used to answer such questions from business analysts. Furthermore, in many business scenarios, the interaction between factors may be relevant.
An Analysis of Variance (ANOVA) technique works in linear regression models, which assume that the target variable follows a normal distribution and that a linear relationship exists between the target variable and the factors. However, the ANOVA technique is not applicable in more general models.
As an example, a software company wants to determine which characteristics of customers affect their decision to buy or not to buy a product. For this example, a logistic regression model is more appropriate because the target variable (buy or not buy a product) is binary and therefore follows a Bernoulli distribution, and because the mean of the target variable must lie between 0 and 1. Accordingly, a function of the target variable mean, which is called a “logit link function”, is assumed to be linearly related to the factors.
As another example, when a car insurance company wants to analyze which factors contribute the most to the size of customers' claims, a gamma regression may be fit to the damage claims for cars. A gamma regression is more appropriate for the analysis of positive-valued data because it uses a gamma distribution and an inverse link function to relate the mean of the target variable to a linear combination of the factors.
In a further example, a shipping company is concerned about damage to cargo ships caused by waves and wants to determine which factors (such as ship type, year of construction, etc.) are associated with ships being more prone to damage. In this case, the incident counts are modeled as occurring at a Poisson rate, and a log-linear model (with a Poisson distribution and a log link function) is used.
Many such general models belong to the family of so-called “generalized linear models”. The generalized linear model expands the linear regression model so that the target variable is linearly related to the predictors via a specified link function. Moreover, the generalized linear model allows the target variable to have a non-normal distribution.
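The role of the link function in the three examples above may be illustrated with a minimal Python sketch. The dictionary `LINKS`, the function `predicted_mean`, and all coefficient values are hypothetical and chosen only for illustration; they are not taken from any particular model or library.

```python
import math

# Each entry pairs a link function g (mean -> linear predictor) with its
# inverse (linear predictor -> mean). Names and layout are illustrative.
LINKS = {
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),   # Bernoulli target
    "inverse": (lambda mu: 1.0 / mu,
                lambda eta: 1.0 / eta),                    # gamma target
    "log": (math.log, math.exp),                           # Poisson target
}


def predicted_mean(link_name, coefficients, predictors):
    """Return the target mean implied by the linear predictor under a link."""
    eta = sum(b * x for b, x in zip(coefficients, predictors))
    _, inverse_link = LINKS[link_name]
    return inverse_link(eta)


# Hypothetical inputs: an intercept term plus one indicator-coded factor level.
predictors = [1.0, 1.0]

# Logit link: the mean is a probability between 0 and 1.
purchase_probability = predicted_mean("logit", [-1.0, 0.5], predictors)

# Inverse link: the mean is positive when the linear predictor is positive.
mean_claim_size = predicted_mean("inverse", [0.002, 0.001], predictors)

# Log link: the mean is a positive expected count.
expected_incident_count = predicted_mean("log", [-1.0, 0.5], predictors)
```

In each case the linear combination of the factors is computed in the same way; only the link function that maps it to the mean of the target variable differs.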
Because the ANOVA technique is not applicable in generalized linear models, a likelihood ratio test may be used to detect an interaction. The likelihood ratio test compares log-likelihood values between the full and reduced generalized linear models. For a two-way interaction, the full model includes two factors (also called “main effects”) and an interaction effect, while the reduced model includes the two main effects only (without an interaction effect). Computation of the log-likelihood value in the reduced model is an iterative process and requires many data passes.
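As a minimal sketch of such a likelihood ratio test, the following Python code fits a full and a reduced logistic regression model by Newton-Raphson iteration and compares their log-likelihood values. Several assumptions are made for illustration only: the two factors each have two levels, the purchase counts in `cells` are invented, the helper names (`solve`, `fit_logistic`, and so on) are hypothetical, and the test statistic is referred to a chi-square distribution with one degree of freedom. For simplicity the sketch also fits the full model iteratively.

```python
import math


def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fitted_probs(design, beta):
    """Apply the inverse logit link to each row's linear predictor."""
    return [1.0 / (1.0 + math.exp(-sum(x * b for x, b in zip(row, beta))))
            for row in design]


def log_likelihood(cells, probs):
    """Bernoulli log-likelihood over cells of (successes, trials)."""
    return sum(s * math.log(p) + (n - s) * math.log(1.0 - p)
               for (s, n), p in zip(cells, probs))


def fit_logistic(design, cells, tol=1e-10, max_iter=50):
    """Newton-Raphson fit; each iteration is one pass over the data."""
    k = len(design[0])
    beta = [0.0] * k
    for passes in range(1, max_iter + 1):
        probs = fitted_probs(design, beta)
        rows = list(zip(design, cells, probs))
        grad = [sum(x[j] * (s - n * p) for x, (s, n), p in rows)
                for j in range(k)]
        hess = [[sum(x[i] * x[j] * n * p * (1.0 - p) for x, (s, n), p in rows)
                 for j in range(k)] for i in range(k)]
        step = solve(hess, grad)
        beta = [b + d for b, d in zip(beta, step)]
        if max(abs(d) for d in step) < tol:
            break
    return log_likelihood(cells, fitted_probs(design, beta)), passes


# Hypothetical data: (buyers, customers) for each combination of two factors.
levels = [(0, 0), (0, 1), (1, 0), (1, 1)]
cells = [(10, 50), (20, 50), (15, 50), (40, 50)]

reduced_design = [[1.0, a, b] for a, b in levels]        # main effects only
full_design = [[1.0, a, b, a * b] for a, b in levels]    # plus interaction

ll_full, passes_full = fit_logistic(full_design, cells)
ll_reduced, passes_reduced = fit_logistic(reduced_design, cells)

# Likelihood ratio statistic and its p-value for one degree of freedom.
lr_statistic = max(2.0 * (ll_full - ll_reduced), 0.0)
p_value = math.erfc(math.sqrt(lr_statistic / 2.0))
```

The values `passes_full` and `passes_reduced` count the Newton-Raphson iterations, each of which reads every cell once. With raw rather than aggregated data, each iteration would be a full pass over the records, which is the cost noted above.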