Multiple linear regression models are commonly used to analyze the relationship between one target variable (Y) and a set of predictor variables (X). Numerous techniques, such as forward selection, backward elimination, and forward stepwise selection, have been proposed to select, from a large set of k predictors, those predictors that influence the target more than the others.
A predictor may be described as a field that predicts or influences a target in a predictive regression model. A target may be described as a field that is predicted or influenced by one or more predictors in a regression model.
One way of finding the best regression is to fit all 2^k candidate regression models, one for each subset of the k predictors, and compare them on a selected criterion, such as adjusted R-squared. This technique is also called “exhaustive search”. When k is large, it may not be practical to carry out all possible regressions because the computing time grows exponentially with k. Efforts to improve performance have followed roughly two paths: (1) utilizing sequential strategies for moving from one regression model to another regression model; and (2) utilizing parallel computing strategies to distribute the intensive computation.
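To make the cost of exhaustive search concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not the method of any particular system: the function names (`solve`, `adjusted_r2`, `exhaustive_search`) are hypothetical, every model includes an intercept, the predictor columns are assumed not to be collinear, and each fit solves the normal equations directly rather than using a numerically sturdier factorization.

```python
from itertools import combinations


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is non-singular (no collinear predictors).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def adjusted_r2(rows, y, subset):
    """Fit y on an intercept plus the chosen columns; return adjusted R-squared."""
    n, p = len(y), len(subset)
    design = [[1.0] + [row[j] for j in subset] for row in rows]
    d = p + 1
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(d)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    mean_y = sum(y) / n
    sse = sum((t - f) ** 2 for t, f in zip(y, fitted))
    sst = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - (sse / (n - p - 1)) / (sst / (n - 1))


def exhaustive_search(rows, y):
    """Score all 2^k predictor subsets and return (best subset, all scores)."""
    k = len(rows[0])
    scores = {}
    for size in range(k + 1):
        for subset in combinations(range(k), size):
            scores[subset] = adjusted_r2(rows, y, subset)
    best = max(scores, key=scores.get)
    return best, scores
```

The `scores` dictionary holds one entry per subset, so its size doubles with every added predictor: 8 models for k = 3, but more than a million for k = 20. That growth is what motivates the sequential and parallel strategies mentioned above.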
The map-reduce framework has become a popular paradigm because it can handle petabytes of data in distributed data sources, which are increasingly common in the internet era. The framework enables applications to work with thousands of nodes in distributed clusters. A typical map-reduce job uses multiple mappers to perform computation on different data splits/blocks and one or more reducers to merge the mapper results so that the final results/statistics are based on the whole data.
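The mapper/reducer division of labor can be sketched in a few lines of Python. This is a single-process simulation, not code for any real map-reduce framework, and the names `mapper`, `reducer`, and `run_job` are hypothetical. As an assumed example of a mergeable statistic, each mapper summarizes its split as a row count and a cross-product matrix over the vector [1, x..., y]; because such sums are additive, the reducer only needs to add the partial results element by element.

```python
from functools import reduce


def mapper(split):
    """Summarize one data split as (row count, cross-product matrix).

    Each record is a pair (x, y), where x is a list of predictor values.
    The matrix accumulates v * v' for v = [1, x..., y].
    """
    width = len(split[0][0]) + 2
    cross = [[0.0] * width for _ in range(width)]
    for x, y in split:
        v = [1.0, *x, y]
        for i in range(width):
            for j in range(width):
                cross[i][j] += v[i] * v[j]
    return len(split), cross


def reducer(left, right):
    """Merge two mapper outputs by adding counts and matrices element-wise."""
    n1, c1 = left
    n2, c2 = right
    merged = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
    return n1 + n2, merged


def run_job(splits):
    """Simulate a job: map every split, then fold the results with the reducer."""
    return reduce(reducer, map(mapper, splits))
```

Under these assumptions, the merged matrix is the same one a single pass over the whole data would produce, which is the sense in which the final statistics are "based on the whole data". In a real cluster the mappers would run on separate nodes and only the small matrices, not the raw records, would travel to the reducer.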