1. Field of Invention
The present embodiment relates generally to the field of statistics and mathematical modeling employing multivariate regression and, more specifically, to using a genetic algorithm to construct the independent-variable composition of multivariate regression models while optimizing one or more objectives. The objectives of these models include, but are not limited to, explanation, prediction, and response measurement.
2. Prior Art
Mathematical multivariate (or multi-variable) regression analysis is employed as an analytic tool for a number of reasons. One is the need to develop an estimate of a functional relationship that can be used for prediction or forecasting. Another motivation for multivariate regression may be to estimate rates of change of the response with respect to particular regressor variables, i.e., estimates of regression coefficients. A further reason is explanatory: to extract meaning from the data.
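As an illustrative aside, and not part of any claimed method, a multivariate regression fit of the kind described above can be sketched with ordinary least squares; the data, true coefficients, and noise level below are invented purely for the example:

```python
import numpy as np

# Synthetic data: a response y driven by two regressor variables plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Ordinary least squares: append an intercept column and solve.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # approximately [3.0, 1.5, -2.0]
```

The estimated coefficients recover the rates of change of the response with respect to each regressor, which is the second motivation noted above.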
The primary challenge in building MultiVariate Regression models, also referred to herein as MVR models, is to determine which regressor variables truly influence the response output, i.e., which variables are truly relevant. The problem arises from uncertainty over which variables to include in the model and in what combination. The decision can be further complicated by the existence of multicollinearity, or perhaps by the scientist's prior views and prejudices regarding the importance of individual variables. Assumptions are made regarding the correctness of a postulated model when we are really trying to find the best approximation that describes the data. Using the traditional method, a successful model builder eventually learns that with many data sets, several models can be fit that appear nearly equal in effectiveness. Thus, the problem one deals with is the selection of one model from a pool of candidate models. Unfortunately, a human modeler can never be certain which one among all the candidate models found thus far represents the global optimum, if any does at all. This is because many of the candidate ‘best’ models are optimal only with respect to a subset of the variable set, i.e., locally optimal.
The appropriateness of a regressor variable often depends on which other regressor variables are in the model with it. Some combinations of variables may in fact cause adverse results, possibly from multicollinearity or other noise, corrupting the explanatory power of the model. Thus a full-scale variable screening cannot be accomplished effectively by using stepwise sequential F-test or partial F-test methods, although they are much quicker to evaluate and utilize fewer compute resources.
One would logically think that the variable evaluation should go through the full permutation of all candidate variables. This is certainly true, but it is not practical, since the cost of computing all permutations can be far beyond the constraints of available resources, making it nearly impossible to accomplish. Those constraints may include, but are not limited to, the time required to evaluate all possible permutations and the computing technology available. For example, to evaluate all permutations of n independent variables in a multivariate regression model, we must compute 2^n model tests. Therefore, to evaluate 45 independent variables requires us to compute 2^45 scenarios, or 35,184,372,088,832. That is over 35 trillion scenarios to be computed, and the cost could be further increased by the size of the statistical data that must be processed per scenario. Imagine how long it would take to test the scenarios on 10 years of time-series data per scenario. So it is obvious why the stepwise method, with its quicker evaluation time than full permutation, has become the standard methodology for building MVR models. Unfortunately, the stepwise method has severe drawbacks associated with the sequential F-test, the partial F-test, and human prejudices. Hence, the present embodiment uses a genetic algorithm instead.
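The arithmetic of the combinatorial explosion described above can be checked directly; the short sketch below simply tabulates 2^n for a few variable counts:

```python
# Number of variable subsets (model tests) grows as 2**n with n variables.
for n in (10, 20, 45):
    print(n, 2 ** n)

# 45 variables yield 2**45 = 35,184,372,088,832 scenarios,
# i.e., over 35 trillion model tests for an exhaustive search.
```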
In the stepwise sequential method, the valuation of adding a regressor variable into the model is fully dependent on the sequence in which the variable is added to the combination. A sequential test of regressor variables starting from the variable combination (A, B) may not give the same end result as a test sequence starting from the variable combination (C, D) and onward. One would hope that both sequential paths end up with the identical variable combination (C, F, H, I, K, R, S, U) as the optimal model, but in reality they most often do not. This problem occurs in all types of stepwise variable evaluation, such as forward and backward evaluation.
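For illustration only, the greedy forward variant of stepwise selection can be sketched as below. The helper names and the stopping rule (stop when the best addition improves the residual sum of squares by less than 1%) are assumptions made for the example, not part of any claimed method; the seed-dependence of the result is exactly the order-dependence problem described above.

```python
import numpy as np

def sse(X, y, cols):
    """Residual sum of squares of an OLS fit using the given column indices."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

def forward_stepwise(X, y, start):
    """Greedy forward selection seeded with an initial variable set.

    At each step, add the candidate that most reduces the SSE; different
    `start` seeds can terminate at different variable combinations.
    """
    selected = list(start)
    while True:
        candidates = [c for c in range(X.shape[1]) if c not in selected]
        if not candidates:
            break
        best = min(candidates, key=lambda c: sse(X, y, selected + [c]))
        # Stop when the best addition no longer improves the fit by >= 1%.
        if sse(X, y, selected + [best]) >= 0.99 * sse(X, y, selected):
            break
        selected.append(best)
    return sorted(selected)
```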
In partial testing, one immediately encounters the problem of how to segment the full list of potential variables, such as how many variables should be tested at any given time. Furthermore, partial testing of variables may yield only a locally optimal result. For example, partial testing of 12 potential variables may produce models with the combinations (A, B, C, H, S, R) and (A, H, S, X, Y, Z) as the pool of optimal models, which do not come close to resembling the true global optimum of (C, F, H, I, K, R, S, U). This underscores the problem of presuming a 6-variable combination to be optimal, as opposed to an 8-variable combination, when testing partial variable sets.
Another significant challenge in multivariate regression modeling is that the data represent dynamic events, rendering a static model functionally useless in its explanatory power to describe what is currently happening as opposed to what was happening long in the past. A static model may become over-specified or under-specified at different times in the future as the dynamics of the data shift, even though it was properly specified at the time it was built. This is also known as prediction bias, and it arises because the determination of the final model is uniquely tied to the observation data at hand at the time the final model was built. Referring back to the 45-variable evaluation example, we could not finish the evaluation before the market dynamics shifted again. To put it another way, the challenge of building predictive models with multivariate regression is to find the combination of variables that has the most explanatory power while remaining current in its predictive accuracy within the given resources. We wish to obtain an MVR model that is adaptively accurate at a reasonable cost in resources.
Genetic algorithms belong to the class of probabilistic algorithms, but they differ from purely random algorithms because they combine directed and stochastic search. They maintain a population of potential solutions, while most other search algorithms maintain only a single point in the search space. Single-point search algorithms inherently risk reaching a local optimum and prematurely declaring it the global optimum. A genetic algorithm performs a multi-directional, non-linear search by maintaining a population of potential solutions and encouraging the formation and exchange of information among these directions. The population undergoes a simulated evolution: in each generation, the relatively “good” solutions reproduce, while the relatively “bad” solutions die. Genetic algorithms are search algorithms based on natural selection and genetics. They combine the concept of survival of the fittest with a randomized exchange of information. In each genetic algorithm generation there is a population composed of a plurality of genomes. These genomes can be seen as potential solutions to the problem being solved. In each successive generation, a new set of genomes is created using portions of the fittest genomes of the previous generation. However, randomized new information is also occasionally introduced so that important data are not lost or overlooked.
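A minimal sketch of such a genetic algorithm applied to regressor selection is shown below, assuming binary genomes (one bit per candidate variable), an AIC-like penalized fit score as the fitness measure, survival of the fittest half of the population, single-point crossover, and occasional bit-flip mutation. All function names and parameter values are illustrative assumptions, not the claimed embodiment:

```python
import numpy as np

def fitness(X, y, mask):
    """AIC-like score: log-SSE fit term plus a penalty per included variable."""
    cols = np.flatnonzero(mask)
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    m = len(y)
    return m * np.log(float(r @ r) / m) + 2 * (len(cols) + 1)

def evolve(X, y, pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Evolve binary genomes (one bit per regressor); return the fittest."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))
    for _ in range(generations):
        scores = np.array([fitness(X, y, g) for g in pop])
        order = np.argsort(scores)             # fittest (lowest score) first
        parents = pop[order[: pop_size // 2]]  # "good" solutions survive
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)           # single-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n) < p_mut       # occasional random mutation
            children.append(np.where(flip, 1 - child, child))
        pop = np.vstack([parents, children])
    scores = np.array([fitness(X, y, g) for g in pop])
    return pop[np.argmin(scores)]
```

Because the whole population evolves at once, the search explores many variable combinations in parallel rather than committing to a single sequential path, which is the property the stepwise and partial methods above lack.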